Smashing Molecules How Molecular Fragments Allow us to Explore Large Chemical Spaces Rajarshi Guha & Trung Nguyen NIH Center for Translational Therapeutics Chemaxon UGM September 2011

Page 1: Smashing Molecules - ChemAxon

Smashing Molecules: How Molecular Fragments Allow Us to Explore Large Chemical Spaces

Rajarshi Guha & Trung Nguyen

NIH Center for Translational Therapeutics

Chemaxon UGM

September 2011

Page 2: Smashing Molecules - ChemAxon

Outline

• Fragments as the building blocks of chemistry

• Fragments and SAR

• Fragments and activity profiles

Page 3: Smashing Molecules - ChemAxon

Big Data for Some Problems

• Halevy et al. discuss the effectiveness of extremely large datasets

• Their application focuses on machine translation (see the Google n-gram corpus)

• They suggest that such extremely large datasets are useful because they effectively encompass all commonly used n-grams (phrases)

• The domain is relatively constrained

Halevy et al, IEEE Intelligent Systems, 2009, 24, 8-12

Page 4: Smashing Molecules - ChemAxon

Google Scale in Chemistry?

• What would be the equivalent of an n-gram corpus in chemistry?

– Fragments

– A more direct analogy can be made by using LINGOs

• It is possible to generate arbitrarily large (virtual) compound and fragment collections

• But would such a collection span all of “commonly used” chemistry?

– Depending on the initial compound set, yes

– But we’re also interested in going beyond such a “commonly used” set

Fink T, Reymond JL, J Chem Inf Model, 2007, 47, 342
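
To make the n-gram analogy concrete, here is a minimal sketch of LINGO-style substrings: overlapping q-character windows over a SMILES string. The choice q = 4, the replacement of ring-closure digits with '0', and the multiset Tanimoto comparison are assumptions made for illustration; the slides do not specify any of them, and the crude digit replacement also touches digits inside bracket atoms.

```python
from collections import Counter
import re


def lingos(smiles: str, q: int = 4) -> Counter:
    """Return the multiset of overlapping q-character substrings of a SMILES string."""
    # Assumption: every digit is collapsed to '0' so that ring numbering does
    # not make otherwise identical substrings look different.
    normalized = re.sub(r"\d", "0", smiles)
    return Counter(normalized[i:i + q] for i in range(len(normalized) - q + 1))


def lingo_tanimoto(a: Counter, b: Counter) -> float:
    """Multiset Tanimoto: shared substring counts over the union of counts."""
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 0.0


aspirin = lingos("CC(=O)Oc1ccccc1C(=O)O")
salicylic_acid = lingos("OC(=O)c1ccccc1O")
print(sorted(aspirin)[:5])
print(round(lingo_tanimoto(aspirin, salicylic_acid), 3))
```

Because the substrings come straight from the SMILES text, the result depends on how the SMILES was written; in practice one would canonicalize the strings first.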

Page 5: Smashing Molecules - ChemAxon

What Do We Do with Fragments?

• Assuming we obtain fragments from a large enough collection, what do we do with them?

– Learn from fragments: QSARs, generative models

– Use fragments as filters, as an alternative to clustering

– Explore chemotypes and activity

– Assess scaffold-level promiscuity

White, D and Wilson, RC, J Chem Inf Model, 2010, 50, 1257-1274

Page 6: Smashing Molecules - ChemAxon

Scaffold Activity Diagrams

• Network-oriented view of fragment (scaffold) collections

– Similar in idea to Scaffold Hunter etc.

– Not purely hierarchical

• Color by arbitrary properties

• Quickly assess utility of a scaffold

• Try it online

Page 7: Smashing Molecules - ChemAxon

What Makes a Good Scaffold?

• What makes a good scaffold?

– Size, complexity, …

– Do the members represent an SAR or not?

– Intuition and experience also play a role

Page 8: Smashing Molecules - ChemAxon

Scaffold QSAR

Evaluate topological and physicochemical descriptors for the R-groups

Fit PLS or ridge regression model

Characterize the SAR landscape
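
The fitting step above can be sketched as a small, dependency-free ridge regression. This is only an illustration under stated assumptions: the descriptor values, activities, and penalty `lam` are hypothetical, and the slides do not say how the descriptors were scaled or how the penalty was chosen.

```python
def solve(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ridge_fit(X, y, lam=1.0):
    """Fit y ~ X·w + b by minimizing squared error plus lam * |w|^2."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered descriptors
    yc = [v - y_mean for v in y]                                # centered activities
    gram = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(p)]
    w = solve(gram, rhs)
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept


# Hypothetical R-group descriptors (rows: compounds) and activities.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y = [1.0, 2.0]
weights, intercept = ridge_fit(X, y, lam=1.0)
print(weights, intercept)
```

With `lam > 0` the penalized system stays solvable even when there are more descriptors than compounds, as in the two-compound, three-descriptor example.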

Page 9: Smashing Molecules - ChemAxon

Scaffold QSAR - Drawbacks

• Many scaffolds have few (5 to 10) members

• Invariably, more features than observations

• If the number of R-groups is large, the feature matrix can be very sparse

– Less of a problem for combinatorial libraries

• A linear fit may not be the best approach to correlating R-groups to the activities

– Difficult to choose a model type a priori

Page 10: Smashing Molecules - ChemAxon

Fragment Activity Profiles

• Using scaffolds in HTS triage usually leads to two questions

– What is known about the chemical series with respect to the intended target?

– What compound classes are known to modulate the intended target, and how similar are they to the series in question?

• We’re interested in exploring summaries of activity, grouped by scaffolds and targets

Page 11: Smashing Molecules - ChemAxon

Fragment Activity Profiles

• We use ChEMBL (release 08) as the source of bioactivity data across multiple targets

• Preprocess the database

– Generate scaffolds (exhaustive enumeration of combinations of SSSR rings)

– Normalize activity data so that we can compare the activity of a molecule across different assays
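
One possible reading of "exhaustive enumeration of combinations of SSSRs" is sketched below. It assumes rings are already perceived and given as sets of atom indices, and it treats rings as joined only when they share an atom. How linker-connected rings are handled is not stated in the slides, so this should not be taken as the authors' fragmenter.

```python
from itertools import combinations


def connected(rings):
    """True if the rings form one connected system via shared atoms."""
    remaining = list(rings[1:])
    reached = set(rings[0])
    changed = True
    while changed:
        changed = False
        for ring in remaining[:]:
            if reached & ring:
                reached |= ring
                remaining.remove(ring)
                changed = True
    return not remaining


def ring_combinations(sssr):
    """Yield the atom set of every connected combination of SSSR rings."""
    for k in range(1, len(sssr) + 1):
        for combo in combinations(sssr, k):
            if connected(combo):
                yield frozenset().union(*combo)


# Hypothetical molecule: two fused six-membered rings plus one isolated ring.
rings = [
    frozenset({0, 1, 2, 3, 4, 5}),
    frozenset({4, 5, 6, 7, 8, 9}),
    frozenset({11, 12, 13, 14, 15, 16}),
]
for atoms in ring_combinations(rings):
    print(sorted(atoms))
```

The number of combinations grows exponentially with the ring count, so a real implementation would likely cap the number of rings considered per molecule.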

Page 12: Smashing Molecules - ChemAxon

Database Setup

• Preprocessing steps available as a Java servlet

– http://tripod.nih.gov/files/chembl-servlets.zip

• Need ChEMBL installed in Oracle; we add some extra tables

– Fragment structures and computed properties

– Aggregated assay activity summary

• Only consider assays with uncensored IC50 values in nM, more than 5 observations, and a median absolute deviation (MAD) > 0

– Activities are converted to (robust) z-scores
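
The assay filter and robust z-scoring can be sketched as follows. This is a minimal version that assumes the z-score is (value - median) / MAD; the slides do not say whether values were log-transformed first or whether the MAD was rescaled (for example by the usual 1.4826 factor), so both are left out.

```python
from statistics import median


def robust_z_scores(values, min_obs=6):
    """Return (x - median) / MAD per value, or None if the assay is filtered out."""
    if len(values) < min_obs:      # slide: more than 5 observations
        return None
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:                   # slide: require MAD > 0
        return None
    return [(v - center) / mad for v in values]


# Hypothetical IC50 values (nM) from a single assay.
ic50_nm = [12.0, 15.0, 20.0, 22.0, 30.0, 45.0, 900.0]
print(robust_z_scores(ic50_nm))
```

Using the median and MAD keeps a single extreme value, such as the 900 nM measurement here, from distorting the scale for the remaining compounds.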

Page 13: Smashing Molecules - ChemAxon

Some Fragment Statistics

• Considered Z-score range of -40 to 15

• There were 12,887 molecules lying outside this range

[Figure: histogram of the percentage of assays against log(Number of molecules), and histogram of the number of compounds against Z-score]

Page 14: Smashing Molecules - ChemAxon

Some Fragment Statistics

• Next, identify fragments with 8 to 20 atoms and occurring in 100 to 900 molecules

• Gives us 1,746 fragments

[Figure: histogram of the percentage of fragments against the number of molecules per fragment]
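
The selection on this slide amounts to a two-sided filter on fragment size and occurrence. The sketch below assumes inclusive bounds and a hypothetical `Fragment` record; in the original setup this would more likely be a query against the fragment tables.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    frag_id: int         # hypothetical identifier
    atom_count: int      # number of atoms in the fragment
    molecule_count: int  # number of molecules containing the fragment


def select_fragments(fragments, atoms=(8, 20), molecules=(100, 900)):
    """Keep fragments whose size and occurrence fall inside the given ranges."""
    return [f for f in fragments
            if atoms[0] <= f.atom_count <= atoms[1]
            and molecules[0] <= f.molecule_count <= molecules[1]]


candidates = [
    Fragment(1, 7, 500),
    Fragment(2, 8, 100),
    Fragment(3, 20, 900),
    Fragment(4, 12, 901),
    Fragment(5, 21, 300),
]
print([f.frag_id for f in select_fragments(candidates)])
```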

Page 15: Smashing Molecules - ChemAxon

Some Fragment Statistics

• We can query the fragment tables to get activity summaries for individual fragments

• For these examples we consider the full range of Z-scores

[Figure: Z-score histograms (y-axis: Percent of Total) in nine panels, labeled 778 (N = 1280), 2723 (N = 1918), 4058 (N = 2641), 5390 (N = 1489), 5486 (N = 1578), 13485 (N = 1455), 40169 (N = 1457), 64473 (N = 1595), and 115654 (N = 1515)]

Page 16: Smashing Molecules - ChemAxon

Exploring Activity Profiles

• Fragments from ChEMBL

• Activity distributions of parent molecules across all targets

• Z-scores for individual molecules against a specific target

Page 17: Smashing Molecules - ChemAxon

Exploring Activity Profiles

• The user can draw a molecule and fragment it on the fly

• Use generated fragments to create activity histograms

Page 18: Smashing Molecules - ChemAxon

Target Selection

• Employs the ChEMBL target hierarchy

• Can select target families or individual targets

Page 19: Smashing Molecules - ChemAxon

Similar Fragments with Similar Profiles?

• Consider 658 fragments with > 10 atoms and occurring in 500 to 1200 molecules

• Overall, the fragments tend to be dissimilar

– 95th percentile is just 0.50

• 1,873 pairs do exhibit Tc > 0.8

[Figure: histogram of the percentage of pairs against Tanimoto similarity (0.0 to 1.0)]
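
The pairwise comparison can be sketched as below, representing each fingerprint as the set of its on-bit positions. The fingerprint type is not stated in the slides, so the bit sets here are hypothetical, and the nearest-rank percentile is just one of several common definitions.

```python
import math
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared on-bits over the union of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pairwise_tanimoto(fingerprints):
    """Map each unordered pair of fragment keys to its Tanimoto coefficient."""
    return {(i, j): tanimoto(fingerprints[i], fingerprints[j])
            for i, j in combinations(sorted(fingerprints), 2)}


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical fingerprints given as sets of on-bit positions.
fps = {
    "f1": frozenset(range(1, 10)),
    "f2": frozenset(range(1, 11)),
    "f3": frozenset({20, 21, 22}),
}
sims = pairwise_tanimoto(fps)
similar_pairs = [pair for pair, tc in sims.items() if tc > 0.8]
print(similar_pairs, percentile(list(sims.values()), 95))
```

For 658 fragments there are 658 × 657 / 2 = 216,153 unordered pairs, so the 1,873 pairs with Tc > 0.8 are a little under 1% of the total.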

Page 20: Smashing Molecules - ChemAxon

Comparing Activity Profiles

• Compare activity profiles with the K-S statistic

• Color corresponds to p-value of the K-S test

• No obvious correlation between fragment similarity & activity profile similarity

• Probably not rigorous when a scaffold has few parent molecules

[Figure: K-S statistic (0.0 to 0.6) against Tanimoto similarity (0.80 to 1.00), with points colored by the p-value of the K-S test (0.0 to 1.0)]
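
For reference, the two-sample K-S statistic is the largest vertical gap between the empirical distribution functions of the two samples. A minimal, dependency-free sketch follows; the Z-score lists are hypothetical, and the p-value calculation is omitted.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # Both empirical CDFs are step functions, so the largest gap is attained
    # at one of the pooled sample points.
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


# Hypothetical Z-score profiles for two fragments.
profile_1 = [1.0, 2.0, 3.0, 4.0]
profile_2 = [3.0, 4.0, 5.0, 6.0]
print(ks_statistic(profile_1, profile_2))
```

If SciPy is available, `scipy.stats.ks_2samp` returns both the statistic and a p-value. With only a handful of parent molecules per fragment the test has little power, which matches the caveat on the slide.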

Page 21: Smashing Molecules - ChemAxon

Exploring Profiles for Fragment Pairs

• Compare activity distributions across all targets in a pairwise fashion

• Can also generate comparison for a single target, but requires data for all the fragments

Page 22: Smashing Molecules - ChemAxon

Looking for Selective Fragments

• Interesting to visually explore fragment pairs

• Can become tedious, especially in a database as big as ChEMBL

• Can we automate this type of analysis?

– Identify fragment pairs with very different activity distributions?

– Identify fragments with a preference for a certain target (class)?

Page 24: Smashing Molecules - ChemAxon

Targetwise Activity Profiles

• Identified benzylpyrrolidine as a fragment with preference for a specific target class

• But reported as dopamine agonists

Page 25: Smashing Molecules - ChemAxon

Fragment or Scaffold?

• I’ve been using fragment & scaffold interchangeably – not always true

• Chemists have an intuitive idea of what a scaffold is

• Can we encode the idea of scaffold-like or fragment-like?

• We use the concept of Signal-to-Noise Ratio

$$\mathrm{SNR} = \frac{\mu}{\sigma}$$

where $\mu$ is the size of the fragment and $\sigma$ is the SD of the number of atoms not in the fragment, considered over the parent molecules
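
As a worked illustration of this ratio, the sketch below computes the SNR from a fragment size and the atom counts of its parent molecules. It assumes the population standard deviation and returns infinity when all parents have the same number of extra atoms; the slides do not specify either convention, and the numbers are hypothetical.

```python
from statistics import pstdev


def fragment_snr(fragment_size, parent_sizes):
    """Fragment size divided by the SD of the atoms lying outside the fragment."""
    extra = [n - fragment_size for n in parent_sizes]  # atoms not in the fragment
    sd = pstdev(extra)                                 # assumption: population SD
    return float("inf") if sd == 0 else fragment_size / sd


# Hypothetical 10-atom fragment found in parents with 12, 14, and 16 atoms.
print(round(fragment_snr(10, [12, 14, 16]), 2))
```

A fragment whose parents all carry a similar amount of decoration gets a small SD and hence a large SNR, while a fragment buried in parents of widely varying size gets a small SNR.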

Page 26: Smashing Molecules - ChemAxon

Fragment or Scaffold

• Partial distribution of SNR values for fragments with atom count > 8 & < 20

[Figure: histogram of the percentage of fragments against SNR (0 to 6)]

Page 27: Smashing Molecules - ChemAxon

Fragment or Scaffold

• Large SNRs are associated with Murcko-like fragments

• A useful SNR cutoff is an open question

[Figure: example fragments annotated with SNR = 8.50, 12.09, and 9.10, and with SNR = 0.36, 0.43, and 0.83]

Page 28: Smashing Molecules - ChemAxon

Activity Profiles & SNR

• Given a fragment, evaluate SD of the number of atoms in the parent molecules that are not part of the fragment

• Label the parent molecules based on

– If number of atoms not in the fragment > SD, non core-like

– Otherwise core-like

• Visualize the activity distributions of the parent molecules, grouped by the label
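
The labeling rule can be sketched as follows. The molecule identifiers, atom counts, and Z-scores are hypothetical, and the population standard deviation is an assumption; the grouping helper simply collects Z-scores per label so that they can be plotted as two histograms.

```python
from statistics import pstdev


def label_parents(fragment_size, parent_sizes):
    """Label each parent by comparing its extra atoms with the SD over all parents."""
    extra = {mol: n - fragment_size for mol, n in parent_sizes.items()}
    sd = pstdev(list(extra.values()))  # assumption: population SD
    return {mol: "non core-like" if e > sd else "core-like"
            for mol, e in extra.items()}


def group_z_scores(labels, z_scores):
    """Collect the Z-scores of the parent molecules under each label."""
    groups = {"core-like": [], "non core-like": []}
    for mol, label in labels.items():
        groups[label].extend(z_scores.get(mol, []))
    return groups


# Hypothetical parents of a 10-atom fragment and their Z-scores.
parents = {"A": 11, "B": 12, "C": 30}
scores = {"A": [-1.2, 0.3], "B": [0.8], "C": [-6.5]}
labels = label_parents(10, parents)
print(labels)
print(group_z_scores(labels, scores))
```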

Page 29: Smashing Molecules - ChemAxon

Activity Profiles & SNR

[Figure: Z-score histograms (y-axis: Percentage of Total) of parent molecules, split into "Core-like" and "Not core-like" panels, for fragments labeled 20967 and 44591 (Z-scores from -50 to 50) and 801 and 68604 (Z-scores from -30 to 10); the rows are labeled High SNR and Low SNR]

Page 30: Smashing Molecules - ChemAxon

Downloads

• Scaffold activity networks

• Fragment Activity Profiler

– SQL & servlet sources

– Client sources

– Online version

Page 31: Smashing Molecules - ChemAxon
Page 32: Smashing Molecules - ChemAxon

Fragment Diversity

• Consider a set of bioactives such as the LOPAC collection (1,280 compounds)

• Using exhaustive fragmentation we get 2,460 unique fragments

• On the MLSMR (~372K compounds), we get 164,583 fragments

[Figure: histogram of the percent of total fragments against log fragment frequency (0 to 4)]
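
Counting unique fragments and their frequencies over a collection can be sketched as below. It assumes each fragment is identified by a canonical key such as a SMILES string and that a fragment is counted once per molecule; the tiny library is hypothetical.

```python
from collections import Counter
from math import log10


def fragment_frequencies(fragments_by_molecule):
    """Count, for each fragment key, the number of molecules that contain it."""
    counts = Counter()
    for fragments in fragments_by_molecule.values():
        counts.update(set(fragments))  # count each fragment once per molecule
    return counts


# Hypothetical molecules mapped to the fragments generated from them.
library = {
    "mol1": ["c1ccccc1", "c1ccncc1"],
    "mol2": ["c1ccccc1"],
    "mol3": ["c1ccccc1", "C1CCNCC1", "C1CCNCC1"],
}
freq = fragment_frequencies(library)
print(len(freq))  # number of unique fragments
print({frag: round(log10(n), 2) for frag, n in freq.items()})
```

The log-transformed counts are the quantity plotted on the x-axis of the frequency histogram.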

Page 33: Smashing Molecules - ChemAxon

Fragment Diversity

• Distribution of MLSMR fragments in BCUT space

[Figure: two plots of PC 2 against PC 1, one for all fragments and one for fragments occurring in 5 to 50 molecules]