Smashing Molecules: How Molecular Fragments Allow Us to Explore Large Chemical Spaces
Rajarshi Guha & Trung Nguyen
NIH Center for Translational Therapeutics
Chemaxon UGM
September 2011
Outline
• Fragments as the building blocks of chemistry
• Fragments and SAR
• Fragments and activity profiles
Big Data for Some Problems
• Halevy et al. discuss the effectiveness of extremely large datasets
• Their application focuses on machine translation (see the Google n-gram corpus)
• They suggest that such extremely large datasets are useful because they effectively encompass all n-grams (phrases) commonly used
• Domain is relatively constrained
Halevy et al, IEEE Intelligent Systems, 2009, 24, 8-12
Google Scale in Chemistry?
• What would be the equivalent of an n-gram corpus in chemistry?
– Fragments
– A more direct analogy can be made by using LINGOs
• It is possible to generate arbitrarily large (virtual) compound and fragment collections
• But would such a collection span all of “commonly used” chemistry?
– Depending on the initial compound set, yes
– But we’re also interested in going beyond such a “commonly used” set
Fink T, Reymond JL, J Chem Inf Model, 2007, 47, 342
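To make the n-gram analogy concrete, here is a minimal sketch of LINGO-style profiles: overlapping q-character substrings of a SMILES string, compared with a multiset Tanimoto coefficient. The function names and the choice q = 4 are assumptions for illustration, and the SMILES normalization applied in the published LINGO method is omitted.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count the overlapping q-character substrings of a SMILES string."""
    if len(smiles) < q:
        return Counter()
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_tanimoto(a: str, b: str, q: int = 4) -> float:
    """Multiset Tanimoto between two LINGO profiles (hypothetical helper)."""
    la, lb = lingos(a, q), lingos(b, q)
    shared = sum((la & lb).values())  # element-wise minimum of counts
    total = sum((la | lb).values())   # element-wise maximum of counts
    return shared / total if total else 0.0


if __name__ == "__main__":
    print(lingos("CCCCO"))
    print(lingo_tanimoto("CCCCO", "CCCCN"))
```

As with word n-grams, the profile needs only string operations, so it scales to very large collections without any structure perception.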
What Do We Do with Fragments?
• Assuming we obtain fragments from a large enough collection, what do we do with them?
– Learning from fragments – QSARs, generative models
– Use fragments as filters, alternative to clustering
– Explore chemotypes and activity
– Scaffold level promiscuity
White, D and Wilson, RC, J Chem Inf Model, 2010, 50, 1257-1274
Scaffold Activity Diagrams
• Network-oriented view of fragment (scaffold) collections
– Similar in idea to Scaffold Hunter etc.
– Not purely hierarchical
• Color by arbitrary properties
• Quickly assess utility of a scaffold
• Try it online
What Makes a Good Scaffold?
• What makes a good scaffold?
– Size, complexity, …
– Do the members represent an SAR or not?
– Intuition and experience also play a role
Scaffold QSAR
Evaluate topological and physicochemical descriptors for the R-groups
Fit PLS or ridge regression model
Characterize the SAR landscape
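As a rough illustration of the fitting step, the sketch below solves the ridge regression normal equations for a small R-group descriptor matrix in plain Python. The function name, the penalty value, and the toy data are assumptions; no intercept is fitted, so the activities are assumed to be centered, and this is not the authors' implementation.

```python
def ridge_fit(X, y, lam=1.0):
    """Solve (X^T X + lam * I) w = X^T y by Gaussian elimination.

    X is a list of rows (one per molecule, one column per R-group descriptor),
    y is the list of activities, lam > 0 is the ridge penalty.
    """
    n, p = len(X), len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    # Forward elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    w = [0.0] * p
    for i in range(p - 1, -1, -1):
        s = sum(A[i][j] * w[j] for j in range(i + 1, p))
        w[i] = (b[i] - s) / A[i][i]
    return w


if __name__ == "__main__":
    # One molecule, three descriptors: more features than observations
    print(ridge_fit([[1.0, 0.0, 1.0]], [3.0], lam=1.0))
```

Because the penalty makes the system positive definite, a solution exists even when there are more descriptors than molecules.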
Scaffold QSAR - Drawbacks
• Many scaffolds have few (5 to 10) members
• Invariably, more features than observations
• If the number of R-groups is large, the feature matrix can be very sparse
– Less of a problem for combinatorial libraries
• A linear fit may not be the best approach to correlating R-groups to the activities
– Difficult to choose a model type a priori
Fragment Activity Profiles
• Using scaffolds in HTS triage usually leads to two questions
– What is known about the chemical series with respect to the intended target?
– What compound classes are known to modulate the intended target, and how similar are they to the series in question?
• We’re interested in exploring summaries of activity, grouped by scaffolds and targets
Fragment Activity Profiles
• We use ChEMBL (08) as the source of bioactivity across multiple targets
• Preprocess the database
– Generate scaffolds (exhaustive enumeration of combinations of SSSR rings)
– Normalize activity data so that we compare the activity of a molecule across different assays
Database Setup
• Preprocessing steps available as a Java servlet
– http://tripod.nih.gov/files/chembl-servlets.zip
• Need ChEMBL installed in Oracle; we add some extra tables
– Fragment structures and computed properties
– Aggregated assay activity summary
• Only consider assays with IC50 values in nM, uncensored data, more than 5 observations, and a MAD > 0
– (Robust) z-scored activities
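The assay filter and robust z-scoring described above could look roughly like the following. This is a sketch under assumptions: the function name is hypothetical, the MAD is used unscaled, and any transformation of the IC50 values (for example, taking logs) is left to the caller because the slides do not specify it.

```python
from statistics import median


def robust_z(values, min_obs=6):
    """Return (x - median) / MAD for each value in one assay.

    Returns None when the assay fails the filters: it needs more than
    5 observations and a MAD greater than 0.
    """
    values = list(values)
    if len(values) < min_obs:
        return None
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad <= 0:
        return None
    return [(v - med) / mad for v in values]


if __name__ == "__main__":
    print(robust_z([1, 2, 3, 4, 5, 6, 100]))
```

Using the median and MAD rather than the mean and SD keeps a single extreme measurement from compressing the scores of every other compound in the assay.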
Some Fragment Statistics
• Considered Z-score range of -40 to 15
• There were 12,887 molecules lying outside this range
[Figure: histogram of percentage of assays vs. log(number of molecules); histogram of number of compounds vs. Z-score]
Some Fragment Statistics
• Next, identify fragments with 8 to 20 atoms and occurring in 100 to 900 molecules
• Gives us 1,746 fragments
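The selection step amounts to a simple range filter over the fragment table. In the sketch below the `Fragment` record and its field names are hypothetical, and both bounds are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    frag_id: int      # hypothetical identifier
    n_atoms: int      # atoms in the fragment
    n_molecules: int  # parent molecules containing the fragment


def select_fragments(frags, atoms=(8, 20), mols=(100, 900)):
    """Keep fragments whose size and frequency fall within the given ranges."""
    return [f for f in frags
            if atoms[0] <= f.n_atoms <= atoms[1]
            and mols[0] <= f.n_molecules <= mols[1]]


if __name__ == "__main__":
    table = [Fragment(1, 6, 500), Fragment(2, 12, 250), Fragment(3, 15, 5000)]
    print(select_fragments(table))
```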
[Figure: histogram of percentage of fragments vs. number of molecules]
Some Fragment Statistics
• We can query the fragment tables to get activity summaries for individual fragments
• For these examples we consider the full range of Z-scores
[Figure: histograms of percent of total vs. Z-score, one panel each labeled 778 (N = 1280), 2723 (N = 1918), 4058 (N = 2641), 5390 (N = 1489), 5486 (N = 1578), 13485 (N = 1455), 40169 (N = 1457), 64473 (N = 1595), and 115654 (N = 1515)]
Exploring Activity Profiles
[Screenshot: panels showing fragments from ChEMBL, activity distributions of parent molecules across all targets, and Z-scores for individual molecules against a specific target]
Exploring Activity Profiles
• User can draw a molecule and fragment on the fly
• Use generated fragments to create activity histograms
Target Selection
• Employs the ChEMBL target hierarchy
• Can select target families or individual targets
Similar Fragments with Similar Profiles?
• Consider 658 fragments with > 10 atoms and occurring in 500 to 1200 molecules
• Overall, the fragments tend to be dissimilar
– 95th percentile is just 0.50
• 1,873 pairs do exhibit Tc > 0.8
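For reference, the pairwise comparison can be sketched as follows, with each fingerprint represented as a set of on-bit indices. The fingerprint type actually used is not stated in the slides, so the set representation and the function names are assumptions.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_pairs(fps: dict, cutoff: float = 0.8):
    """Return (id1, id2, Tc) for every fragment pair with Tc above the cutoff."""
    pairs = []
    for i, j in combinations(sorted(fps), 2):
        tc = tanimoto(fps[i], fps[j])
        if tc > cutoff:
            pairs.append((i, j, tc))
    return pairs


if __name__ == "__main__":
    fps = {"A": {1, 2, 3, 4, 5}, "B": {1, 2, 3, 4, 5, 6}, "C": {7, 8}}
    print(similar_pairs(fps))
```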
[Figure: histogram of percentage of pairs vs. Tanimoto similarity]
Comparing Activity Profiles
• Compare activity profiles with the K-S statistic
• Color corresponds to p-value of the K-S test
• No obvious correlation between fragment similarity & activity profile similarity
• Probably not rigorous when a scaffold has few parent molecules
[Figure: K-S statistic vs. Tanimoto similarity (0.80 to 1.00), points colored by K-S test p-value]
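The two-sample K-S statistic is the largest gap between the empirical cumulative distributions of two activity profiles. A minimal sketch, assuming each profile is a non-empty list of Z-scores (the function name is hypothetical and no p-value is computed here):

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Maximum absolute difference between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # The supremum is attained at one of the observed values
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # fraction of x values <= v
        fy = bisect_right(ys, v) / len(ys)  # fraction of y values <= v
        d = max(d, abs(fx - fy))
    return d


if __name__ == "__main__":
    print(ks_statistic([1, 2], [2, 3]))
```

With only a handful of parent molecules per fragment the empirical CDFs are coarse step functions, which is one reason the comparison is less reliable for sparsely populated scaffolds.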
Exploring Profiles for Fragment Pairs
• Compare activity distributions across all targets in a pairwise fashion
• Can also generate comparison for a single target, but requires data for all the fragments
Looking for Selective Fragments
• Interesting to visually explore fragment pairs
• Can become tedious, especially in a database as big as ChEMBL
• Can we automate this type of analysis?
– Identify fragment pairs with very different activity distributions?
– Identify fragments with a preference for a certain target (class)?
Targetwise Activity Profiles
• Evaluate mean activity of parent molecules within a target class
• Count number of parent molecules tested against the target
• Selectivity of 1-phenylimidazole for CYP450 has been noted
Wilkinson et al, Biochem Pharmacol, 1983, 32, 997-1003
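The two summaries per target class, mean activity and number of parent molecules tested, can be sketched as a single pass over activity records. The record layout `(molecule_id, target_class, z_score)` and the function name are assumptions for illustration.

```python
from collections import defaultdict


def targetwise_profile(records):
    """Map each target class to (mean Z-score, number of distinct molecules)."""
    scores = defaultdict(list)
    molecules = defaultdict(set)
    for mol_id, target_class, z in records:
        scores[target_class].append(z)
        molecules[target_class].add(mol_id)
    return {t: (sum(v) / len(v), len(molecules[t])) for t, v in scores.items()}


if __name__ == "__main__":
    records = [
        ("m1", "CYP450", -4.0),
        ("m2", "CYP450", -2.0),
        ("m1", "Kinase", 0.5),
    ]
    print(targetwise_profile(records))
```

Reporting the count alongside the mean helps separate a genuine preference for a target class from a mean that rests on only one or two tested molecules.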
Targetwise Activity Profiles
• Identified benzylpyrrolidine as a fragment with preference for a specific target class
• But reported as dopamine agonists
Fragment or Scaffold?
• I’ve been using “fragment” and “scaffold” interchangeably, but the two are not always the same
• Chemists have an intuitive idea of what a scaffold is
• Can we encode the idea of scaffold-like or fragment-like?
• We use the concept of Signal-to-Noise Ratio
$$\mathrm{SNR} = \frac{\mu}{\sigma}$$
where $\mu$ is the size of the fragment and $\sigma$ is the SD of the number of atoms not in the fragment, considered over the parent molecules
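A minimal sketch of the SNR calculation, assuming sizes are atom counts and that the sample standard deviation is meant (the slides do not say which SD estimator is used). The function name and the handling of a zero SD are also assumptions.

```python
from statistics import stdev


def fragment_snr(fragment_size, parent_sizes):
    """SNR = fragment size / SD of the atoms outside the fragment.

    parent_sizes holds the atom counts of the parent molecules;
    at least two parents are needed for the sample SD.
    """
    extra = [n - fragment_size for n in parent_sizes]
    sd = stdev(extra)
    # Assumption: if every parent has the same number of extra atoms,
    # report an infinite SNR rather than dividing by zero
    return fragment_size / sd if sd > 0 else float("inf")


if __name__ == "__main__":
    # 10-atom fragment found in parents of 12, 14 and 16 atoms
    print(fragment_snr(10, [12, 14, 16]))
```

A large fragment whose parents differ by only a few atoms scores high, while a small fragment buried in parents of widely varying size scores low.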
Fragment or Scaffold
• Partial distribution of SNR values for fragments with atom count > 8 & < 20
[Figure: histogram of percentage of fragments vs. SNR]
• Large SNRs associated with Murcko-like fragments
• A useful SNR cutoff is an open question
Fragment or Scaffold
[Figure: example fragment structures with SNR = 8.50, 12.09, and 9.10, and with SNR = 0.36, 0.43, and 0.83]
Activity Profiles & SNR
• Given a fragment, evaluate SD of the number of atoms in the parent molecules that are not part of the fragment
• Label the parent molecules based on
– If number of atoms not in the fragment > SD, non core-like
– Otherwise core-like
• Visualize the activity distributions of the parent molecules, grouped by the label
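The labeling rule above can be sketched as follows. The input layout (a mapping from molecule identifier to atom count), the function name, and the use of the sample SD are assumptions.

```python
from statistics import stdev


def label_parents(fragment_size, parents):
    """Label each parent molecule as core-like or not core-like.

    parents maps a molecule id to its atom count. A parent is
    'not core-like' when its atoms outside the fragment exceed the SD
    of that quantity over all parents.
    """
    extra = {m: n - fragment_size for m, n in parents.items()}
    sd = stdev(list(extra.values()))
    return {m: ("not core-like" if e > sd else "core-like")
            for m, e in extra.items()}


if __name__ == "__main__":
    print(label_parents(10, {"a": 11, "b": 12, "c": 30}))
```

The resulting labels can be used directly as the grouping variable when plotting the Z-score distributions of the parent molecules.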
[Figure: histograms of percentage of total vs. Z-score for panels labeled 20967, 44591, 801, and 68604, each split into core-like and not core-like parent molecules]
Activity Profiles & SNR
[Figure: example panels labeled High SNR and Low SNR]
Downloads
• Scaffold activity networks
• Fragment Activity Profiler
– SQL & servlet sources
– Client sources
– Online version
Fragment Diversity
• Consider a set of bioactives such as the LOPAC collection (1,280 compounds)
• Using exhaustive fragmentation we get 2,460 unique fragments
• On the MLSMR (~ 372K compounds), we get 164,583 fragments
[Figure: histogram of percent of total vs. log fragment frequency; scatter plot of PC 2 vs. PC 1]
Fragment Diversity
• Distribution of MLSMR fragments in BCUT space
[Figure: PC 2 vs. PC 1 plots for all fragments and for fragments occurring in 5 to 50 molecules]