Fiorella Ruggiu, Gilles Marcou, Alexandre Varnek & Dragos Horvath
Laboratoire d’InfoChimie
UMR 7177 CNRS – Université de Strasbourg
Institut de Chimie, 4, rue Blaise Pascal, 67000 Strasbourg, FR
Fragment Descriptors – a Golden Standard in Drug Design
O-C*C*C-N 1
O-C*N*C-N 1
…
Example: ISIDA Sequence and Augmented Atom counts
Pros: Open-ended, comprehensive & intuitive capture of structural information (atoms & bonds – scaffold-oriented)
Cons: Atom symbols are not informative of the actual chemical context of the atom. Context information is not lost, but dispersed…
Insensitive to the actual ionization status
Fuzzy pH-dependent Pharmacophore Triplets (FPT)
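[Figure: example pharmacophore triplets with their topological edge lengths, each incrementing the matching basis triplets of the fingerprint: 0 0 0 … 0 0 … +6 … … +3 … … … … 0 …]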
Di(m) = total occupancy of basis triplet i in molecule m.
Pros: Explicit labeling of groups by their physico-chemical nature, pH-sensitive, fuzzy (scaffold hopping)
Cons: Fixed format/sizes
Property-Labeled Fragments: Combining the best of two worlds
Open-ended enumeration of linear (Sequences) or branched (Augmented Atoms) fragments, while labeling atoms by their context-dependent physico-chemical properties:
Pharmacophore Type
Gasteiger Charge-based Topological Potential
logP contributions (in progress)
pH-dependence granted by fingerprinting each populated microspecies, then returning the population-weighted average fingerprint
Fuzziness supported by the use of “wildcard” atoms
Workflow
ChemAxon-readable compound set
→ Type: ChemAxon API-based Java tool
→ Annotated .sd file with typing info
→ FragType: Free Pascal program
→ Fragment counts
The Typing Tool
For each molecule
    For each µspecies of population level >1%
        Submit to pKa plugin
        Submit to Standardizer
        Submit to pmapper → Pharmacophore Types (aRomatic, Hydrophobic, Acceptor, Donor, Negative, Positive)
        Submit to pmapper → Force Field Types (CVFF types)
        Submit to charge plugin → Gasteiger Charges → Electrostatic Potential Flags
    End µspecies loop
    Add population value & flag strings as new property fields of molecule
    Write molecule with property fields
End loop
If exception, retry without TakeResonantStructures
Example of the annotated output (two microspecies, populations 95% and 5%):
....
M  END
> <POP1>95
> <PHTYP1>H;H;A;D;R;R;R;R;A/D;R;R
> <FFTYP1>c3;c';o';n;cp;cp;cp;cp;oh;cp;cp
> <EPTYP1>0;n/0;N;0;n/0;n/0;n/0;n/0;N;n/0;n/0
> <POP2>5
> <PHTYP2>H;H;A;D;R;R;R;R;A/N;R;R
....
$$$$
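The typing fields written by the typing tool can be read back with a few lines of Python. This is a minimal sketch, assuming one label per atom separated by `;`, alternative types of one atom separated by `/`, and the value either on the header line (as printed on the slide) or on the following line (as in standard SD files). `parse_typing_fields` is a hypothetical helper, not part of the Type or FragType tools.

```python
import re

def parse_typing_fields(sd_block: str) -> dict:
    """Collect the '> <TAG>' property fields of one SD record (hypothetical helper)."""
    fields = {}
    lines = sd_block.splitlines()
    for idx, line in enumerate(lines):
        match = re.match(r">\s*<(\w+)>\s*(.*)", line)
        if not match:
            continue
        tag, inline = match.group(1), match.group(2).strip()
        # The value sits on the header line or, failing that, on the next line.
        value = inline or (lines[idx + 1].strip() if idx + 1 < len(lines) else "")
        if tag.startswith("POP"):
            fields[tag] = float(value)  # population level in percent
        else:
            # One entry per atom; each atom may carry several alternative types.
            fields[tag] = [atom.split("/") for atom in value.split(";")]
    return fields

record = "> <POP1>95\n> <PHTYP1>H;H;A;D;R;R;R;R;A/D;R;R\n"
typing = parse_typing_fields(record)
print(typing["POP1"], typing["PHTYP1"][8])  # 95.0 ['A', 'D']
```

Keeping each atom as a list of types preserves the "one atom may represent several types" information that the fragmentor needs later.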
Topological electrostatic potentials
For each atom i, the potential Vi is computed with qj the partial charge of atom j, dij the topological distance, and d0 the “own field distance” (0.4).
Depending on Vi, atom i is classified as: N (strongly negative), n (negative), 0 (neutral), p (positive), P (strongly positive). Border zones carry both flags:
Range of V          Flag
below -0.32         N
-0.32 to -0.28      N/n
-0.28 to -0.12      n
-0.12 to -0.08      n/0
-0.08 to 0.08       0
0.08 to 0.12        0/p
0.12 to 0.28        p
0.28 to 0.32        p/P
above 0.32          P
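The flagging step can be sketched in Python. The expression for the potential did not survive in the slide text, so the form used here (own charge divided by d0, plus every other charge divided by its topological distance) is an assumption suggested by the "own field distance" wording. The thresholds are read off the scale above, and the handling of values that fall exactly on a threshold is also an assumption.

```python
def topological_potential(charges, dist, d0=0.4):
    """ASSUMED form: V_i = q_i/d0 + sum over j != i of q_j / d_ij."""
    n = len(charges)
    return [
        charges[i] / d0
        + sum(charges[j] / dist[i][j] for j in range(n) if j != i)
        for i in range(n)
    ]

# (upper bound, flag) pairs read off the threshold scale.
_SCALE = [(-0.32, "N"), (-0.28, "N/n"), (-0.12, "n"), (-0.08, "n/0"),
          (0.08, "0"), (0.12, "0/p"), (0.28, "p"), (0.32, "p/P")]

def ep_flag(v):
    """Return the electrostatic potential flag; border zones yield two types."""
    for upper, flag in _SCALE:
        if v < upper:
            return flag
    return "P"

# Toy two-atom example with partial charges -0.1 and +0.1, one bond apart.
potentials = topological_potential([-0.1, 0.1], [[0, 1], [1, 0]])
print([ep_flag(v) for v in potentials])  # ['n', 'p']
```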
Molecular Fingerprint
Microspecies-Specific Labeling of Fragments…
[Figure: two microspecies of the example molecule with their per-atom pharmacophore labels. Population: 95% and 5%]
Fragments of the 95% microspecies:
A-R*R*R*R-D +95
D-R*R*R*R-D +95
A-R*R*R*R-D +95
D-R*R*R*R-D +95
A-R*R*A*R-D +95
D-R*R*A*R-D +95
…
Fragments of the 5% microspecies:
N-R*R*R*R-D +5
N-R*R*A*R-D +5
…
µSpecies increment the counters of the fragments they contain by their population levels.
Lower & upper fragment sizes are user-defined.
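The population-weighted counting can be sketched as follows. This is an illustration only: the fragment lists are typed in by hand from the example, whereas FragType enumerates them from the annotated molecule, and `weighted_fingerprint` is a hypothetical name.

```python
from collections import Counter

def weighted_fingerprint(microspecies):
    """microspecies: list of (population in percent, list of fragment strings)."""
    fingerprint = Counter()
    for population, fragments in microspecies:
        for fragment in fragments:
            # Each occurrence adds the population level of its microspecies.
            fingerprint[fragment] += population
    return fingerprint

species = [
    (95, ["A-R*R*R*R-D", "D-R*R*R*R-D", "A-R*R*A*R-D"]),
    (5, ["N-R*R*R*R-D", "N-R*R*A*R-D"]),
]
fp = weighted_fingerprint(species)
print(fp["A-R*R*R*R-D"], fp["N-R*R*R*R-D"])  # 95 5
```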
Sequencing Options: (1) The Bond Information Toggle
User may decide to capture (-b flag) or ignore bond information.
[Figure: example ring with atom labels R, R/A, A/D, R, R, D]
With Bond Info:
A-R*R*R*R-D
D-R*R*R*R-D
A-R*R*A*R-D
D-R*R*A*R-D
…
Without Bond Info:
ARRRRD
DRRRRD
ARRARD
DRRARD
…
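The effect of the bond toggle on the fragment string can be illustrated with a short sketch. `render_sequence` is a hypothetical helper; the bond symbols are simply those appearing in the examples.

```python
def render_sequence(atom_types, bonds, with_bonds=True):
    """atom_types: labels along the path; bonds: symbols between consecutive atoms."""
    if not with_bonds:
        return "".join(atom_types)
    out = atom_types[0]
    for bond, atom in zip(bonds, atom_types[1:]):
        out += bond + atom
    return out

path = ["A", "R", "R", "R", "R", "D"]
bonds = ["-", "*", "*", "*", "-"]
print(render_sequence(path, bonds))                    # A-R*R*R*R-D
print(render_sequence(path, bonds, with_bonds=False))  # ARRRRD
```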
Sequencing Options: (2) Wildcard atoms for Fuzziness Control
[Figure: example ring with atom labels R, R, A/D, R, R, D]
With the wildcard option (-w flag), non-terminal sequence atoms are also matched by the generic wildcard type “?”.
Strict Typing:
ARRRRD
DRRRRD
…
Wildcards Allowed:
ARRRRD A?RRRD AR?RRD ARR?RD ARRR?D
A??RRD A?R?RD A?RR?D AR??RD AR?R?D ARR??D
AR???D A??R?D AR???D
A????D …
DRRRRD D?RRRD … …
…
A4D: pair counts may be explicitly (and exclusively) generated by the fragmentor.
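The wildcard expansion of one strictly typed sequence can be sketched as below. This is an illustration, not FragType's implementation; it assumes bond-free sequences with one character per atom type, and `wildcard_variants` is a hypothetical name.

```python
from itertools import product

def wildcard_variants(sequence):
    """All variants with any subset of the non-terminal atoms replaced by '?'."""
    first, inner, last = sequence[0], sequence[1:-1], sequence[-1]
    variants = []
    for mask in product((False, True), repeat=len(inner)):
        body = "".join("?" if hide else atom for atom, hide in zip(inner, mask))
        variants.append(first + body + last)
    return variants

variants = wildcard_variants("ARRRRD")
print(len(variants))               # 16
print(variants[0], variants[-1])   # ARRRRD A????D
```

With four non-terminal atoms there are 2^4 = 16 variants, ranging from the strict sequence to the fully wildcarded one, which carries only the two terminal types and their separation.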
Augmented Atoms…
Branched fragments, representing an atom and (a user-defined number of) its successive coordination spheres
[Figure: example substructure with atom labels H, R, R/A, R, D, A, H]
Strict Typing with Bond Info (-b):
D(-R(*R)*R)(-H(-H)=A)
D(-R(*R)*A)(-H(-H)=A)
Strict Typing, no Bond Info:
D(R(R)R)(H(H)A)
D(R(R)A)(H(H)A)
All but Central and Terminal Atoms may be wildcards (-b -w):
D(-R(*R)*R)(-H(-H)=A)
D(-?(*R)*R)(-H(-H)=A)
D(-?(*R)*A)(-H(-H)=A)
…
“Tree” descriptors have wildcards for all but Central & Terminal:
D(-?(*R)*A)(-?(-H)=A)
…
FragType in a Nutshell
Uses atom typing schemes (symbols, pharmacophore types, electrostatic potential – others to follow), where one atom may represent several types.
Several atom typing schemes are allowed for a molecule (µspecies-specific, weighted by population level at pH)
The ChemAxon API is a versatile atom typing tool.
Generates either sequences or augmented atoms, which may include or ignore bond information.
With the wildcard option, all the generic fragments ignoring one or more atom types are also counted:
Sequences ↔ Fuzzy wildcard sequences ↔ Topological Pairs
Augmented Atoms ↔ Fuzzy branched fragments ↔ Trees
(1) Neighborhood Behavior
Based on a core of 2500 molecules (all the ~200 actives, completed with randomly picked inactives) of a combinatorial library based on an Ugi synthesis.
Experimental screen against 5 proteases (Chymotrypsin, Factor Xa, Trypsin, Tryptase, Urokinase-type Plasminogen Activator)
For each active M (pIC50 ≥ 4.9) of each target T, Local Ascertained Optimality was calculated around M, according to different descriptor spaces {descriptor D, dissimilarity metric S}:
M1   D1-S1:X*   D1-S2:X*   …   D1-Sn:X*   D2-S1:X*   D2-S2:X*   …   Dk-Sn:X*
M2   D1-S1:X*   D1-S2:X*   …   D1-Sn:X*   D2-S1:X*   D2-S2:X*   …   Dk-Sn:X*
M3   D1-S1:X*   D1-S2:X*   …   D1-Sn:X*   D2-S1:X*   D2-S2:X*   …   Dk-Sn:X*
Descriptor Dimension Descriptor Dimension
pairEP28 49 treeSY03 744
pairSY28 110 seqbPH25 751
seqSY25 123 aabPH02 784
pairPH28 169 seqwPH25 1311
aaSY02 249 treePH03 2201
seqEP25 268 seqEP37 2209
seqSY37 293 seqbEP25 2691
aabSY02 358 aaPH03 3409
seqbSY25 363 aabPH03 3716
seqwSY25 385 aaEP02 3785
seqPH25 443 aabEP02 6704
seqwEP25 566 treeEP03 6761
aaPH02 698 aaEP03 41667
Benchmarked Descriptors: “New” and “Classical”
“Classical” descriptors:
Pharmacophore Pairs: CATS (Prof. G. Schneider et al.), ChemAxon PF
Pharmacophore Triplets: pH-sensitive & rule-based
3D Pharmacophore descriptors: LIQUIDS (Prof. G. Schneider et al.)
SEL – subspace of relevant Tryptase QSAR descriptors
DPRED: Predicted Tryptase pIC50
Dissimilarity Metrics
Six dissimilarity metrics were based upon:
Two descriptor rescaling schemes: Z-transformation (Avg/Var rescaling) or no rescaling.
Three distance formulas: Euclidean, Dice, binary block (FDIFF)
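With d_i^m the (optionally rescaled) value of descriptor D_i in molecule m, and N the number of descriptors:
\[ S_{Eucl}(m,M) = \sqrt{\sum_i \left(d_i^m - d_i^M\right)^2} \]
\[ S_{Dice}(m,M) = 1 - \frac{2\sum_i d_i^m d_i^M}{\sum_i \left(d_i^m\right)^2 + \sum_i \left(d_i^M\right)^2} \]
\[ S_{FDIFF}(m,M) = \frac{1}{N}\sum_i \delta_i^{Mm}, \quad \text{where } \delta_i^{Mm} = 1 \text{ if } d_i^m \text{ xor } d_i^M \text{ is non-zero, } 0 \text{ otherwise} \]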
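The three distance formulas can be sketched in Python. The exact expressions are hard to read in the slide text, so this sketch assumes the usual textbook forms of the Euclidean and Dice distances and reads FDIFF as the fraction of descriptors that are populated in exactly one of the two molecules. Any normalisation used by the authors is not reproduced.

```python
import math

def euclidean(m, M):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m, M)))

def dice(m, M):
    """Dice dissimilarity: 1 - 2*sum(a*b) / (sum(a^2) + sum(b^2))."""
    num = 2.0 * sum(a * b for a, b in zip(m, M))
    den = sum(a * a for a in m) + sum(b * b for b in M)
    return 1.0 - num / den if den else 0.0

def fdiff(m, M):
    """ASSUMED reading: fraction of descriptors non-zero in exactly one molecule."""
    differing = sum((a != 0) != (b != 0) for a, b in zip(m, M))
    return differing / len(m)

m, M = [1, 0, 2], [1, 1, 0]
print(euclidean(m, M), dice(m, M), fdiff(m, M))
```

For the toy vectors, the squared differences sum to 5, the Dice term is 1 - 2/7, and two of the three descriptors are populated in only one molecule.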
Formula One Descriptor Grand Prix
Local Optimality Scores, one row per active molecule:
M1   D1-S1:X*   D1-S2:X*   …   D1-Sn:X*   D2-S1:X*   D2-S2:X*   …   Dk-Sn:X*
M2   D1-S1:X*   D1-S2:X*   …   D1-Sn:X*   D2-S1:X*   D2-S2:X*   …   Dk-Sn:X*
M3   D1-S1:X*   D1-S2:X*   …   D1-Sn:X*   D2-S1:X*   D2-S2:X*   …   Dk-Sn:X*
For each descriptor D, the best score over all metrics S is kept: X** = max_S X*
If X**(D1) > X**(D2), then D1 “beats” D2 and receives the better rank:
M1   Rank(D1)   Rank(D2)   …   Rank(Dk)
M2   Rank(D1)   Rank(D2)   …   Rank(Dk)
M3   Rank(D1)   Rank(D2)   …   Rank(Dk)
One Active Molecule = One ‘Grand Prix’ race
#1: 10 points, #2: 6 points, #3 to #6: 4 to 1 points, respectively
One Target = One ‘Grand Prix’ Championship
D     Championship Points   Championship RANK
D1
D2
Dk
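The points scheme (#1: 10, #2: 6, #3 to #6: 4 to 1) can be sketched as follows. This is a minimal illustration with made-up scores; tie handling is not described on the slide and is left to the sort order here. `race_points` and `championship_points` are hypothetical names.

```python
POINTS = [10, 6, 4, 3, 2, 1]  # places #1 to #6; lower places score nothing

def race_points(scores):
    """scores: {descriptor: best optimality score for one active}; higher wins."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    return {d: (POINTS[i] if i < len(POINTS) else 0) for i, d in enumerate(ranking)}

def championship_points(races):
    """Sum the points of every descriptor over all races (actives) of a target."""
    totals = {}
    for scores in races:
        for descriptor, pts in race_points(scores).items():
            totals[descriptor] = totals.get(descriptor, 0) + pts
    return totals

# Two toy races with invented scores for three descriptors.
races = [
    {"D1": 0.30, "D2": 0.20, "D3": 0.10},
    {"D1": 0.10, "D2": 0.40, "D3": 0.20},
]
totals = championship_points(races)
print(totals)  # {'D1': 14, 'D2': 16, 'D3': 10}
```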
Chymotrypsin Championship (12 Actives, i.e. Grand Prix Races)
RANK Descriptor #Gold #Silver #Bronze POINTS Avg. Opt
1 aaSY02 4 2 1 62 0.24
2 seqSY25 2 1 1 42 0.23
3 treeSY03 1 1 1 32 0.23
4 seqSY37 0 2 3 30 0.23
5 aabSY02 0 2 2 29 0.22
6 seqbSY25 1 2 0 27 0.21
7 seqwSY25 0 0 1 8 0.21
8 pairSY28 1 0 0 13 0.2
9 treePH03 2 1 0 26 0.18
10 aabPH02 0 0 1 7 0.17
11 aaPH02 0 0 0 3 0.16
12 aabPH03 0 0 0 1 0.16
13 aaPH03 0 0 0 0 0.16
14 seqbPH25 0 0 0 0 0.14
15 seqPH25 0 0 0 0 0.14
Factor Xa Championship (81 Actives, i.e. Grand Prix Races)
RANK Descriptor #Gold #Silver #Bronze POINTS Avg. Opt
1 DPRED 35 11 8 463 0.37
2 SEL 19 26 5 389 0.33
3 CATS-P1 10 7 11 212 0.29
4 FPT-nopK 1 3 6 104 0.29
5 FPT1 0 1 5 58 0.27
6 treeEP03 2 1 4 89 0.26
7 PF 0 3 4 76 0.26
8 aabPH03 0 2 6 51 0.25
9 seqbEP25 0 4 3 47 0.25
10 aaPH03 0 2 3 47 0.25
11 CATS-P2 0 4 1 43 0.25
12 CATS-R1 0 0 2 42 0.25
13 aabEP02 0 0 0 16 0.25
14 CATS-A1 0 3 2 49 0.24
15 treePH03 2 0 1 37 0.24
Trypsin Championship (3 Actives, i.e. Grand Prix Races)
RANK Descriptor #Gold #Silver #Bronze POINTS Avg. Opt
1 aaPH03 1 0 0 11 0.22
2 treePH03 0 1 1 10 0.22
3 aabPH03 0 0 1 4 0.22
4 treeSY03 0 1 0 10 0.21
5 SEL 1 0 0 10 0.2
6 aabEP02 1 0 0 10 0.19
7 aaEP03 0 1 0 8 0.19
8 treeEP03 0 0 1 5 0.19
9 CATS-P1 0 0 0 3 0.19
10 seqbEP25 0 0 0 2 0.19
11 aaSY02 0 0 0 2 0.19
12 aabPH02 0 0 0 0 0.19
13 aaPH02 0 0 0 0 0.19
14 aaEP02 0 0 0 3 0.18
15 CATS-P2 0 0 0 0 0.18
Tryptase Championship (100 Actives, i.e. Grand Prix Races)
RANK Descriptor #Gold #Silver #Bronze POINTS Avg. Opt
1 DPRED 63 3 1 656 0.35
2 treeSY03 2 16 10 196 0.26
3 aabSY02 0 16 12 188 0.26
4 aaSY02 2 5 15 170 0.26
5 aabEP02 1 8 5 100 0.25
6 treeEP03 2 5 6 97 0.25
7 pairPH28 0 5 6 83 0.25
8 seqwPH25 1 2 3 74 0.24
9 seqPH25 3 1 2 65 0.24
10 aaPH02 1 3 0 46 0.24
11 treePH03 4 0 0 43 0.24
12 aabPH02 0 1 2 31 0.24
13 SEL 0 2 5 46 0.23
14 seqbPH25 1 1 3 44 0.23
15 seqbSY25 1 0 2 39 0.23
UPA Championship (11 Actives, i.e. Grand Prix Races)
RANK Descriptor #Gold #Silver #Bronze POINTS Avg. Opt
1 SEL 3 0 0 32 0.28
2 CATS-P1 4 0 0 43 0.26
3 CATS-P2 0 3 2 26 0.26
4 CATS-A1 0 1 2 17 0.24
5 CATS-P3 0 0 0 8 0.24
6 FPT-nopK 2 1 0 30 0.23
7 treeEP03 1 0 0 15 0.23
8 CATS-P4 0 0 0 6 0.23
9 aabEP02 1 1 0 16 0.22
10 aaEP03 0 0 1 5 0.22
11 treePH03 0 0 0 2 0.22
12 CATS-R2 0 0 0 2 0.22
13 CATS-A2 0 0 0 2 0.22
14 treeSY03 0 1 0 9 0.21
15 CATS-R1 0 0 1 7 0.21
Overall Ranking…
Columns: Descriptor; Ranks with Targets, from best to worst; Average Rank; Rank Variance (the tabulated values equal the population standard deviation of the five ranks)
treeSY03 2 3 4 14 18 8.20 6.52
treeEP03 6 6 7 8 21 9.60 5.75
treePH03 2 9 11 11 15 9.60 4.27
aabEP02 5 6 9 13 19 10.40 5.12
SEL 1 2 5 13 37 11.60 13.38
aabPH03 3 8 12 17 19 11.80 5.84
aaPH03 1 10 13 18 20 12.40 6.71
aaSY02 1 4 11 21 32 13.80 11.41
aaEP03 7 10 16 25 26 16.80 7.68
aabSY02 3 5 16 31 31 17.20 12.11
aaEP02 14 17 17 19 20 17.40 2.06
seqbEP25 9 10 16 18 34 17.40 8.98
FPT-nopK 4 6 23 28 28 17.80 10.63
aabPH02 10 12 12 20 43 19.40 12.29
aaPH02 10 11 13 23 44 20.2 12.77
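The last two columns can be recomputed from the five per-target ranks. In this short check, the "Rank Variance" figures are reproduced by the population standard deviation, not by the variance itself.

```python
from statistics import mean, pstdev

# Per-target ranks copied from three rows of the table.
ranks = {
    "treeSY03": [2, 3, 4, 14, 18],
    "treeEP03": [6, 6, 7, 8, 21],
    "SEL": [1, 2, 5, 13, 37],
}
summary = {name: (round(mean(r), 2), round(pstdev(r), 2)) for name, r in ranks.items()}
for name, (avg, spread) in summary.items():
    print(name, avg, spread)
# treeSY03 8.2 6.52
# treeEP03 9.6 5.75
# SEL 11.6 13.38
```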
(2) QSAR – External Validation
logP SQS linear consensus models:
Trained on 3225 molecules
Validated on 9677 compounds from the PhysProp database
Descriptor R2
aaSY02 0.8188
treePH03 0.8148
aaPH02 0.8109
treeSY03 0.7987
seqPH25 0.7891
seqSY37 0.7245
pairSY28 0.6981
pairPH28 0.6788
treeEP03 0.1169
aaEP02 0.0393
seqEP37 -0.1632
seqwEP25 -0.9547
pairEP28 -1.3567
seqEP25 -2.0424
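The negative entries are easier to read with the definition in hand. The slide does not give the formula; assuming R2 is the determination coefficient 1 - SS_res/SS_tot evaluated on the external set, it drops below zero as soon as the model predicts worse than the mean of the observed values. The data below are invented for illustration.

```python
def r_squared(observed, predicted):
    """Determination coefficient 1 - SS_res/SS_tot (ASSUMED definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0]
print(r_squared(observed, [1.0, 2.0, 3.0]))  # 1.0  (perfect prediction)
print(r_squared(observed, [3.0, 2.0, 1.0]))  # -3.0 (worse than predicting the mean)
```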
QSAR – External Validation
hERG categorical consensus models:
Trained on 562 molecules (courtesy T. Oprea, UNM)
Validated on 1889 PubChem molecules
Columns: Descriptor; well-classified INACTIVES (fraction, number out of 1698); well-classified ACTIVES (fraction, number out of 191); Balanced Accuracy
aaSY02 0.65 1104 0.72 137 0.68
seqPH25 0.62 1058 0.73 140 0.68
treePH03 0.76 1288 0.59 113 0.68
aaPH03 0.68 1150 0.67 128 0.67
seqPH37 0.7 1192 0.63 121 0.67
seqSY25 0.64 1095 0.69 132 0.67
treeSY03 0.66 1125 0.67 128 0.67
aaPH02 0.69 1164 0.63 121 0.66
pairEP28 0.55 927 0.76 146 0.66
pairPH28 0.63 1078 0.68 130 0.66
seqSY37 0.69 1170 0.6 115 0.65
pairSY28 0.62 1053 0.66 127 0.64
seqEP37 0.88 1492 0.25 47 0.56
seqEP25 0.96 1624 0.08 15 0.52
treeEP03 0.95 1615 0.09 17 0.52
aaEP02 0.95 1613 0.04 7 0.49
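The Balanced Accuracy column is the mean of the two well-classified fractions, which can be checked against the tabulated counts:

```python
def balanced_accuracy(inactives_ok, n_inactives, actives_ok, n_actives):
    """Mean of the fractions of correctly classified inactives and actives."""
    return 0.5 * (inactives_ok / n_inactives + actives_ok / n_actives)

# Counts from the aaSY02 and aaEP02 rows (1698 inactives, 191 actives).
ba_best = balanced_accuracy(1104, 1698, 137, 191)
ba_worst = balanced_accuracy(1613, 1698, 7, 191)
print(round(ba_best, 2), round(ba_worst, 2))  # 0.68 0.49
```

The aaEP02 row shows why a balanced measure is used here: 95% of the inactives are recognised, yet the model is no better than chance because it misses nearly all actives.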
Conclusions
Pharmacophore-Colored Tree descriptors seem to be the most versatile ones.
they score well in both NB tests – against other colored fragments, but also against other pharmacophore terms,
they also score well in the two QSAR studies, against other colored fragments
Various symbol- and pharmacophore-colored Augmented Atoms, Sequences and Pairs were also quite successful in QSAR and reasonably steady in NB tests.
Electrostatic potential-colored descriptors failed in QSARs, but some were useful NB monitors. Why?
Molecular Similarity & Neighborhood Behavior…
• In chemoinformatics, molecular dissimilarity is a
metric (distance) S(m,M) between the points m and
M representing compounds in a descriptor space (DS).
• The concept of Neighborhood Behavior* (NB) in a DS
is the quantitative equivalent (of statistical nature) of
the Similarity Principle:
– If the probability of picking a pair of compounds with similar
activity levels increases with decreasing S(m,M), then this
space and its metric are said to display significant NB with
respect to the considered activity.
* Patterson, D.E., Cramer, R.D., Ferguson, A.M., Clark, R.D., Weinberger, L.E., Neighborhood Behavior: A Useful Concept for Validation of “Molecular Diversity” Descriptors, J. Med. Chem. 1996, 39, 3049-3059
The Similarity Principle
[Figure: scatter plot of molecule pairs (M,m). x-axis: Calculated Structural Dissimilarity S(m,M); y-axis: Property Dissimilarity. Regions: True Positives (TP), False Positives (FP), True Negatives (TN), Potentially (!) False Negatives (FN).]
[Figure: the same molecule pairs ordered by Some Random Ranking Criterion for pairs (m,M). Legend: pairs with different properties, L(m,M)=|P(m)-P(M)| ≥ l; pairs with similar properties, L(m,M)=|P(m)-P(M)| < l.]
Unfortunately, there is no Absolute Similarity Scale, nor a “Quantum of Chemical Change”
Pharmacophore Sequences
Nr.   Sequence   Count in M1   Count in M2   Count in M3
1     HDH        1             1             0
2     DHA        1             1             0
3     HHD        1             1             0
4     HHA        1             1             1
5     HHH        4             4             4
6     RHH        2             2             2
7     RRH        4             6             4
8     RRR        6             6             6
9     RRA        2             2             2
10    RAH        2             2             2
11    HAH        1             1             1
12    HDHA       1             1             0
....
The Optimality Index W
\[ W(s) = \frac{N_{FP}(s) + N_{FN}(s)}{N^{E}_{FP}(s) + N^{E}_{FN}(s)} \]
Pairs are classified by their structural dissimilarity S(M,m) against the cutoff s, and by their activity (profile) difference L(M,m) against the threshold l:
                L(M,m) ≤ l                  L(M,m) > l
S(M,m) ≤ s      True Positives (TP)         False Positives (FP)
S(M,m) > s      False (?) Negatives (FN)    True Negatives (TN)
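The four pair categories, and the ratio forming the optimality index, can be sketched as below. This is a minimal illustration with invented pair data. The reference counts with superscript E are not defined in the slide text, so the sketch takes them as caller-supplied inputs instead of guessing how they are obtained.

```python
def pair_counts(pairs, s, l):
    """pairs: iterable of (S, L) values, i.e. structural dissimilarity and
    activity difference of each molecule pair. Returns TP/FP/FN/TN counts."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for dissimilarity, activity_diff in pairs:
        selected = dissimilarity <= s   # pair retrieved as "structurally similar"
        similar = activity_diff <= l    # pair actually has similar activity
        if selected:
            counts["TP" if similar else "FP"] += 1
        else:
            counts["FN" if similar else "TN"] += 1
    return counts

def optimality(counts, ref_fp, ref_fn):
    """W = (N_FP + N_FN) / (N_FP^E + N_FN^E); reference counts supplied by caller."""
    return (counts["FP"] + counts["FN"]) / (ref_fp + ref_fn)

pairs = [(0.1, 0.2), (0.2, 1.5), (0.8, 0.1), (0.9, 2.0)]
counts = pair_counts(pairs, s=0.5, l=0.5)
print(counts)                     # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 1}
print(optimality(counts, 2, 2))   # 0.5
```

Lower W means fewer misclassified pairs relative to the reference, so a descriptor space with good neighborhood behavior gives small W at some cutoff s.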
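Global & Local Optimality
Global optimality*, counted over all pairs:
\[ W_G(s) = \left. \frac{N_{FP}(s) + N_{FN}(s)}{N^{E}_{FP}(s) + N^{E}_{FN}(s)} \right|_{\text{all pairs}} \]
Local optimality around M, counted over the pairs (M,m) for all m. This corresponds to similarity-based Virtual Screening (VS) with query M (active molecule):
\[ W_M(s) = \left. \frac{N_{FP}(s) + N_{FN}(s)}{N^{E}_{FP}(s) + N^{E}_{FN}(s)} \right|_{\text{pairs } (M,m) \text{ for all } m} \]
* For binding affinities, pairs of inactives should be ignored….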
The Ascertained Optimality Excess X
[Figure: W plotted against the fraction of compound pairs selected at cutoff s, for meaningful S values and for random S values. X(s) measures how far W(s) lies from the random case, and is obtained from W(s), W_rand(s) and Var(W_rand(s)).]