ChemAxon’s Chemical Fingerprints-Based Clustering to Assess AurSCOPE Databases Chemical Diversity
The Aureus Pharma System
• Knowledge Base
• Integration Platform
• Query Interface
• Analysis/Display Applications
AurSCOPE Statistics: March 2006
Database                      Publications                        Activities   Ligands
GPCR                          17 250 (including 3 525 patents)    635 000      152 300
Ion Channel                   7 100 (including 2 685 patents)     217 600      58 400
Kinase                        2 565 (including 1 069 patents)     163 700      51 800
ADME/Drug-Drug Interactions   6 530                               179 000      9 100 (parent compounds + metabolites)
HERG                          800                                 14 300       3 530
AurQUEST: Query management software for AurSCOPE
• Web-based application integrating ChemAxon technology
• Powerful Query Builder
- Biological and Chemical Queries
- Structural search using ChemAxon tools
• Efficient Navigation
• Different Export Formats (SDF, RDF, …)
Data Preprocessing
[Workflow diagram, steps 1 to 4: from the AurSCOPE database to 2D unique structures]
Filters shown in the diagram:
• Counterions
• MW > 700
• Inorg
• NAS
• Stereo-duplicates
• Identical mol. but different salts…
AurSCOPE Ion Channels: Retrieving Active Molecules
• Protocols: Binding or Electrophysiology
• Target: All
• Target type: Wild
• Parameter filter: Ki, EC50, IC50 < 300 nM
Result: 11519 molecules(*) (9897 unique)
(*) November 2005
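The query criteria above can be read as a simple record filter. The sketch below is illustrative only: the `Activity` record, its field names, and the sample values (including the "Mutant" and "Functional" labels) are hypothetical and do not reflect the actual AurSCOPE schema or the AurQUEST query builder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical activity record; field names are assumptions."""
    molecule_id: str
    protocol: str      # e.g. "Binding" or "Electrophysiology"
    target_type: str   # e.g. "Wild"
    parameter: str     # e.g. "Ki", "EC50", "IC50"
    value_nm: float    # measured value in nM


ALLOWED_PROTOCOLS = {"Binding", "Electrophysiology"}
ALLOWED_PARAMETERS = {"Ki", "EC50", "IC50"}
THRESHOLD_NM = 300.0


def is_active(a: Activity) -> bool:
    """Apply the slide's criteria: protocol, wild-type target, parameter, < 300 nM."""
    return (
        a.protocol in ALLOWED_PROTOCOLS
        and a.target_type == "Wild"
        and a.parameter in ALLOWED_PARAMETERS
        and a.value_nm < THRESHOLD_NM
    )


def active_molecules(records):
    """Return the set of molecule ids with at least one qualifying activity."""
    return {r.molecule_id for r in records if is_active(r)}


# Made-up sample data for illustration.
records = [
    Activity("M1", "Binding", "Wild", "Ki", 12.0),
    Activity("M1", "Electrophysiology", "Wild", "IC50", 450.0),
    Activity("M2", "Binding", "Mutant", "Ki", 5.0),
    Activity("M3", "Binding", "Wild", "IC50", 300.0),   # not strictly below 300 nM
    Activity("M4", "Functional", "Wild", "EC50", 10.0),
]
print(sorted(active_molecules(records)))  # ['M1']
```

A molecule is kept as soon as one of its activities qualifies, which is one possible reading of "active molecule"; the original query may aggregate differently.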
AurSCOPE Ion Channels: Activity Distribution
[Bar chart; y-axis from 0 to 4000. Categories on the x-axis: GABA, Nicotinic Acetylcholine receptor, 5HT3, NMDA, Calcium Channel, Potassium Channel, AMPA/KA, Vanilloid receptor, Sodium Channel, Ryanodine receptor, P2X, IP3, Acid Sensing Ion Channel, Glycine receptor, Chloride Channel]
Encoding Chemical Space and Clustering
1. Standardization of molecules.
2. Generating Chemical Fingerprints (CF).
3. Optimization of different CF parameters.
4. CF-based Jarvis-Patrick clustering with various adjusted parameters.
Parameters for Generating Hashed Chemical Fingerprints
• Fingerprint length
- The number of bits in the bit string.
- A longer fingerprint increases the capacity for storing information on molecules.
• Maximum pattern length
- The maximum length of the linear paths of atoms that are considered during the fragmentation of the molecule. (The length of cyclic patterns is not limited.)
- Longer and more numerous patterns hold more information on the molecule.
• Bits to be set for patterns
- After a pattern is detected, some bits of the bit string are set to "1". The number of bits used to code each pattern is constant.
- A higher number of bits increases the information coded from a pattern.
• Darkness of the fingerprint
- The percentage of "1" digits in the bit string. Fingerprints with more ones are considered "darker" than those with fewer ones.
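To make the four parameters concrete, here is a minimal toy sketch of a hashed path fingerprint. It is not ChemAxon's implementation: the molecule representation (atom symbols plus bond index pairs), the SHA-256 based hashing, and all function names are assumptions chosen for illustration, and bond orders and cyclic patterns are ignored.

```python
import hashlib


def linear_paths(atoms, bonds, max_bonds):
    """Return the set of canonical strings for simple paths with 0..max_bonds bonds."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    patterns = set()

    def extend(path):
        labels = [atoms[i] for i in path]
        forward = "-".join(labels)
        backward = "-".join(reversed(labels))
        patterns.add(min(forward, backward))  # same pattern in both directions
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return patterns


def bit_positions(pattern, length, bits_per_pattern):
    """Map one pattern to a constant number of (possibly colliding) bit positions."""
    positions = []
    for k in range(bits_per_pattern):
        digest = hashlib.sha256(f"{k}:{pattern}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % length)
    return positions


def fingerprint(atoms, bonds, length=2048, max_bonds=7, bits_per_pattern=5):
    """Build the bit string by setting the bits of every detected pattern."""
    bits = [0] * length
    for pattern in linear_paths(atoms, bonds, max_bonds):
        for pos in bit_positions(pattern, length, bits_per_pattern):
            bits[pos] = 1
    return bits


def darkness(bits):
    """Percentage of bits set to 1."""
    return 100.0 * sum(bits) / len(bits)


# Ethanol heavy atoms as a toy example: C-C-O.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
fp = fingerprint(atoms, bonds)
print(sorted(linear_paths(atoms, bonds, 7)))
print(f"darkness at 2048 bits: {darkness(fp):.2f}%")
print(f"darkness at 64 bits:   {darkness(fingerprint(atoms, bonds, length=64)):.2f}%")
```

Shortening the bit string while keeping the same patterns raises the darkness, and raising `max_bonds` or `bits_per_pattern` sets more bits, which is the trade-off the parameter list describes.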
Chemical Fingerprints: Effect of Parameters
FP length   Max #bonds   Max #bits   Aver. Darkness (%)   Max. Darkness (%)
512         7            3           68.5                 97.5
512         7            4           82.2                 99.4
512         7            5           84.9                 99.4
512         8            3           76.1                 99.2
512         8            4           87.7                 99.4
512         8            5           89.8                 99.4
1024        7            3           46.1                 83.3
1024        7            4           61.5                 94.8
1024        7            5           65.5                 95.9
1024        8            3           54.8                 91.9
1024        8            4           70.2                 98.5
1024        8            5           73.8                 98.9
2048        7            3           26.8                 58.6
2048        7            4           39.1                 78.6
2048        7            5           42.4                 81.6   (row emphasized on the slide)
2048        8            3           33.4                 73.7
2048        8            4           47.5                 89.6
2048        8            5           50.9                 91.6
CF-based Jarvis-Patrick Clustering
1. For each structure, collect the set of nearest neighbors whose dissimilarity (distance) is less than a threshold value T.
2. Two structures cluster together if:
- they are in each other's list of nearest neighbors, and
- they have at least Rmin of their nearest neighbors in common, where Rmin is a ratio of the length of the shorter list.
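The two rules can be sketched in a few lines of Python. This is a minimal illustration, not JKlustor's code, and it rests on several assumptions: fingerprints are given as sets of on-bit indices, the distance is one minus the Tanimoto similarity, each structure is counted in its own neighbor list, and linked structures are merged transitively into clusters.

```python
from itertools import combinations


def tanimoto_dissimilarity(a, b):
    """1 - |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def jarvis_patrick(fingerprints, threshold, r_min):
    """Variable-length Jarvis-Patrick clustering; returns lists of indices."""
    n = len(fingerprints)

    # Rule 1: neighbor lists hold every structure closer than the threshold T.
    # Assumption: a structure belongs to its own list.
    neighbors = [{i} for i in range(n)]
    for i, j in combinations(range(n), 2):
        if tanimoto_dissimilarity(fingerprints[i], fingerprints[j]) < threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Union-find to merge linked structures into clusters.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Rule 2: mutual neighbors sharing at least r_min of the shorter list.
    for i in range(n):
        for j in neighbors[i]:
            if j <= i:
                continue
            shared = len(neighbors[i] & neighbors[j])
            shorter = min(len(neighbors[i]), len(neighbors[j]))
            if shared >= r_min * shorter:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


# Made-up fingerprints: two groups of near-duplicates and one outlier.
fps = [
    set(range(0, 20)),
    set(range(0, 21)),
    set(range(1, 21)),
    set(range(100, 120)),
    set(range(100, 119)),
    {500, 501},
]
print(jarvis_patrick(fps, threshold=0.15, r_min=0.3))
# [[0, 1, 2], [3, 4], [5]]
```

The pairwise loop is quadratic in the number of structures, which is acceptable for a sketch but would need a faster neighbor search at the scale of thousands of molecules.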
CF-based Jarvis-Patrick Clustering
Chemical fingerprint length in bits: 2048
Maximum number of bonds in patterns: 7
Maximum number of bits to set for each pattern: 5
T      Rmin   # Clusters   # Singletons
0.15   0.2    932          1663
0.15   0.3    938          1663
0.15   0.4    945          1663
0.15   0.5    977          1663
0.16   0.3    865          1499
0.16   0.5    910          1500
0.17   0.3    819          1372
0.17   0.5    860          1373
0.18   0.3    787          1238
0.18   0.5    826          1238
0.19   0.3    752          1140
0.19   0.5    780          1141
0.20   0.3    722          1051
0.20   0.5    752          1051
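Cluster and singleton counts such as those tabulated for each (T, Rmin) pair can be derived from a list of cluster labels. The sketch below is an assumption-laden illustration: the function name and the label format are hypothetical, and it counts multi-member clusters separately from singletons, which may or may not match how the "# Clusters" column was computed.

```python
from collections import Counter


def cluster_summary(labels):
    """Summarize a clustering given one cluster label per structure."""
    sizes = Counter(labels)
    ranked = sorted(sizes.values(), reverse=True)   # largest cluster first
    singletons = sum(1 for size in ranked if size == 1)
    return {
        "clusters": len(ranked) - singletons,       # clusters with 2+ members
        "singletons": singletons,
        "ranked_sizes": ranked,
    }


# Made-up labels for seven structures.
labels = ["a", "a", "a", "b", "b", "c", "d"]
summary = cluster_summary(labels)
print(summary)
# {'clusters': 2, 'singletons': 2, 'ranked_sizes': [3, 2, 1, 1]}
```

Running such a summary once per parameter pair gives a quick view of how the threshold and the shared-neighbor ratio shift structures between clusters and singletons.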
CF-based Jarvis-Patrick Clustering: Similarity threshold = 0.85(*)
[Chart: y-axis from 0 to 350; x-axis from 1 to 902; axis label "size"]
[Chart: y-axis from 0 to 400; x-axis from 1 to 49]
(*) Martin Y.C. et al. Do structurally similar molecules have similar biological activity? J. Med. Chem. 2002, 45, 4350-4358.
Most Populated Clusters
Jarvis-Patrick Clustering: Misclassifications?
Jarvis-Patrick Clustering: Diverse Singletons
Most Populated Clusters: Biological "Projection"
[Figure: target annotations for the most populated clusters, in slide order]
Gamma aminobutyric acid A receptor; Voltage-gated calcium channel
Nicotinic acetylcholine receptor; Gamma aminobutyric acid A receptor
Gamma aminobutyric acid A receptor; Nicotinic acetylcholine receptor; Gamma aminobutyric acid A receptor
Gamma aminobutyric acid A receptor; Potassium channel; Gamma aminobutyric acid A receptor; Voltage-gated calcium channel
5-HT3; Nicotinic acetylcholine receptor; Gamma aminobutyric acid A receptor
Conclusions
• JKlustor integrates computationally rapid and efficient clustering tools.
• Shortcomings remain to be addressed, in particular artificial singletons.
• Future work: combination with the Maximum Common Substructure approach (LibMCS).
• Other algorithms (Ward,…)