NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space ICCS9
Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]
[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
Chemistry Space Analysis
• how many small molecules are there currently?
• since the early 2000s: enormous increase in the number of databases containing small molecules, e.g. PubChem, ChemSpider, ChEMBL, DrugBank. What is the overlap?
• many ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms)
• growing number of chemical structure identifiers (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, …)
chemical structure
Chemical Identifier Resolver
NCI/CADD Identifiers
InChI/InChIKey
ChemSpider ID
PubChem SID/CID
chemical names
CAS Registry Number
NSC number
FDA UNII
ChemNavigator SID
SMILES
SD File
Chemical Formula
ChEBI ID
PDB Ligand ID
MRV
CML
SYBYL Line Notation
GIF image
http://cactus.nci.nih.gov/chemical/structure
Works as a resolver for different chemical structure identifiers. Allows one to convert a given structure identifier into another representation or structure identifier.
NCI/CADD Web Resources: Chemical Identifier Resolver
first beta release: July 2009
current release (beta 4): April 2011
NCI/CADD Web Resources: Chemical Identifier Resolver
• it is usable via a simple URL API:
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas
returns: 204255-11-8 (MIME type: text/plain)
• XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml
• if a request is not resolvable: HTTP 404 status message
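The URL scheme on this slide can be exercised with a few lines of Python. This is a minimal sketch, not an official client: the function names `build_url` and `resolve` are made up for illustration, percent-encoding the identifier is an assumption about what the service accepts, and returning `None` on HTTP 404 is just one way to represent "not resolvable".

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as given on the slide.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier, representation, xml=False):
    """Assemble <base>/<identifier>/<representation>[/xml].

    Percent-encoding the identifier is an assumption made for this sketch.
    """
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return url + "/xml" if xml else url


def resolve(identifier, representation, timeout=10.0):
    """Return the text/plain response body, or None if the request is not resolvable."""
    try:
        with urlopen(build_url(identifier, representation), timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as err:
        if err.code == 404:  # the slide states that unresolvable requests yield HTTP 404
            return None
        raise
```

Calling `resolve("Tamiflu", "cas")` would issue the example request from the slide; it needs network access and a reachable service.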
NCI/CADD Public Web Resources: Chemical Identifier Resolver
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"

"identifier" types handled by the resolver:
• chemical names
• IUPAC names (by OPSIN)
• CAS numbers
• SMILES strings
• IUPAC InChI/InChIKeys
• NCI/CADD Identifiers
• CACTVS HASHISY
• NSC number
• PubChem SID
• ChemSpider ID
• ChemNavigator SID
• FDA UNII

"representation" methods:
• /smiles
• /names, /iupac_name
• /cas
• /inchi, /stdinchi
• /inchikey, /stdinchikey
• /ficts, /ficus, /uuuuu
• /image
• /file, /sdf
• /mw, /monoisotopic_mass
• /formula
• /twirl, /3d
• /urls
• /chemspider_id
• /pubchem_sid
• /chemnavigator_sid
NCI/CADD Web Resources: Chemical Identifier Resolver
[Flow chart: processing of a request]
• http request with "identifier" and "representation"
• detection of the identifier type
• identifier is a full structure representation (e.g. SMILES, InChI) -> structure -> calculation of the requested structure representation with CACTVS (e.g. InChI, GIF image)
• identifier is a hashed structure representation (e.g. InChIKey), trivial name etc. -> database lookup in the NCI/CADD Chemical Structure Database (CSDB) (e.g. CAS number, chemical name)
• http response with the MIME type of the requested representation
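The "detection of the identifier type" step could look roughly like the following. This is a hypothetical sketch, not the resolver's actual logic: the regular expressions, the type names, and the routing labels are all assumptions chosen for illustration, and anything that matches no pattern is simply left as "name_or_smiles".

```python
import re

# Hypothetical patterns; the real resolver's detection rules are not given on the slides.
PATTERNS = [
    ("inchi", re.compile(r"^InChI=1S?/")),
    ("inchikey", re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")),
    ("ncicadd_identifier",
     re.compile(r"^[0-9A-F]{16}-(FICTS|FICuS|uuuuu)(-\d{2}-[0-9A-F]{2})?$")),
    ("cas_number", re.compile(r"^\d{2,7}-\d{2}-\d$")),
]

# Full structure representations can be processed directly; hashed or
# registry-style identifiers need a database lookup (as on the flow chart).
ROUTES = {
    "inchi": "calculate",
    "inchikey": "lookup",
    "ncicadd_identifier": "lookup",
    "cas_number": "lookup",
}


def detect_type(identifier):
    """Return the first matching identifier type, else 'name_or_smiles'."""
    for type_name, pattern in PATTERNS:
        if pattern.match(identifier):
            return type_name
    return "name_or_smiles"


def route(identifier):
    """Return 'calculate', 'lookup', or 'undecided' for unmatched input."""
    return ROUTES.get(detect_type(identifier), "undecided")
```

Telling a SMILES string from a chemical name needs a real parser, which is why the sketch leaves that case undecided.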
Chemical Identifier Resolver: Chemical Structure Database (CSDB)
• ChemNavigator iResearch Library: compilation of commercially available screening compounds from ~300 international chemistry suppliers
• PubChem database: including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider, …
• Commercial sources / others: Asinex, Comgenex, eMolecules, ChEMBL, …
currently:
• ~150 chemical structure databases
• ~120 million structure records
• ~81.6 million unique structures by NCI/CADD FICuS Identifier
• ~84 million unique structures by Std. InChIKey
[Pie chart: ChemNavigator iResearch Library ~56%, PubChem ~38%, others ~6%]
NCI/CADD Structure Identifiers
FICTS, FICuS, uuuuu
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
• based on hashcodes calculated by the chemoinformatics toolkit CACTVS
• CACTVS hashcodes represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned):
  - high sensitivity to structural features of a compound
  - change if connectivity changes
[Figure: example structure with its hashcode 9850FD9F9E2B4E25]
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
• calculation of a set of parent structures with different sensitivity to chemical features
• representation of chemical structures on different levels
[Workflow diagram]
original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB)
-> structure normalization
-> parent structure (FICTS, FICuS, uuuuu)
-> hashcode calculation (E_HASHISY)
-> NCI/CADD Identifier
output: SDF, SMILES, database
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
• adjustable levels of sensitivity; each structural feature is treated as either sensitive or un-sensitive:
  - Fragments, un-sensitive: keep only largest organic fragment
  - Isotopes, un-sensitive: ignore isotope labels
  - Charges, un-sensitive: uncharge
  - Tautomers, un-sensitive: find canonical tautomer
  - Stereochemistry, un-sensitive: discard stereo information
[Figure: for each feature, a pair of example structures illustrating the sensitive and the un-sensitive treatment]
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
FICTS: representation of the exact drawing
  F (Fragments): sensitive
  I (Isotopes): sensitive
  C (Charges): sensitive
  T (Tautomers): sensitive
  S (Stereochemistry): sensitive
[Figure: the example structure pairs for all five features are marked as different (≠)]
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
FICuS: comes closest to how a chemist perceives a compound
  F (Fragments): sensitive
  I (Isotopes): sensitive
  C (Charges): sensitive
  u (Tautomers): un-sensitive
  S (Stereochemistry): sensitive
[Figure: the tautomer pair is marked as identical (=); the pairs for the other four features are marked as different (≠)]
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
uuuuu: closely related forms of the same compound
  u (Fragments): un-sensitive
  u (Isotopes): un-sensitive
  u (Charges): un-sensitive
  u (Tautomers): un-sensitive
  u (Stereochemistry): un-sensitive
[Figure: the example structure pairs for all five features are marked as identical (=)]
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures

identifier   Fragments       Isotopes        Charges         Tautomers       Stereo
FICTS        sensitive       sensitive       sensitive       sensitive       sensitive
FICuS        sensitive       sensitive       sensitive       not sensitive   sensitive
uuuuu        not sensitive   not sensitive   not sensitive   not sensitive   not sensitive

format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>
[Figure: example structure (sodium salt) with its three identifiers]
4A122D094098B50D-FICTS-01-1D
0E26B623DF7FAD30-FICuS-01-70
9850FD9F9E2B4E25-uuuuu-01-27
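The identifier format `<CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>` can be split mechanically. The sketch below is an illustration only: the class and function names are made up, the field widths (two-digit version, two-hex-digit checksum) are inferred from the three examples on the slide, and the checksum is not verified because the slides do not describe how it is computed.

```python
import re
from typing import NamedTuple

FEATURES = ("Fragments", "Isotopes", "Charges", "Tautomers", "Stereochemistry")

# Field widths are inferred from the slide examples (an assumption).
_PATTERN = re.compile(r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})$")


class NciCaddIdentifier(NamedTuple):
    hashcode: str   # 16-digit hexadecimal CACTVS hashcode (E_HASHISY)
    tag: str        # FICTS, FICuS, or uuuuu
    version: str
    checksum: str   # kept as-is; not validated in this sketch

    def sensitive_features(self):
        """A 'u' at a tag position marks that feature as un-sensitive."""
        return tuple(f for f, letter in zip(FEATURES, self.tag) if letter != "u")


def parse_identifier(text):
    """Split an identifier string into its four fields or raise ValueError."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not an NCI/CADD identifier: {text!r}")
    return NciCaddIdentifier(*match.groups())
```

For example, `parse_identifier("0E26B623DF7FAD30-FICuS-01-70").sensitive_features()` lists every feature except Tautomers.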
[Figure: histidine and nine closely related structure drawings of it: charged form, tautomer, isotope (15N label), salt (sodium), stereoisomers, and "errors"]
FICTS
[Figure: the histidine variants (charged form, tautomer, isotope, salt, stereoisomers, "errors") labeled with their FICTS identifiers]
identifiers shown:
A3DAE0788050DDE4-FICTS
E5F83F10C5DB080A-FICTS (shown twice)
B2FDA68AEDA06DB9-FICTS
9850FD9F9E2B4E25-FICTS (shown twice)
E92E4BA2869F3611-FICTS
8A7AD1EB498CC76A-FICTS
6C16DE2351F9FF50-FICTS
FICuS
[Figure: the histidine variants (charged form, tautomer, isotope, salt, stereoisomers, "errors") labeled with their FICuS identifiers]
identifiers shown:
A3DAE0788050DDE4-FICuS
E5F83F10C5DB080A-FICuS (shown twice)
B2FDA68AEDA06DB9-FICuS
9850FD9F9E2B4E25-FICuS (shown three times)
E92E4BA2869F3611-FICuS
8A7AD1EB498CC76A-FICuS
uuuuu
[Figure: the histidine variants (charged form, tautomer, isotope, salt, stereoisomers, "errors") labeled with their uuuuu identifiers]
identifiers shown:
9850FD9F9E2B4E25-uuuuu (shown eight times)
9850FD9F9E2B4E25-FICuS (shown once)
Std. InChIKey
[Figure: the histidine variants (charged form, tautomer, isotope, salt, stereoisomers, "errors") labeled with their Std. InChIKeys]
InChIKeys shown:
HNDVDQJCIGZPNO-UHFFFAOYSA-N (shown four times)
HNDVDQJCIGZPNO-CDYZYAPPSA-N
HNDVDQJCIGZPNO-RXMQYKEDSA-N
HNDVDQJCIGZPNO-YFKPBYRVSA-N
UHPNKBYGGMJTIM-UHFFFAOYSA-M (shown twice)
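To make the comparison across the four histidine slides concrete, the labels can be counted. The sketch below only tallies the identifier strings that appear on the slides (nine labels per slide); for the NCI/CADD identifiers it uses just the 16-digit hashcode part, and it makes no claim about which label belongs to which drawing.

```python
# Labels as they appear on the four histidine slides (nine per slide).
labels = {
    "FICTS": [
        "A3DAE0788050DDE4", "E5F83F10C5DB080A", "B2FDA68AEDA06DB9",
        "9850FD9F9E2B4E25", "E5F83F10C5DB080A", "E92E4BA2869F3611",
        "8A7AD1EB498CC76A", "6C16DE2351F9FF50", "9850FD9F9E2B4E25",
    ],
    "FICuS": [
        "A3DAE0788050DDE4", "E5F83F10C5DB080A", "B2FDA68AEDA06DB9",
        "9850FD9F9E2B4E25", "E5F83F10C5DB080A", "E92E4BA2869F3611",
        "8A7AD1EB498CC76A", "9850FD9F9E2B4E25", "9850FD9F9E2B4E25",
    ],
    "uuuuu": ["9850FD9F9E2B4E25"] * 9,
    "Std. InChIKey": [
        "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "HNDVDQJCIGZPNO-CDYZYAPPSA-N",
        "HNDVDQJCIGZPNO-RXMQYKEDSA-N", "HNDVDQJCIGZPNO-YFKPBYRVSA-N",
        "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
        "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
        "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
    ],
}

# Number of distinct identifiers assigned to the nine drawings at each level.
distinct = {level: len(set(ids)) for level, ids in labels.items()}
print(distinct)  # {'FICTS': 7, 'FICuS': 6, 'uuuuu': 1, 'Std. InChIKey': 5}
```

The fewer distinct values a level produces, the more of the nine drawings it treats as the same compound.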
NCI/CADD Chemical Structure Database: Structure Normalization
[Diagram: original structure records map to FICTS parent structures, these to FICuS parent structures (tautomer-invariant), and these to uuuuu parent structures; several records can share one parent structure]
119.8 million original structure records in CSDB
-> 83.1 million FICTS parent structures
-> 81.6 million FICuS parent structures (tautomer-invariant)
-> 76.2 million uuuuu parent structures
Tautomer Analysis
How much “chemical space” is “just generated” by drawing tautomers?
NCI/CADD Chemical Structure Database: Tautomer Analysis
• CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism)
• rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
• rule set is systematically applied to the original structure (and all tautomers that have been generated in previous steps)
• tautomer generation is limited to 1000 SMIRKS transform operations/structure
• all tautomers are ranked by a scoring function
• the highest ranked tautomer is defined as the canonical tautomer
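The enumeration scheme in these bullets (apply every rule to the starting structure and to every new product, stop after a fixed number of transform operations, then rank) can be sketched generically. The code below is a toy model and not CACTVS: structures are plain strings, the single transform `flip_one_site` and the scoring function are invented stand-ins for the SMIRKS rules and the real scoring function, and only the control flow is meant to mirror the slide.

```python
from collections import deque


def flip_one_site(form):
    """Toy transform: toggle one site between 'K' (keto-like) and 'E' (enol-like)."""
    for i, site in enumerate(form):
        yield form[:i] + ("E" if site == "K" else "K") + form[i + 1:]


def enumerate_forms(start, transforms, max_ops=1000):
    """Apply all transforms to the start form and to every newly found form.

    Returns (forms, exhaustive); exhaustive is False if the operation
    limit was reached before every form had been processed.
    """
    seen = {start}
    queue = deque([start])
    ops = 0
    while queue:
        current = queue.popleft()
        for transform in transforms:
            if ops >= max_ops:
                return seen, False
            ops += 1
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen, True


def canonical_form(forms, score):
    """Highest-scoring form; ties are broken by string order for determinism."""
    return max(forms, key=lambda form: (score(form), form))


forms, exhaustive = enumerate_forms("KK", [flip_one_site])
canonical = canonical_form(forms, score=lambda form: form.count("K"))
```

With the toy input "KK" the enumeration finds four forms and finishes well below the limit; lowering `max_ops` shows how a capped run is reported as not exhaustive.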
NCI/CADD Chemical Structure Database: Tautomer Analysis
• 21 SMIRKS transform rules:
rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: keten/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxim/nitroso
rule 17: oxim/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids
NCI/CADD Chemical Structure Database: Tautomer Analysis
rule 1: 1.3 (thio)keto/(thio)enol
[O,S,Se,Te;X1:1]=[C;z{1-2}:2][CX4R{0-2}:3][#1:4]>>[#1:4][O,S,Se,Te;X2:1][#6;z{1-2}:2]=[C,cz{0-1}R{0-1}:3]
[Figure: 1.3 keto/enol example with atoms numbered 1 to 4]
rule 6: 1.3 heteroatom H shift
[N,n,S,s,O,o,Se,Te:1]=[NX2,nX2,C,c,P,p:2][N,n,S,O,Se,Te:3][#1:4]>>[#1:4][N,n,S,O,Se,Te:1][NX2,nX2,C,c,P,p:2]=[N,n,S,s,O,o,Se,Te:3]
[Figure: 1.3 heteroatom H shift example with atoms numbered 1 to 4]
NCI/CADD Chemical Structure Database: Tautomer Analysis
72.0 million FICTS parent structures
• 8.6% of the FICTS parent structures change tautomeric form during FICuS normalization
70.6 million FICuS parent structures
• 98.5% of the FICuS parent structures have a one-to-one relationship to a single FICTS parent structure
• 1.5% of the FICuS parent structures have a one-to-many relationship to several FICTS parent structures ("conflict")
structure counts are based on the 2009 version of CSDB (103.9 million structure records)
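A "conflict" in this sense is a FICuS parent structure that is reached from more than one FICTS parent structure. A minimal sketch of that bookkeeping follows; the function name and the toy identifiers are hypothetical, and the real analysis would of course run over the full CSDB mapping.

```python
from collections import defaultdict


def find_conflicts(pairs):
    """Group FICTS identifiers by FICuS identifier and keep one-to-many groups.

    pairs: iterable of (ficts_id, ficus_id) tuples.
    """
    ficts_by_ficus = defaultdict(set)
    for ficts_id, ficus_id in pairs:
        ficts_by_ficus[ficus_id].add(ficts_id)
    return {ficus: ficts for ficus, ficts in ficts_by_ficus.items() if len(ficts) > 1}


# Toy data with made-up identifiers: two drawn tautomers share one FICuS parent.
pairs = [
    ("ficts-A", "ficus-1"),
    ("ficts-B", "ficus-1"),
    ("ficts-C", "ficus-2"),
]
conflicts = find_conflicts(pairs)
conflict_share = len(conflicts) / len({ficus for _, ficus in pairs})
```

In the toy data one of the two FICuS parents is a conflict, so `conflict_share` is 0.5; the slide reports 1.5% for the 2009 CSDB.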
NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: tautomeric overlap within each individual database release (%), x-axis from 0.0 to 2.0; y-axis: frequency (number of database releases), 0 to 90]
average: ~0.3% of original structure records
database releases named in the figure, in five groups:
• Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs
• Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat
• NCI/DTP, PASS Training Set, SGC-Ox
• ChemDB, ZINC
• ChEBI, ChemSpider
NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: occurrence of "tautomerism-critical" molecules within each individual database release (%), i.e. the percentage of FICuS parent structures in each database release occurring somewhere in CSDB with a conflict; x-axis from 0.5 to 24.5; y-axis: frequency (number of database releases), 0 to 30]
average: ~9.5% of FICuS parent structures
Example of a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
[Figure: structure of HPMBP]
• HPMBP is used in liquid membranes (selective removal of metal ions)
• selectivity and efficiency depend on the tautomeric form of HPMBP
• the tautomeric form depends on solvent and concentration of HPMBP
He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54 (10), 2944-2947.
Example of a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
• CACTVS generates 7 tautomers; one of them is the canonical tautomer by CACTVS
• 5 have potential stereo centers on atoms or bonds (three marked R/S, two marked E/Z)
[Figure: structures of the 7 tautomers]
Example of a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
• 3 have CAS Registry Numbers assigned:
  - 4551-69-1: 859 references
  - 33064-14-1: 49 references
  - 127117-31-1: 3 references
[Figure: structures of the 7 tautomers; stereo labels on the CAS-registered forms: (no stereo), (Z)]
Example of a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
occurrences in databases indexed in CSDB:
• 12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
• 16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
• 3 databases (R): ChemSpider, ECOTOX, ZINC
• 2 databases (S): ChemSpider, ZINC
• 6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
• 1 database (no stereo): ChemDB
[Figure: structures of the 7 tautomers (three marked R/S, two marked E/Z) annotated with these database occurrences]
NCI/CADD Chemical Structure Database: Tautomer Analysis
starting from the set of FICuS parent structures (70.6 million), we systematically generated all tautomers based on the 21 SMIRKS rule set available in CACTVS
• how many tautomers are generated?
• how often is each rule applied (type of tautomerism)?
• how many tautomers per structure?
result: 680 million tautomers generated; for 1.7% of the FICuS parent structures the enumeration was not exhaustive
NCI/CADD Chemical Structure Database: Tautomer Analysis
• usage of SMIRKS rules (1/2):

tautomer rule                                      generated tautomers (count)      %
rule 1: 1.3 (thio)keto/(thio)enol                  173,002,712                      25.4
rule 2: 1.5 (thio)keto/(thio)enol                  11,541,452                       1.7
rule 3: simple (aliphatic) imine                   35,917,415                       5.3
rule 4: special imine                              4,306,155                        0.6
rule 5: 1.3 aromatic heteroatom H shift            25,678,446                       3.8
rule 6: 1.3 heteroatom H shift                     250,453,882                      36.8
rule 7: 1.5 (aromatic) heteroatom H shift (1)      27,542,770                       4.0
rule 8: 1.5 aromatic heteroatom H shift (2)        26,819                           <0.1
rule 9: 1.7 (aromatic) heteroatom H shift          57,242,472                       8.4
rule 10: 1.9 (aromatic) heteroatom H shift         5,061,731                        0.7
rule 11: 1.11 (aromatic) heteroatom H shift        1,374,235                        0.2
rule 12: furanones                                 17,860,604                       2.6
NCI/CADD Chemical Structure Database: Tautomer Analysis
• usage of SMIRKS rules (2/2):

tautomer rule                                      generated tautomers (count)      %
rule 13: keten/ynol exchange                       57,989                           <0.1
rule 14: ionic nitro/aci-nitro                     428,266                          <0.1
rule 15: pentavalent nitro/aci-nitro               129                              <0.1
rule 16: oxim/nitroso                              505,695                          <0.1
rule 17: oxim/nitroso via phenol                   131,502                          <0.1
rule 18: cyanic/iso-cyanic acids                   181                              <0.1
rule 19: formamidinesulfinic acids                 1,392                            <0.1
rule 20: isocyanides                               229                              <0.1
rule 21: phosphonic acids                          54,926                           <0.1
NCI/CADD Chemical Structure Database: Tautomer Analysis
• number of tautomers per structure:

FICuS structures with      count           %
no tautomers               9,756,186       13.8
one tautomer               10,721,845      15.2
2-10 tautomers             33,532,284      47.5
11-25 tautomers            10,870,312      15.4
25-50 tautomers            2,622,587       3.7
51-100 tautomers           1,136,066       1.6
101-200 tautomers          565,199         0.8
201-300 tautomers          104,875         0.1
301-400 tautomers          35,144          <0.1
401-500 tautomers          17,241          <0.1
501-600 tautomers          4,323           <0.1
601-700 tautomers          1,400           <0.1
701-800 tautomers          362             <0.1
801–832 tautomers          3               <0.1

[Figure: example tautomer pair] many minor tautomeric forms (but you find them in databases)
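The distribution can be summarized with a little arithmetic. The sketch below copies the counts from the table and computes how many structures fall into any bin and what share of those has at most ten tautomers; note that the percentages printed on the slide appear to refer to the full set of 70.6 million FICuS parent structures, whereas this share is taken over the binned structures only.

```python
# Counts per bin, copied from the slide (FICuS structures with N tautomers).
counts = {
    "no tautomers": 9_756_186,
    "one tautomer": 10_721_845,
    "2-10": 33_532_284,
    "11-25": 10_870_312,
    "25-50": 2_622_587,
    "51-100": 1_136_066,
    "101-200": 565_199,
    "201-300": 104_875,
    "301-400": 35_144,
    "401-500": 17_241,
    "501-600": 4_323,
    "601-700": 1_400,
    "701-800": 362,
    "801-832": 3,
}

total_binned = sum(counts.values())
at_most_10 = counts["no tautomers"] + counts["one tautomer"] + counts["2-10"]
share_at_most_10 = at_most_10 / total_binned

print(total_binned)  # 69367827
print(f"{share_at_most_10:.1%} of binned structures have at most 10 tautomers")
```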
Tautomer Analysis: Tanimoto Similarities of Tautomers
• canonical tautomer vs. generated tautomers (680 million tautomer set)
• PubChem/CACTVS E_SCREEN bitvector (881 bits)

Tanimoto index range      count            %
>0.9-1.0                  310,725,465      45.6
>0.8-0.9                  214,747,976      31.5
>0.7-0.8                  111,954,384      16.4
>0.6-0.7                  36,448,651       5.3
>0.5-0.6                  6,304,436        0.9
>0.4-0.5                  369,331          <0.1
>0.3-0.4                  6,580            <0.1
>0.2-0.3                  6                <0.1
>0.0-0.2                  0                0.0

~23% below 0.8 Tanimoto similarity (although the same molecule)
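For reference, the Tanimoto index between two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below shows that formula on sets of bit positions; it does not compute E_SCREEN fingerprints, the bit positions in the example are arbitrary, and returning 1.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto index of two fingerprints given as collections of set-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0  # convention for two empty fingerprints (an assumption)
    return len(a & b) / len(union)


# Two arbitrary toy fingerprints sharing 2 of 4 distinct bits.
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
print(similarity)  # 0.5
```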
Scaffold Analysis
NCI/CADD Chemical Structure Database: Scaffold Analysis
• molecular scaffold tree (level 1, level 2, …): Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58
• archetype scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
• simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
[Figure: example compound with its simple scaffold, archetype scaffold, and scaffold tree levels 1 and 2]
NCI/CADD Chemical Structure Database: Scaffold Analysis
76.2 million uuuuu compound set (CSDB)
scaffold types: molecular scaffold tree (level 1, level 2, …), archetype scaffold, simple scaffold
scaffold counts shown: 8.1 million scaffolds, 6.8 million scaffolds, 0.8 million scaffolds
[Figure: example scaffolds for scaffold tree levels 1 and 2]
NCI/CADD Chemical Structure Database: Scaffold Analysis
76.2 million uuuuu compound set (CSDB)
molecular scaffold tree: 8.1 million scaffolds
[Chart: number of unique scaffolds per hierarchy level; x-axis: Hierarchy Level (1 to 10); left y-axis: Number of Unique Scaffolds (in millions), 0 to 8.0; right y-axis: Number of unique structures (in millions), 0 to 80.0]
NCI/CADD Chemical Structure Database: Scaffold Analysis
[Figure: example scaffolds with substituent positions R1 to R10 and occurrence counts per position]
• 2,281 uuuuu parent structures: 5334 structure records in 64 databases
• 2,726 uuuuu parent structures: 6007 structure records in 66 databases
• 744,469 uuuuu parent structures: 1,069,046 structure records in 66 databases
Atom Neighborhoods
NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
Filimonov, D.; Poroikov, V.; Borodina, Yu.; Gloriozova, T. J. Chem. Inf. Comput. Sci. 1999, 39 (4), 666-670.
[Figure: example structure with its MNA descriptors]
MNA level 1: HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC
MNA level 2:
C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))
NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
76.2 million uuuuu compound set (CSDB)
Unique MNAs:
• level 1: 13,426; 1.3 billion relationships (~17 per uuuuu parent structure)
• level 2: 918,516; 2.3 billion relationships (~30 per uuuuu parent structure)
424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider
Chemical Structure Web Services: Indexing Chemical Space
[Architecture diagram; components shown:]
• NCI/CADD web services (accessed via http), including the Chemical Identifier Resolver
• NCI/CADD Chemical Structure Database (CSDB)
• CACTVS
• other software packages, e.g. OPSIN
• external web services
NCI/CADD Web Resources: Chemical Identifier Resolver
http://cactus.nci.nih.gov/chemical/structure
http://cactus.nci.nih.gov/blog
Acknowledgments
ChemNavigator: Scott Hutton, Tad Hurst
CADD Group, CBL, NCI: Igor Filippov
University of Cambridge: Daniel Lowe, Peter Murray-Rust
Noel O'Boyle (University College Cork, Ireland)
Richard Apodaca (Metamolecular)
Hans-Juergen Himmler
Thanks to all database providers!
Our web site: http://cactus.nci.nih.gov
Acknowledgments - Software
CACTVS
Python Web Framework
ChemWriter
Python SQL Library
Javascript library
Peter Ertl (Novartis)