WWW.CHEMDIV.COM
11558, Sorrento Valley Road, San Diego, CA 92121, USA
Tel.: 858-756 7996, Fax: 858-794-4931
E-mail: [email protected], [email protected]
Screens as a Secure Descriptor of Chemistry Space
Nikolay Osadchiy and Sergey Trepalin
Typical Tasks:
• Calculate diversity of a chemistry space
• Describe the chemistry space of hit series
• Determine ‘voids’ in a chemistry space
• Select the most similar compounds
• Select a set of the most dissimilar compounds
Introduction
• External security (encryption) is out of scope: we consider the security of the data itself, not of its transfer
• Nature of data should allow for open format
• Open format of the exchange file
• Complex derivative descriptors are not always trusted by the disclosing party
• Data should be unaltered
What is Secure?
• Centroidal Structural Fragment
• Concept is implemented in ChemoSoft and as a standalone tool
• Arbitrary sphere size – trade-off between informativeness, security, and performance
• Default implementation: 4-bond length radius
• Question: What is the ‘optimal’ size of the structural fragment?
What is ‘Screen’?
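The idea of an atom-centered fragment with a radius measured in bonds can be sketched in a few lines. This is only an illustration, not the ChemoSoft implementation: the molecule is a hypothetical adjacency-list graph, `screen_atoms` is an invented helper name, and a real screen would also carry atom and bond types in a canonical form.

```python
from collections import deque

def screen_atoms(adjacency, center, radius):
    """Return the set of atoms within `radius` bonds of `center` (breadth-first search)."""
    depth = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if depth[atom] == radius:
            continue  # do not expand beyond the sphere
        for neighbor in adjacency[atom]:
            if neighbor not in depth:
                depth[neighbor] = depth[atom] + 1
                queue.append(neighbor)
    return set(depth)

# Hypothetical toy molecule: a linear chain of seven heavy atoms, 0-1-2-3-4-5-6.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}

s2 = screen_atoms(chain, center=3, radius=2)  # smaller sphere
s4 = screen_atoms(chain, center=3, radius=4)  # 4-bond radius
print(sorted(s2))
print(sorted(s4))
```

On this toy chain the radius-2 fragment covers five atoms, while the radius-4 fragment already covers the whole molecule, which illustrates the trade-off between informativeness and security.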
Exchange File Structure
SDF format
Information gathered:
• Screen – Structure
• Screen – Weight
• Total Number of Molecules in the Set
$$ w_i = \sum_{j \in J(i)} \frac{1}{M_j^{2}} $$
Similarity/Diversity Algorithm
• Each molecule is assigned a fingerprint vector against the set of screens
• Pair-wise similarity coefficient: cosine metric
• Average similarity to the set of molecules is calculated
• Diversity is reciprocal to similarity
• Screens are explicitly collected for each molecule, so reverse engineering is possible
Implemented in ChemoSoft Software, Trepalin et al., JCICS, 42 (2), 2002, 249-258; http://www.chemosoft.com
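The steps listed on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: fingerprints are binary and stored as sets of hypothetical screen ids, and diversity is taken as 1 minus the mean pairwise similarity, which is one possible reading of "reciprocal" and not confirmed by the slides.

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine coefficient for two binary fingerprints stored as sets of screen ids."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def average_similarity(molecule, others):
    """Mean cosine similarity of one fingerprint to a set of fingerprints."""
    return sum(cosine(molecule, other) for other in others) / len(others)

def diversity(fingerprints):
    """ASSUMPTION: diversity = 1 - mean pairwise similarity over distinct pairs."""
    pairs = list(combinations(fingerprints, 2))
    return 1.0 - sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Hypothetical fingerprints: each set holds the ids of the screens found in a molecule.
mol_a = {1, 2, 3, 4}
mol_b = {3, 4, 5, 6}
mol_c = {7, 8}

sim_ab = cosine(mol_a, mol_b)
avg_a = average_similarity(mol_a, [mol_b, mol_c])
div_abc = diversity([mol_a, mol_b, mol_c])
print(sim_ab, avg_a, div_abc)
```

Here `mol_a` and `mol_b` share two of their four screens, so their cosine similarity is 0.5, while `mol_c` shares nothing with either and raises the diversity of the set.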
Modification – Increased Performance
• Weights were introduced and assigned to screens
• Holliday et al.* showed:
$$ \sum_{I=1}^{N} \sum_{J=1}^{N} \mathrm{SIMILARITY}(I, J) = A_C \cdot A_C $$
* Holliday J. et al., Quant. Struct.-Act. Relat. 1995, 14, 501-506.
A_C – centroid vector
W(I) – weight of the I-th compound
F – total number of screens
• Screens are collected with their weights, and no longer assigned to particular molecules
• Security is increased along with the performance
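The performance gain from the centroid can be checked numerically on a toy example. The sketch below assumes that the weight W(I) is the reciprocal of the length of the I-th fingerprint vector, so that the centroid is the sum of unit vectors; the fingerprints themselves are hypothetical.

```python
from math import sqrt

def unit(vector):
    """Scale a fingerprint vector to unit length (ASSUMED weighting: W = 1 / length)."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

# Hypothetical binary fingerprints over F = 5 screens for N = 3 molecules.
fingerprints = [
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
]
units = [unit(fp) for fp in fingerprints]

# Left-hand side: explicit double sum over all N * N ordered pairs (including I == J).
pairwise_sum = sum(dot(u, v) for u in units for v in units)

# Right-hand side: build the centroid once (sum of unit vectors), then one dot product.
centroid = [sum(column) for column in zip(*units)]
centroid_sum = dot(centroid, centroid)

print(pairwise_sum, centroid_sum)
```

Both numbers agree up to rounding, but the second needs work proportional to N rather than N squared, and it only requires the per-screen totals, not the fingerprint of any individual molecule.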
• Retrieval of the molecules of a specific class (similarity)
• Diversity optimization
• Selectivity between close, yet different chemical classes
• Voids filling
• Comparative study of screens with two different radii
• How does the screen’s radius impact performance?
Experiment
917 known approved drugs
• MDL Drug Data Report -> launched drugs
• C-C bond search
• Manual filtering
– Polypeptides
– Non-specific molecules (e.g. disinfectants)
– Porphyrin complexes
– Boronic compounds
– Trivial alkylating agents (e.g. nitrogen mustard compounds)
Dataset
Reference Set: Steroids
• Total 39 Steroids
• 20 in Training Set
• Total Screens (S4) for the Training Set – 820
• Total S2 Screens – 259
Training Set (20) Validation Set (19)
Screens-4
[Figure: structure of Progestin and its S4 screen fragments]
Progestin – total 58 Screens (S4), 28+30
Screens-2 (smaller radius)
[Figure: structure of Progestin and its S2 screen fragments]
Progestin – total 30 Screens (S2)
Similarity Sorting, S4
• Top 19 are steroids
• Similarity [0.288, 0.154]
• Threshold with #20 >0.05
[Figure: example chemical structures]
Similarity Sorting, S2
• Top 19 are steroids
• Similarity [0.521, 0.320]
• Threshold with #20 >0.05
[Figure: example chemical structures]
• The same set is selected
• 8 of 19 molecules are placed at exactly the same positions, including the best two, the worst two, and the first non-steroid
• S4 screens are more discriminative
• S2 – faster calculations and smaller data volume, reverse engineering virtually impossible
Similarity: S2 vs S4
Diversity Optimization, S2
[Figure: example chemical structures]
Input: 259 S2 screens (SDF)
• Initial Diversity of the training set 0.607
• First Steroid comes at position 174 (downward slope)
• Diversity Maximum 0.9453 reached at #88
• Compound #1 increases diversity >6%
Diversity Optimization, S4
[Figure: example chemical structures]
Input: 820 S4 screens (SDF)
• Initial Diversity of the training set 0.839
• First Steroid comes at position 523 (!) (almost maximum, downward slope)
• Diversity Maximum 0.8984 reached at #487
• Compound #1 increases diversity ~1%
• Adequate overall sorting
• S2 Screens better suited for smaller sets
• S4 Screens are more beneficial for diversity analysis
• When it is critical to avoid overlap with existing chemistry space – S4 are better
Diversity: S2 vs S4
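The diversity-optimization experiments above report a maximum followed by a downward slope once compounds similar to the training set start to appear. A brute-force sketch of that behaviour is shown below. It is not the ChemoSoft algorithm, which works on weighted screens; the fingerprints, the names, and the definition of diversity as 1 minus the mean pairwise similarity are all assumptions made for illustration.

```python
from math import sqrt

def cosine(a, b):
    """Cosine coefficient for binary fingerprints stored as sets of screen ids."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def mean_similarity(candidate, selected):
    """Average similarity of one candidate to everything selected so far."""
    return sum(cosine(candidate, s) for s in selected) / len(selected)

def diversity(fingerprints):
    """ASSUMPTION: diversity = 1 - mean pairwise similarity over distinct pairs."""
    n = len(fingerprints)
    total = sum(cosine(fingerprints[i], fingerprints[j])
                for i in range(n) for j in range(i + 1, n))
    return 1.0 - total / (n * (n - 1) / 2)

def greedy_diversity_order(reference, candidates):
    """Repeatedly add the candidate least similar to everything selected so far.

    Returns a list of (candidate_name, diversity_after_adding) in selection order.
    """
    selected = list(reference)
    remaining = dict(candidates)
    trace = []
    while remaining:
        name = min(remaining, key=lambda k: mean_similarity(remaining[k], selected))
        selected.append(remaining.pop(name))
        trace.append((name, diversity(selected)))
    return trace

# Hypothetical training set and candidate pool.
reference = [{1, 2, 3}, {1, 2, 4}]
candidates = {
    "near": {1, 2, 5},        # close analogue of the training set
    "far": {7, 8, 9},         # no shared screens
    "mid": {3, 7, 10, 11},    # partial overlap
}
trace = greedy_diversity_order(reference, candidates)
best_step = max(range(len(trace)), key=lambda i: trace[i][1])
for name, value in trace:
    print(name, round(value, 4))
```

In this toy pool the close analogue is selected last, and adding it lowers the diversity, which mirrors the late position of the first steroid in the sorted lists.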
Reverse Engineering: S2
[Figure: set of S2 screens; the parent structure remains unknown (marked "?")]
19 Screens – 342 pairs; 5814 triplets; …
Requires consideration of >N! permutations
Plus Vector of Weights
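The counts quoted on this slide match the number of ordered selections from 19 screens, which can be verified directly. The variable names below are illustrative only.

```python
from math import factorial, perm

n_screens = 19

ordered_pairs = perm(n_screens, 2)     # 19 * 18
ordered_triplets = perm(n_screens, 3)  # 19 * 18 * 17
full_orderings = factorial(n_screens)  # N! orderings of all 19 screens

print(ordered_pairs, ordered_triplets, full_orderings)
```

The number of full orderings is already of the order of 10^17 for 19 screens, before the weights or the ways of overlapping the fragments are considered.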
Reverse Engineering: S4
[Figure: set of S4 screens; the parent structure can be recognized (marked "!")]
Plus Vector of Weights
• S2: the structure is virtually irreproducible, even for a single molecule
• S4 may contain the molecule entirely
• MW estimate > 260 for safe S4 usage
• Both classes are strongly NP-hard for the reverse engineering of the molecules
Security: S2 vs S4
• Practically identical results in the similarity test
• S4 is more sensitive in diversity/voids filling
• S2 is more secure and can be efficiently applied for small sets (1-5 molecules)
• S4 can be safely applied for
– Large sets
– Complex molecules
• Overall – S2 is more efficient in general cases
Conclusion: S2 vs S4
• 2 Similar Classes – 2 ‘similar’ sets of Screens (S2)
• Test of the initial compounds retrieval by the set of Screens
• ‘Noise’ check, especially for the close neighborhood classes
Selectivity
Quinolines and Quinazolines
[Figure: structures of the quinoline and quinazoline compounds]
13 Molecules, 6 Training set, 174 S2 Screens (no quinoline fragments)
8 Molecules, XX S2 and ZZZ S4 Screens
Selectivity Experiment
• 174 S-2 Screens as input
• Quinolines at ##1-5, #8 and #15
• Quinazolines at ##6, 7, 9, 11, 13, 21, 22, 30
• Naphthalene at #10
[Figure: example chemical structures]
• 896 Drugs subset, quinolines and quinazolines excluded
• 40,000 molecules from ChemDiv stock
– 722 Quinolines (1.8%)
– 322 Quinazolines (0.8%)
S2: Voids Filling
[Figure: example chemical structures]
Voids Filling, S2
[Figure: example chemical structures]
Input: 4805 S2 screens (SDF)
• Initial Diversity of the set 0.862
• 1688 compounds to be added for maximum diversity of 0.9272 (filling voids)
• Concentration of Quinolines in first 1688 molecules: 3.50%, 59 total, first at #173
• Concentration of Quinazolines in first 1688 molecules: 2.19%, 37 total, first at #272
• Almost Double Enrichment, compared to random sampling
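The enrichment claim can be reproduced from the numbers given on the slides: hits among the first 1688 selected molecules compared with their share of the 40,000-molecule stock. The helper name `enrichment` is illustrative, and treating the stock percentages as the random-sampling baseline is an assumption.

```python
def enrichment(hits_in_selection, selection_size, hits_in_pool, pool_size):
    """Ratio of the hit rate in the selection to the hit rate in the whole pool."""
    selected_rate = hits_in_selection / selection_size
    baseline_rate = hits_in_pool / pool_size
    return selected_rate / baseline_rate

# Counts taken from the slides: 59 and 37 hits in the first 1688 molecules,
# 722 quinolines and 322 quinazolines in the 40,000-molecule stock.
quinolines = enrichment(59, 1688, 722, 40_000)
quinazolines = enrichment(37, 1688, 322, 40_000)

print(f"quinolines:   {quinolines:.2f}x")
print(f"quinazolines: {quinazolines:.2f}x")
```

With these inputs the factor comes out at roughly 1.9 for quinolines and 2.7 for quinazolines.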
• Set of Screens allows efficient similarity/voids filling searches
• S-2 and S-4 Screens exhibit comparable good selectivity
• Better Security with S2-Screens, Better sensitivity with S-4
• Open format – no hidden data – customer’s confidence
• Reverse engineering strongly NP-hard for the set of molecules
• Screens and algorithms are implemented in ChemoSoft
Conclusion