Screens as a Secure Descriptor of Chemistry Space

Nikolay Osadchiy and Sergey Trepalin

WWW.CHEMDIV.COM

11558 Sorrento Valley Road, San Diego, CA 92121, USA. Tel.: 858-756-7996, Fax: 858-794-4931

E-mail: [email protected], [email protected]

Source: ...acscinf.org/docs/meetings/229nm/presentations/229nm29.pdf

Introduction

Typical Tasks:
• Calculate diversity of a chemistry space
• Describe the chemistry space of hit series
• Determine ‘voids’ in a chemistry space
• Select the most similar compounds
• Select a set of the most dissimilar compounds

What is Secure?

• External security (encryption) is out of scope: we consider the security of the data itself, not of its transfer
• The nature of the data should allow for an open format
• Open format of the exchange file
• Complex derivative descriptors are not always trusted by the disclosing party
• Data should be unaltered

What is ‘Screen’?

• Centroidal structural fragment
• Concept is implemented in ChemoSoft and as a standalone tool
• Arbitrary sphere size: a trade-off between informativeness, security, and performance
• Default implementation: 4-bond-length radius
• Question: what is the ‘optimal’ size of the structural fragment?
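The idea of a fragment centred on an atom and bounded by a bond radius can be illustrated with a short sketch. It assumes a screen is simply the set of atoms reachable within a fixed number of bonds from a central atom; the slides do not show how ChemoSoft builds its screens, so the function `sphere_atoms`, the adjacency-list molecule, and the toy chain are all hypothetical.

```python
from collections import deque


def sphere_atoms(adjacency, center, radius):
    """Return the atoms within `radius` bonds of `center` (breadth-first search)."""
    depth = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if depth[atom] == radius:
            continue  # do not expand beyond the sphere boundary
        for neighbor in adjacency[atom]:
            if neighbor not in depth:
                depth[neighbor] = depth[atom] + 1
                queue.append(neighbor)
    return set(depth)


# Hypothetical toy molecule: a linear chain of six heavy atoms, 0-1-2-3-4-5.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

small = sphere_atoms(chain, center=0, radius=2)  # radius-2 fragment
large = sphere_atoms(chain, center=0, radius=4)  # radius-4 fragment
```

A larger radius captures more of the molecule per fragment, which is the informativeness versus security trade-off named in the bullets above.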

Exchange File Structure

SDF format. Information gathered:
• Screen – Structure
• Screen – Weight
• Total Number of Molecules in the Set

$$w_i = \sum_{j \in J(i)} \frac{1}{M_j^{2}}$$

Similarity/Diversity Algorithm

• Each molecule is assigned a fingerprint vector against the set of screens
• Pair-wise similarity coefficient: cosine metric
• Average similarity to the set of molecules is calculated
• Diversity is reciprocal to similarity
• Screens are explicitly collected for each molecule, so reverse engineering is possible

Implemented in ChemoSoft software: Trepalin et al., JCICS, 42 (2), 2002, 249-258; http://www.chemosoft.com
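The similarity steps above can be sketched as follows. This is a minimal illustration, assuming binary fingerprints (1 if a screen occurs in the molecule, 0 otherwise); the function names and the toy vectors are hypothetical and do not come from ChemoSoft.

```python
import math


def cosine(a, b):
    """Cosine coefficient between two fingerprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average_similarity(query, reference_set):
    """Mean cosine similarity of one molecule to every molecule in a set."""
    return sum(cosine(query, ref) for ref in reference_set) / len(reference_set)


# Hypothetical fingerprints over four screens.
reference = [[1, 1, 0, 0], [1, 0, 1, 0]]
candidates = {"A": [1, 1, 1, 0], "B": [0, 0, 0, 1]}

# Sort candidates from most to least similar to the reference set.
ranked = sorted(
    candidates,
    key=lambda name: average_similarity(candidates[name], reference),
    reverse=True,
)
```

Candidate "A" shares screens with both reference molecules and ranks first; "B" shares none and has an average similarity of zero.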

Modification – Increased Performance

• Weights were introduced and assigned to screens
• Holliday et al.* showed:

$$\sum_{I=1}^{N} \sum_{J=1}^{N} \mathrm{SIMILARITY}(I, J) = A_C \cdot A_C$$

A_C – centroid vector; W(I) – weight of the I-th compound; F – total number of screens

• Screens are collected with their weights and are no longer assigned to particular molecules
• Security is increased along with the performance

* Holliday J. et al., Quant. Struct.-Act. Relat. 1995, 14, 501-506.
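The centroid identity can be checked numerically. The sketch below assumes the compound weight W(I) is the reciprocal Euclidean norm of the I-th fingerprint, so that the centroid is the sum of unit-length fingerprints; this weighting is an assumption, since the slide does not spell it out. All names and the toy fingerprints are hypothetical.

```python
import math


def normalize(vector):
    """Scale a fingerprint to unit length (assumed weight W(I) = 1 / |A_I|)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def pairwise_sum(vectors):
    """Sum of cosine similarities over all ordered pairs (I, J): O(N^2)."""
    units = [normalize(v) for v in vectors]
    return sum(sum(a * b for a, b in zip(u, w)) for u in units for w in units)


def centroid_sum(vectors):
    """Same quantity via the centroid vector dotted with itself: O(N)."""
    units = [normalize(v) for v in vectors]
    centroid = [sum(column) for column in zip(*units)]
    return sum(c * c for c in centroid)


# Hypothetical binary fingerprints over four screens.
fingerprints = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 1]]
difference = abs(pairwise_sum(fingerprints) - centroid_sum(fingerprints))
```

Because the centroid holds only one accumulated value per screen, it no longer records which molecule contributed which screen, which fits the security point made above.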

Experiment

• Retrieval of the molecules of a specific class (similarity)
• Diversity optimization
• Selectivity between close, yet different chemical classes
• Voids filling
• Comparative study of screens with two different radii
• How does the screen’s radius impact performance?

Dataset

917 known approved drugs
• MDL Drug Data Report -> launched drugs
• C-C bond search
• Manual filtering
  – Polypeptides
  – Non-specific molecules (e.g. disinfectants)
  – Porphyrin complexes
  – Boronic compounds
  – Trivial alkylating agents (e.g. nitrogen mustard compounds)

Reference Set: Steroids

• Total 39 steroids
• 20 in the training set
• Total S4 screens for the training set: 820
• Total S2 screens: 259

Training Set (20), Validation Set (19)

Screens-4

[Figure: chemical structure of a progestin and its S4 screens]

Progestin – total 58 Screens (S4), 28+30

Screens-2 (smaller radius)

[Figure: chemical structure of a progestin and its S2 screens]

Progestin – total 30 Screens (S2)

Similarity Sorting, S4

• Top 19 are steroids
• Similarity [0.288, 0.154]
• Threshold with #20 >0.05

[Figure: chemical structures of selected compounds from the sorted list]

Similarity Sorting, S2

• Top 19 are steroids
• Similarity [0.521, 0.320]
• Threshold with #20 >0.05

[Figure: chemical structures of selected compounds from the sorted list]

Similarity: S2 vs S4

• The same set is selected
• 8 of 19 molecules are placed at exactly the same positions, including Best2 and Worst2, and the first non-steroid
• S4 screens are more discriminative
• S2: faster calculations and smaller data volume; reverse engineering is virtually impossible

Diversity Optimization, S2

Input: 259 S2 screens (SDF)

• Initial diversity of the training set: 0.607
• First steroid comes at position 174 (downward slope)
• Diversity maximum 0.9453 reached at #88
• Compound #1 increases diversity by >6%

[Figure: chemical structures]

Diversity Optimization, S4

Input: 820 S4 screens (SDF)

• Initial diversity of the training set: 0.839
• First steroid comes at position 523 (!) (almost maximum, downward slope)
• Diversity maximum 0.8984 reached at #487
• Compound #1 increases diversity by ~1%

[Figure: chemical structures]

Diversity: S2 vs S4

• Adequate overall sorting
• S2 screens are better suited for smaller sets
• S4 screens are more beneficial for diversity analysis
• When it is critical to avoid overlap with existing chemistry space, S4 screens are better
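One way to picture diversity optimization is a greedy ordering of candidate compounds by how much each raises the diversity of the growing set. The sketch below is only an illustration under stated assumptions: diversity is taken as 1 minus the mean pairwise cosine similarity, which the slides do not confirm, and the greedy strategy, function names, and toy fingerprints are all hypothetical.

```python
import math


def unit(vector):
    """Scale a fingerprint to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def diversity(vectors):
    """1 - mean pairwise cosine similarity (assumed definition)."""
    n = len(vectors)
    if n < 2:
        return 0.0
    centroid = [sum(column) for column in zip(*(unit(v) for v in vectors))]
    pair_sum = sum(c * c for c in centroid) - n  # drop the n self-similarities
    return 1.0 - pair_sum / (n * (n - 1))


def greedy_order(base, candidates):
    """Repeatedly add the candidate that yields the highest diversity."""
    current, remaining, trace = list(base), dict(candidates), []
    while remaining:
        best = max(remaining, key=lambda k: diversity(current + [remaining[k]]))
        current.append(remaining.pop(best))
        trace.append((best, diversity(current)))
    return trace


# Hypothetical training set and candidate pool over four screens.
base = [[1, 1, 0, 0], [1, 1, 1, 0]]
pool = {"near": [1, 1, 0, 0], "far": [0, 0, 0, 1]}
trace = greedy_order(base, pool)
```

In this toy run the dissimilar candidate is added first and diversity then falls once only near-duplicates remain, mirroring the maximum followed by a downward slope reported on the optimization slides.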

Reverse Engineering: S2

[Figure: set of S2 screens; reconstruction of the parent molecule marked “? ?”]

19 Screens – 342 pairs; 5814 triplets; …

Requires consideration of >N! permutations

Plus Vector of Weights
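The counts quoted above match ordered selections of screens: 19 × 18 = 342 and 19 × 18 × 17 = 5814. A short check, where the variable names are illustrative only:

```python
import math

screens = 19

ordered_pairs = math.perm(screens, 2)     # 19 * 18
ordered_triplets = math.perm(screens, 3)  # 19 * 18 * 17
full_orderings = math.factorial(screens)  # N! for N = 19

print(ordered_pairs, ordered_triplets, full_orderings)
```

Even for the 19 screens of a single molecule, N! is already of the order of 10^17, before the weights are taken into account.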

Reverse Engineering: S4

[Figure: set of S4 screens; reconstruction of the parent molecule marked “! !”]

Plus Vector of Weights

Security: S2 vs S4

• S2: the structure is virtually irreproducible, even for a single molecule
• S4 may contain the molecule entirely
• MW estimate > 260 for safe S4 usage
• Both classes are strongly NP-hard for the reverse engineering of the molecules

Conclusion: S2 vs S4

• Practically identical results in the similarity test
• S4 is more sensitive in diversity/voids filling
• S2 is more secure and can be efficiently applied to small sets (1-5 molecules)
• S4 can be safely applied to
  – Large sets
  – Complex molecules
• Overall, S2 is more efficient in general cases

Selectivity

• 2 similar classes – 2 ‘similar’ sets of screens (S2)
• Test of the retrieval of the initial compounds by the set of screens
• ‘Noise’ check, especially for closely neighboring classes

Quinolines and Quinazoline

[Figure: chemical structures of the quinoline and quinazoline sets]

13 Molecules, 6 Training set, 174 S2 Screens (no quinoline fragments)

8 Molecules, XX S2 and ZZZ S4 Screens

Selectivity Experiment

• 174 S-2 screens as input
• Quinolines at ##1-5, #8 and #15
• Quinazolines at ##6, 7, 9, 11, 13, 21, 22, 30
• Naphthalene at #10

[Figure: chemical structures]

S2: Voids Filling

• 896-drug subset, quinolines and quinazolines excluded
• 40,000 molecules from ChemDiv stock
  – 722 quinolines (1.8%)
  – 322 quinazolines (0.8%)

Voids Filling, S2

[Figure: chemical structures]

Input: 4805 S2 screens (SDF)

• Initial diversity of the set: 0.862
• 1688 compounds to be added for maximum diversity of 0.9272 (filling voids)
• Concentration of quinolines in the first 1688 molecules: 3.50%, 59 total, first at #173
• Concentration of quinazolines in the first 1688 molecules: 2.19%, 37 total, first at #272
• Almost double enrichment compared to random sampling
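The enrichment figures can be recomputed from the counts given on the slides. The sketch assumes the baseline for random sampling is the class frequency in the 40,000-molecule stock (722 quinolines, 322 quinazolines); the function name is illustrative.

```python
def enrichment(hits, selected, class_total, stock_total):
    """Ratio of the class frequency in the selection to its frequency in the stock."""
    return (hits / selected) / (class_total / stock_total)


# Counts taken from the slides; the random-sampling baseline is an assumption.
quinolines = enrichment(hits=59, selected=1688, class_total=722, stock_total=40_000)
quinazolines = enrichment(hits=37, selected=1688, class_total=322, stock_total=40_000)

print(f"quinolines:   {quinolines:.2f}x")
print(f"quinazolines: {quinazolines:.2f}x")
```

Under this baseline the quinoline ratio is a little below 2 and the quinazoline ratio is somewhat higher, consistent with the "almost double" statement above.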

Conclusion

• A set of screens allows efficient similarity and voids-filling searches
• S-2 and S-4 screens exhibit comparably good selectivity
• Better security with S-2 screens, better sensitivity with S-4
• Open format, no hidden data: customer’s confidence
• Reverse engineering is strongly NP-hard for the set of molecules
• Screens and algorithms are implemented in ChemoSoft

Acknowledgements

• Dr. Sergey Tkachenko
• Dr. Alex Khvat
• Dr. Nikolay Savchuk
• Dr. Andrei Ivachtchenko