Developing & Benchmarking Large-scale Docking (LSD) Pipeline
Niu Huang, 02/17/2004


LSD pipeline

Binding Site Refinement (PLOP/Modeller)
LigBase
Model Building (ModBase/PDB)
Ligand Docking (DOCK3.5.5.4)
Post-docking Refinement (PLOP)
Central Database System

Where are we now?

Applications (CK – enolase, Jenni – malaria related, Chris – sporulation …)
LSD Modules (Testing & Debugging, Benchmarking)
Detailed investigation (Enrichment, Binding properties, Performance …)

Docking pipeline

Target Protein
SPHGEN
DATABASE
GRID
SCORING
DATA ANALYSIS

[Figure: Energy component histograms for docked AR ligands. x-axis: Energy (kcal/mol); y-axis: frequency; series: ele. Ene., vdw. Ene., Total Ene.]

Test case (from J. Med. Chem., McGovern & Shoichet, 2003)

[Figure: 2D chemical structures of ligands, one panel per target: DHFR, GART, TS, Thrombin, PNP, SAHH, AChE, AR, PARP.]

Expert vs automated docking

Enrichment plots comparing the performance of an expert (dark blue), the automated procedure (magenta, referred to as Test10), and random enrichment (black).

[Figure: nine enrichment plots, one per target (AChE, AR, DHFR, GART, PARP, PNP, SAHH, Thrombin, TS). x-axis: % of database; y-axis: % of ligands found (0 to 100).]
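The enrichment plots on this slide can be reproduced from a ranked docking hit list. The following is a minimal sketch, assuming the docked database is available as a list of compound ids sorted best score first and the known ligands as a set of ids; the function names and the toy ids are hypothetical and not part of the LSD pipeline.

```python
def enrichment_curve(ranked_ids, known_ligands):
    """Return (% of database, % of ligands found) points for a ranked list.

    ranked_ids: compound ids sorted best score first (hypothetical input).
    known_ligands: set of ids of the known ligands seeded in the database.
    """
    total = len(ranked_ids)
    n_ligands = len(known_ligands)
    if total == 0 or n_ligands == 0:
        raise ValueError("need a non-empty ranking and at least one known ligand")
    points = [(0.0, 0.0)]
    found = 0
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in known_ligands:
            found += 1
        points.append((100.0 * rank / total, 100.0 * found / n_ligands))
    return points


def random_curve(total):
    """Expected curve for a random ranking: the diagonal y = x."""
    return [(100.0 * k / total, 100.0 * k / total) for k in range(total + 1)]


# Toy example: 10 compounds, 2 known ligands ranked 1st and 4th.
ranking = ["L1", "d1", "d2", "L2", "d3", "d4", "d5", "d6", "d7", "d8"]
curve = enrichment_curve(ranking, {"L1", "L2"})
```

A curve that rises well above `random_curve` in the first few percent of the database corresponds to good enrichment.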

Approach to “expert docking” limit?

% of db to find 25% of known ligands

Enzyme/target   Susan (expert docking)   John (auto)   Niu (best)   Niu (Test10)
AChE            6.3                      6.5           4.8          5.0
Adometc         11                       N/A           0.65         0.72
AR              2.8                      6.4           2.9          3.5
DHFR            2.0                      7.9           0.3          0.3
GART            0.4                      6.9           0.95         0.95
L99A            N/A                      0.5           0.05         0.05
PARP            2.9                      3.9           2.8          4.6
PNP             2.8                      3.1           1.25         1.25
SAHH            1.1                      1.8           2.1          2.1
Thrombin        6.6                      9.5           4.2          4.2
TS              3.5                      9.4           1.55         1.55
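The metric in this table, "% of db to find 25% of known ligands", can be computed from a ranked hit list as sketched below. This assumes the same hypothetical inputs as any enrichment calculation (ids sorted best score first, plus a set of known ligand ids) and rounds the required number of ligands up; the slides do not state how ties or rounding were handled.

```python
import math


def percent_db_to_find(ranked_ids, known_ligands, fraction=0.25):
    """Percent of the ranked database needed to recover `fraction` of the ligands.

    Returns None if the ranking never reaches that fraction (for example,
    because too many ligands failed to dock).
    """
    total = len(ranked_ids)
    # Number of ligands that must be found; rounding up is an assumption.
    needed = max(1, math.ceil(fraction * len(known_ligands)))
    found = 0
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in known_ligands:
            found += 1
            if found >= needed:
                return 100.0 * rank / total
    return None


# Toy example: 200 compounds, 8 known ligands; the 2nd ligand sits at rank 10.
toy_ranking = ["c%d" % i for i in range(200)]
toy_ligands = {"c2", "c9", "c50", "c60", "c70", "c80", "c90", "c100"}
toy_percent = percent_db_to_find(toy_ranking, toy_ligands)
```

Lower values are better: a value of 0.3 means that a quarter of the known ligands appear within the top 0.3% of the ranked database.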

Missing atoms

Case analysis (DHFR)

DHFR cont. 1

DHFR: docking parameters, enrichment, and docking statistics (CPU 2.4 GHz)

Test1: INDOCK.3; .useligsph = on; 70 matching spheres
% of db to find 25% of known ligands = 31; max. enrichment factor < 5
~300 hrs, 125580 cmpds scored

Test?: INDOCK.3; .useligsph = on; 50 matching spheres
% of db to find 25% of known ligands = 9.9; max. enrichment factor = 8.2
~31 hrs, 126545 cmpds scored

Test?: INDOCK.2; .useligsph = on; 50 matching spheres
% of db to find 25% of known ligands = 47; max. enrichment factor < 5
~16 hrs, 122876 cmpds scored

Test9: INDOCK.3; .usefragsph = on; 50 matching spheres
% of db to find 25% of known ligands = 4.3; max. enrichment factor = 9.2
~50 hrs, 109638 cmpds scored

Test10: INDOCK.3; .usefragsph = on; 35 matching spheres
% of db to find 25% of known ligands = 0.3, 3.7*, 2.9**, 0.3***, (2.0); max. enrichment factor = 110, 64*, 43**, 128***, (29)
~15 hrs, 85288 cmpds scored (~7.5 hrs, 70541 cmpds from Susan)
* without cofactor; ** without HOH; *** without HIP28; values in parentheses are from Susan.

Test11: INDOCK.1; .usefragsph = on; 35 matching spheres
% of db to find 25% of known ligands = 3.5; max. enrichment factor = 76
~2 hrs, 55149 cmpds scored
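The slides do not define "max. enrichment factor". One common definition, used here as an assumption, is the largest ratio between the fraction of known ligands found and the fraction of the database screened, taken over all rank cutoffs. A minimal sketch with hypothetical names:

```python
def max_enrichment_factor(ranked_ids, known_ligands, min_rank=1):
    """Largest enrichment factor over all rank cutoffs >= min_rank.

    EF(rank) = (found / n_ligands) / (rank / total), written as a single
    fraction below to avoid intermediate rounding.
    """
    total = len(ranked_ids)
    n_ligands = len(known_ligands)
    if total == 0 or n_ligands == 0:
        raise ValueError("need a non-empty ranking and at least one known ligand")
    best = 0.0
    found = 0
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in known_ligands:
            found += 1
        if rank >= min_rank:
            ef = found * total / (n_ligands * rank)
            best = max(best, ef)
    return best


# Toy example: 100 compounds, 4 known ligands at ranks 2, 5, 40 and 90.
ef_ranking = ["c%d" % i for i in range(1, 101)]
ef_ligands = {"c2", "c5", "c40", "c90"}
ef_max = max_enrichment_factor(ef_ranking, ef_ligands)
```

Because the factor is noisy at very small cutoffs, `min_rank` lets the maximum be restricted to cutoffs that contain enough compounds to be meaningful.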

DHFR cont. 2

[Figure: four histograms of total interaction energy (x-axis) versus frequency (y-axis): dhfr_ligands_test1, dhfr_mddr_test1, dhfr_mddr_test10, dhfr_ligands_test10.]

Using a focused set of spheres appears to be essential for reducing the noise caused by an inaccurate scoring function that favors wrong docking poses; restricting the matching spheres to those that fill the hot-spot region alleviates the problem.
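The histograms on this slide compare the total interaction energies of docked ligands and MDDR decoys for Test1 and Test10. A minimal sketch of the binning, assuming the energies have already been parsed into a list of floats (the 5 kcal/mol bin width and the toy values are illustrative, not taken from the slides):

```python
import math
from collections import Counter


def energy_histogram(energies, bin_width=5.0):
    """Count energies per bin; each key is the lower edge of its bin."""
    counts = Counter(math.floor(e / bin_width) * bin_width for e in energies)
    return dict(sorted(counts.items()))


# Toy total interaction energies for a handful of docked ligands.
toy_energies = [-42.1, -38.5, -37.0, -12.3]
toy_hist = energy_histogram(toy_energies)
```

Plotting the ligand and decoy histograms on the same energy axis shows how much the two score distributions overlap for a given sphere set.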

DHFR cont. 3

[Figure panels: Test1, docked ligands and top-scored MDDR decoys; Test10, docked ligands and top-scored MDDR decoys.]

Case analysis (Aldose Reductase)

* Structure, 1997, 5:601-612

The conformational flexibility of the binding site, as indicated by crystal structures, appears to contribute to the poor enrichment; however, the poor enrichment may also be due to other factors, such as the lack of a protein desolvation penalty in the scoring function.

AR cont. 1

[Figure: Energy component histograms for docked AR ligands. x-axis: Energy (kcal/mol); y-axis: frequency; series: ele. Ene., vdw. Ene., Total Ene.]

The correlation coefficients between electrostatic energy and total energy, and between vdW energy and total energy, are 0.74 and 0.66, respectively, for docked ligands, and 0.62 and -0.33 for the top 500 docked decoys. Clearly, the electrostatic interaction is far too favorable and dominates the interaction energy score for docked decoys, which might be remedied by including a protein desolvation penalty.
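The slide does not say which correlation coefficient was used; assuming it is the Pearson coefficient, the energy component analysis can be sketched as follows. The per-compound electrostatic, vdW and total energies are hypothetical inputs, and the toy numbers are not data from the AR runs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy energies (kcal/mol) for three docked compounds.
ele = [-30.0, -20.0, -10.0]
vdw = [-10.0, -25.0, -15.0]
total = [e + v for e, v in zip(ele, vdw)]
r_ele_total = pearson(ele, total)
r_vdw_total = pearson(vdw, total)
```

A total energy that tracks the electrostatic term much more closely than the vdW term, as reported for the decoys, is the signature of an electrostatics-dominated score.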

[Figure: Energy component histograms for docked top 500 AR decoys. x-axis: Energy (kcal/mol); y-axis: frequency; series: ele. Ene., vdw. Ene., total Ene.]

Case analysis (PARP)

PARP: docking parameters, enrichment, and docking statistics

Test?: INDOCK.1; 70 matching spheres
% of db to find 25% of known ligands = 3.0; max. enrichment factor = 9.1
113008 cmpds

Test?: INDOCK.2; 70 matching spheres
% of db to find 25% of known ligands = 7.3; max. enrichment factor = 4.2
121035 cmpds

Test?: INDOCK.3; 70 matching spheres
% of db to find 25% of known ligands = 10.5; max. enrichment factor = 3.3
123872 cmpds

Test9: INDOCK.3; 50 matching spheres
% of db to find 25% of known ligands = 9.4; max. enrichment factor = 3.5
123401 cmpds

Test10: INDOCK.3; 35 matching spheres
% of db to find 25% of known ligands = 4.5; max. enrichment factor = 6.4
116717 cmpds

Test11: INDOCK.1; 35 matching spheres
% of db to find 25% of known ligands = 2.8; max. enrichment factor = 9.6
96533 cmpds

PARP cont. 1

[Figure panels: docked ligands; top-scored MDDR decoys.]

Case analysis (AChE)

Poor enrichment (5.0% of db to find 25% of known ligands) appears to be caused by the large number of improbable docking poses. The AChE binding cavity is large, with many waters and more than one clear binding region in the pocket; no direct hydrogen bonds between the ligand and the protein have been observed, only water-bridged hydrogen bonds, which makes this a particularly hard case to dock to (Jacobsson, JMC, 2004).

Can we do something about it to improve our docking for such cases?

Case analysis (Thrombin)

Multiple binding sub-sites? Anything to do with the way the dockable database is generated and the way spheres are matched?

Preliminary Conclusion

A fully automated docking procedure and a consistent parameter set for grid generation, docking and scoring appear to perform well across all the tested systems.

Cofactors, iron and structural waters involved in ligand binding must be carefully inspected, as must the protonation states of amino acid residues in the binding site.

“Larger binding pocket, more extensive sampling – INDOCK.3” is required (validated by the DHFR, TS, thrombin and GART test sets).

Docking spheres and DelPhi spheres can be generated using different schemes. A focused set of matching spheres was shown to be critical for systems like DHFR, TS and GART, which indicates that hot-spot information for the binding pocket will be important for directing docking.

Careful interpretation of docking results (energy component analysis) should be regularly employed to identify possible errors caused by certain factors.

High quality test sets

Enrichment data sets (known ligands and decoys datasets)
i. Susan test set
ii. Enolase test set
iii. NCTR ER data set: 232 diverse compounds, covering a 10^6-fold range in a validated ER competitive binding assay, and NCTR AR data set: 202 diverse compounds (Tong et al., 2001)
iv. McMaster DHFR data set (http://hts.mcmaster.ca)
v. Compumine ERalpha, MMP3, AChE and fXa data sets (http://www.compumine.com/research/scoring.html)

Docking and scoring test sets (experimental structures and binding affinities)
i. CCDC/Astex validation test set: 308 crystal complexes (http://ccdc.cam.ac.uk)
ii. X-CScore dock set: 100 crystal complexes and binding affinities (Wang et al., 2003)

Suggestion

What is the first, and possibly the second, putative principal component that, if fixed, would make the enrichment better?

For each improvement that could be made, give your estimate of what should be done, how much effort it would take, and the likelihood of improvement.

Look closely at the active-site residues (ionization and protonation states), and use the top decoy compounds to identify the residues that contribute to overestimation of the docking energy.

Acknowledgement

John @ Shoichet
CK @ Jacobson
Ursula & Eswar @ Sali