Predicting Protein-Ligand Interactions using Iterative Stochastic Elimination Algorithm

Boris Gorelik, B.Pharm, M.Sc

A dissertation submitted to the Hebrew University of Jerusalem

for the degree of Doctor of Philosophy

·2007·

This work was conducted under the supervision of

Professor Amiram Goldblum

Abstract

Molecular Docking is an “in silico” process for predicting the structure of

receptor-ligand complexes. Such a prediction is of great importance in vari-

ous fields of life sciences, mainly in drug design efforts. Numerous methods

for solving this problem have been developed, employing a plethora of algo-

rithms.

Four main difficulties affect docking algorithms: the vast space that
needs to be searched, the scoring of ligand poses, the flexibility of both
partners (protein and ligand), and the treatment of water molecules that
frequently mediate intermolecular interactions.

This work presents ISE-dock – a docking program which is based on

the Iterative Stochastic Elimination (ISE) algorithm. ISE is a generic opti-

mization algorithm that is based on elimination of values that consistently

lead to the worst results. It constructs large sets of near optimal solutions

with no additional computational cost compared to producing single poses.

ISE-dock is based on the source code of AutoDock v.3.0.5 and uses its

scoring function. Development of a scoring function is beyond the scope of

this work. Unlike the original AutoDock program, ISE-dock is capable of

dealing with conformational changes of the receptor. The changes in the re-

ceptor’s backbone are implemented in an implicit, multistep way. Using this

approach, multiple structures of the protein are generated by an ISE-based

program. The resulting structures serve as a target for subsequent docking of

the ligand. Explicit handling of changes in the protein 3D structure is made

possible by “tearing off” side-chain atoms from the protein. Thus, movable

protein atoms are treated as a part of the ligand. In the current version of

ISE-dock, such a handling of protein flexibility is limited to unconstrained

rotations of protein side chain atoms. Although not done in this work, it is

possible to use rotamer libraries to decrease the complexity of the problem,

thus the experiments presented here represent the “worst case” scenario

in terms of side chain flexibility.

The ISE algorithm begins by constructing a matrix that contains a set of the

possible (discrete) values for each degree of freedom (variable) that defines

the problem (system). If the problem is prediction of molecular conformation

and the degrees of freedom are rotatable bonds, then the angular rotations

around each bond are its discrete variables. One value is picked randomly out

of the set of each variable, to determine the full configuration (conformation)

of the system, which is then evaluated by a scoring function. This step

is repeated many times to form a large sample, usually in the 10^4 – 10^6

range. The scores of that sample are arranged in a virtual histogram in which

only a small fraction (1%-10%) of worst and of best results are examined in

detail, to assess the contribution of each and every variable value on the final

scores. A value that appears in the worst results with a significantly higher

frequency than expected from its random distribution (based on its total

appearance in the full sample) or one that appears with a significantly lower

frequency than expected among the best results, is marked for elimination.

The next iteration of random picking, scoring, sampling and eliminating thus

begins with a smaller number of possible combinations. The elimination

process is performed iteratively until the number of possible conformations

enables exhaustive search in feasible time. Additions in this work to previous

applications of ISE include:

• Local optimization of a randomly picked fraction of the sampled struc-

tures during the stochastic search (as performed by LGA in Auto-

Dock [Morris et al. J Comp Chem 1998, 19, 1639–62] ). During the

final, exhaustive search step, any screened conformation has a proba-

bility of 60% to undergo optimization. The main purpose of the local

optimization step is to solve clashes and unfavorable conformations that

are caused by the discrete nature of the algorithm.

• Only a limited portion of the values may be discarded for any given

variable in any iteration.

• Keeping and updating a list of best encountered conformations. The

size of the list is user defined.
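The sampling and elimination loop described above can be illustrated with a minimal Python sketch. It is not the ISE-dock implementation: the toy torsion score, the fixed frequency-ratio threshold `ratio` (used here in place of a proper significance test), and all function and parameter names are assumptions made for illustration, and the local optimization step is omitted.

```python
import itertools
import math
import random
from collections import Counter


def ise_minimize(values, score, n_samples=2000, tail=0.1, ratio=1.5,
                 max_drop=0.25, exhaustive_limit=5000, keep=10,
                 max_iter=50, seed=0):
    """Toy ISE loop (lower score is better); returns the best (score, config) pairs."""
    rng = random.Random(seed)
    values = [list(v) for v in values]      # allowed discrete values per variable
    best = []

    def remember(s, cfg):
        # Keep and update a list of the best configurations encountered.
        if (s, cfg) not in best:
            best.append((s, cfg))
            best.sort()
            del best[keep:]

    def n_combinations():
        return math.prod(len(v) for v in values)

    n_tail = max(1, int(tail * n_samples))
    for _ in range(max_iter):
        if n_combinations() <= exhaustive_limit:
            break
        # Random picking and scoring of a large sample.
        sample = []
        for _ in range(n_samples):
            cfg = tuple(rng.choice(v) for v in values)
            sample.append((score(cfg), cfg))
            remember(*sample[-1])
        sample.sort()
        removed = 0
        for i, allowed in enumerate(values):
            total = Counter(cfg[i] for _, cfg in sample)
            worst = Counter(cfg[i] for _, cfg in sample[-n_tail:])
            top = Counter(cfg[i] for _, cfg in sample[:n_tail])
            flagged = []
            for val in allowed:
                expected = total[val] * n_tail / n_samples
                # Over-represented among the worst or under-represented among the best.
                if worst[val] > ratio * expected or top[val] < expected / ratio:
                    flagged.append((worst[val] - top[val], val))
            # Only a limited portion of a variable's values may be discarded.
            limit = min(max(1, int(max_drop * len(allowed))), len(allowed) - 1)
            for _, val in sorted(flagged, reverse=True)[:limit]:
                allowed.remove(val)
                removed += 1
        if removed == 0:
            break                            # nothing to eliminate under these thresholds
    if n_combinations() <= exhaustive_limit:
        # Final exhaustive search over the surviving values.
        for cfg in itertools.product(*values):
            remember(score(cfg), cfg)
    return best


# Toy problem: six "rotatable bonds" sampled in 30-degree steps.
target = (30, 90, 150, 210, 270, 330)


def torsion_score(cfg):
    return sum(1.0 - math.cos(math.radians(a - t)) for a, t in zip(cfg, target))


best = ise_minimize([range(0, 360, 30)] * 6, torsion_score)
print(best[0])
```

The returned list plays the role of the user-sized population of near optimal solutions; in a docking setting the score would be replaced by the docking energy function and the variables by translations, rotations and torsions.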

ISE-dock was validated using four independent data sets. Flexible lig-

and – rigid protein docking was validated using 81 protein-ligand complexes

from the PDB and ISE-dock performance was compared to those of Glide,

GOLD and AutoDock. Flexible ligand – flexible protein docking was

tested using three additional data sets: collagenase (backbone flexibility, 2

complexes), Acetylcholine Esterase – AChE (flexibility of a single side chain,

2 complexes) and trypsin (flexibility of several side chains, 10 complexes).

When no protein flexibility is allowed, ISE-dock has a better chance
than the other three programs of placing more than 60% of the top single
poses within RMSD = 2.0 Å, and more than 80% within RMSD = 3.0 Å, of the
experimental structure. ISE alone produced at least one solution of 3.0 Å
or better among the top 20 poses for the entire test set. In 98% of the
examined molecules, ISE produced solutions that are closer than 2.0 Å to
the experimental structure. Paired t-tests (PTT) were used throughout to
assess the significance of comparisons between the performance of the
different programs. ISE-dock provides more than 100 times as many docking
solutions as LGA in AutoDock in a similar time frame. The usefulness of

the large near optimal populations of ligand poses is demonstrated by show-

ing a correlation between the docking results and experiments that support

multiple binding modes in p38 MAP kinase [Pargellis, C. et al. Nat Struct

Biol 2002, 9, (4), 268–72] and in Human Transthyretin [Hamilton JA, Benson

MD. Cell Mol Life Sci 2001; 58(10):1491–1521].

Introduction of partial handling of protein flexibility into ISE-dock re-

quires several changes to the original scoring function, which has a strong

impact on the quality of the top ranked solutions. Nevertheless, the entire

docking solutions in this work always contain ligand poses of reasonable to

very high quality.

Docking of a flexible ligand into a protein while partially “unfreezing” the
backbone was tested on two collagenase-inhibitor complexes from the PDB
(PDB codes: 456c, 966c). In this case, the bound docking solutions contain
ligand poses with reasonably low RMSD values of 1.33 Å (456c) and 1.18 Å
(966c).

Two structures of AChE (4 cross docking experiments) and 10 struc-

tures of trypsin (100 cross docking experiments) with their respective in-

hibitors demonstrate the capabilities of ISE-dock to deal with protein side

chain flexibility. In both cases, high quality docking solutions are obtained

in terms of RMSD of all movable atoms from their experimental positions.

Docking populations for AChE contain solutions with RMSD ≥ 0.37 Å, and
in the “worst” case, RMSD ≥ 0.85 Å. In 74 (out of 100) cases in the trypsin
data set, the top 20 docking solutions contain poses with RMSD < 2.0 Å. In
94 cases, the entire docking sets contain solutions with RMSD < 2.0 Å and all
docking sets contain solutions with RMSD < 3.0 Å.

This work shows that ISE-dock is superior in many aspects to the cur-

rently well established docking programs Glide, GOLD and AutoDock

in flexible ligand – rigid protein docking. It has also been shown that
ISE-dock deals successfully with various degrees of protein flexibility. In order
to handle flexible proteins to the full extent, the scoring scheme needs to be

redesigned. The latter task is beyond the scope of this work.

Protein flexibility is an important aspect of a protein-ligand docking pro-

gram. Other degrees of freedom that were not accounted for in this work,

but that can be introduced into ISE-dock relatively easily are modeling

of structurally important water molecules and protonation and tautomeric

states of the interacting molecules.

Contents

1 Introduction
1.1 Current drug discovery process
1.2 Flexibility in molecular interactions
1.3 Energy and thermodynamic potentials
1.4 Common energy components
1.5 Force fields and scoring functions
1.5.1 Force field based energy functions
1.5.2 Approximate energy functions
1.5.3 Statistical potentials
1.5.4 Geometric and chemical complementarity functions
1.6 Energy funnels
1.7 Multiple binding modes
1.8 Docking techniques
1.8.1 Flexibility in docking programs
1.8.2 Search algorithms
1.8.3 Evaluating docking programs
1.9 Open problems and issues

2 Methods
2.1 Energy function
2.2 AutoDock docking program
2.2.1 Lamarckian Genetic Algorithm
2.2.2 Problem representation
2.3 ISE-dock program
2.3.1 Iterative Stochastic Elimination algorithm
2.3.2 Problem representation
2.3.3 Protein flexibility
2.4 Rigid protein docking
2.4.1 LGA docking
2.4.2 The data set
2.4.3 Comparisons and their analysis
2.4.4 Paired t-test
2.4.5 Comparing CPU time
2.4.6 Energy funnels
2.5 Flexible protein docking
2.5.1 Protein backbone flexibility
2.5.2 Flexibility of a single side chain
2.5.3 Flexibility of several side chains
2.5.4 Comparisons and their analysis

3 Flexible ligand – rigid protein docking
3.1 Top scoring poses
3.2 Top 20 poses
3.3 Solution space coverage
3.4 Time performance
3.5 Multiple binding modes
3.6 PDB data supports distinct funnels

4 Flexible Ligand – Flexible Protein Docking
4.1 Protein backbone flexibility
4.2 Flexibility of a single side chain
4.3 Flexibility of several side chains
4.4 Discussion on protein flexibility

5 Conclusions

Appendices (submitted separately)

A Results published in a peer reviewed journal

B ISE-dock and AutoDock parameters and their values
B.1 AutoDock parameters and their default values
B.2 ISE-dock parameters and their default values

C Detailed Results
C.1 Flexible ligand – rigid protein docking results
C.2 Flexible ligand – rigid protein docking energy landscapes

D Flexible ligand – flexible protein docking. Trypsin data set

List of Figures

List of Tables

Acknowledgments

Bibliography

Hebrew abstract

Chapter 1

Introduction

1.1 Current drug discovery process

Since the dawn of history, humankind has been searching for ways to fight

diseases and improve the quality of life. Modern science has undergone

tremendous developments and has successfully developed a great variety of

medicines. Nevertheless, the constant search for better drugs that reduce

side effects, cure more diseases, and extend life expectancy and quality has

never stopped. Drugs have traditionally been discovered by experimental

methods, but more recently, computerized (virtual) drug discovery methods

have been devised and prove to be helpful in the process of drug discovery

and in designing drugs. Figure 1.1 presents an overview of current methods

for designing drugs and discovering them. Roughly, the systematic search
for new active molecules can be divided into three categories: classical
chemistry drug discovery, high throughput screening and virtual high
throughput screening.

Figure 1.1: Schematic diagram of the main methods in the drug discovery process. Arrows designate process flow. Black asterisks mark steps that may involve molecular docking. Abbreviations: SAR – structure-activity relationship; QSAR – quantitative SAR; ADME-Tox – absorption, distribution, metabolism, elimination, toxicity

Classical chemistry drug discovery. During the classical drug design
process, medicinal chemists use their personal experience, combined with
rationalization of the knowledge of active compounds and the suspected drug target.

The process involves iterations of data evaluation, synthesis and purifica-

tion, and assessment of biological activity. Only a few compounds can be

processed simultaneously using this approach. This approach is still labor-

intensive, slow, and expensive, requiring costly materials and techniques.

High throughput screening. In several large and medium-sized pharmaceutical
companies, high throughput screening (HTS), in which the activities of
hundreds of thousands of compounds are scanned robotically, has become a
major method. The targets for screening can be single molecules, colonies
of bacteria, fungi, or animal cells. In experiments of this kind, the effect

is recorded using fast and, sometimes, non-specific parameters such as color

change, conductivity of electric current, particle count, etc. HTS experiments

are frequently conducted without exact knowledge about the target structure

or about the mechanism of action. While faster than the first approach, HTS

often suffers from ambiguity during the process of results interpretation and

still may require expensive materials and equipment.

Virtual high throughput screening (V-HTS). In order to save time and

reduce costs, virtual HTS is designed to mimic the HTS task in silico and

is expected to indicate which compounds are worth testing in “wet” exper-

iments. Instead of screening real compounds against real targets, virtual

computer libraries of existing and not yet existing chemicals are used. Nat-

urally, this process is much cheaper and, usually, faster than the two former

ones. On the other hand, the V-HTS process relies heavily on the con-

struction and validation of the underlying computational methods and on

the interpretation of the results. The availability of fast, validated and
accurate computational screening methods is usually the major bottleneck of
the V-HTS approach. The main tool of V-HTS is molecular “docking”, in which
a ligand or potential drug is “driven” in order to find a good “parking place”
on the biological target.

Docking programs are computational tools that model the structure and

the nature (affinity) of molecular complexes. These programs aim to predict

geometry of inter- and intra-molecular interactions and to rank the various

possibilities. The main advantage of computational techniques in general,

and of docking programs particularly, is that they are much cheaper and

faster than the corresponding “wet” techniques. Docking programs are used

as a primary screening tool during the virtual high throughput process. They

also assist biologists, biochemists, and medicinal chemists in designing novel

molecules and in interpreting experiments that assess the activity of already

existing ones.

Two main goals of docking tools are (1) to assist in designing novel chemi-

cal compounds, and (2) to study the nature of interactions between biological

targets and ligands. These may include endogenous molecules such as hor-

mones, or external ones such as drugs or toxic compounds.

Every docking program requires that the three dimensional (3D) structure
of the target molecule be known to some extent. The Protein Data Bank
(PDB)[5] is a publicly available repository that contains more than 42,000 3D
structures of biological macromolecules solved at various resolutions.
Since 1999, the U.S. National Institute of General Medical Sciences

(NIGMS) has sponsored a large scale project called the Protein Structure

Initiative (PSI)[75]. The main goal of this initiative is to enlarge the number

of solved 3D structures of proteins, which would enable better coverage of

the existing drug targets and the discovery of new ones. Since the
establishment of the PSI, the project has yielded more than 1,800 solved protein
structures (as of June 2006), with a current estimated rate of more than 500
solved structures per year[67].

Despite the progress that the field of docking has undergone in the last

few years, several problems still exist. One of the major problems is energy
calculation. Another major problem is accounting for the many degrees of
freedom of the docking problem. These include flexibility of the molecules,
protonation and tautomeric states, etc. Considering all these degrees of
freedom results in a tremendous combinatorial space that each docking tool has
to search. Due to the flexible nature of molecules, it is important not to limit
the scope of the docking solution to a single structure, but instead to predict
a collection (“ensemble”) of multiple low energy conformations that contribute
to the biological activity.

In this work, I present ISE-dock – a protein-ligand docking tool that
successfully overcomes the huge combinatorial space problem, while
accounting for ligand and, to a lesser extent, protein flexibility, and that is capable of
producing arbitrarily large docking populations without substantial extension
of CPU time.

1.2 Flexibility in molecular interactions

Since 1894, when Emil Fischer proposed the famous lock and key model[21],

the perception of the nature of binding between biological molecules has
undergone several changes. Although evidence that supports the lock and key
model exists (see for example [6, 55, 22]), two models that are considered

to represent the majority of receptor-ligand interactions are induced fit [48]

and equilibrium of multiple pre-existing conformations [63, 52, 76]. Induced

fit theory assumes that the conformation of the target and ligand affect each

other as they approach an encounter. The conformation of the final complex

may not be derived directly from the conformation of the separate molecules.

The “pre-existing conformations” model assumes that the final target and
ligand conformations are already sampled by the isolated molecules, but that
they could be of much higher energy than the most abundant conformation, so
that their population is minute in the absence of the partner. It is not
uncommon that the most populated unbound states of a protein are not those that
are most populated in the bound structure[97, 10]. The same is true
for ligands: it was found[83] that ligands rarely bind their receptors in the
calculated global minimum conformation. Moreover, in 60% of the cases, the
bound ligand is not found even in a local energy minimum, and at least
10% of the examined ligands bind with strain energies over 9 kcal/mol.

Many theoretical and experimental studies support either the induced fit

or pre-existing populations models in different cases of binding[37, 55, 63].

From a thermodynamic point of view, the two models are equivalent;
however, describing biological systems in terms of pre-existing populations and
conformational selection is more useful in the process of drug discovery[97].

Regardless of which of the two models more accurately describes the na-

ture of binding, it is clear that molecular flexibility is involved in complex

formation.

The process of binding may result in either increase or decrease of flexibil-

ity. Decreased flexibility may be attributed to enthalpy-entropy compensa-

tion, when more effective binding interactions are gained by freezing motion.

On the other hand, complex formation may be stabilized by entropic contri-

bution, associated with increased flexibility[116]. It has been suggested[101]

that in 13 different MHC receptor-peptide complexes, flexibility is
associated with as much as 50% of the free energy of binding.

Flexibility plays an important role not only in complex formation, but also

in the mechanism of action of various complexes. For example, the conforma-

tional changes of several enzymes are very important for their activity[10, 40,

51]. Solved structures of protein-ligand complexes frequently show complexes

with 70 – 100% of the ligands’ surface area buried. Clearly, conformations
of this kind could not be achieved without at least a minimal degree of protein
flexibility. Studies that analyze bound and apo-proteins show that although
there are complexes in which the protein undergoes almost no change upon
ligand binding[41], proteins that bind small molecules usually undergo
conformational changes[97, 61, 60, 88].

1.3 Energy and thermodynamic potentials

The three most common thermodynamic potentials are: internal energy, en-

thalpy and Gibbs free energy.

Internal Energy. The internal energy (denoted as U or E) of a
thermodynamic system is the total kinetic energy due to the motion of particles and

the potential energy associated with the vibrational and electronic energy

of atoms, including the energy of chemical bonds. Internal energy does not

include the kinetic energy due to the motion of the system as a whole. It

does not account for potential energy due to the position of the system in an

external gravitational, electric or magnetic field.

The internal energy is essentially defined by the first law of thermody-

namics, which states that energy is conserved:

∆U = Q+W +W ′ (1.1)

Where ∆U is the change in internal energy of a system during a process, Q

is heat “added” to a system, W is the mechanical work “done on” a system,

and W ′ is energy added by all other processes.

Most biological interactions occur in a fluid environment. In such an en-

vironment the mechanical work done on the system is related to the pressure

(P ) and the volume (V ):

δW = −PdV (1.2)

The heat energy is a function of temperature (T ) and entropy (S):

δQ = TdS (1.3)

Thus, the internal energy of a system of biological interest may be expressed:

dU = TdS − PdV (1.4)

Enthalpy. Enthalpy or heat content (denoted as H or ∆H) describes the
amount of “useful work” that may be obtained from a closed thermodynamic
system, under constant pressure. In the absence of an external field, enthalpy
is defined as:

H = U + PV (1.5)

Where U, P and V are internal energy, pressure and volume respectively.
Enthalpy is sometimes referred to as heat content because under constant pressure
and volume:

dH = δQ ≤ TdS (1.6)

Thus, the difference in enthalpy is the maximum amount of thermal energy

that may be obtained from a system.

The total enthalpy of a system cannot be measured directly; the change in
enthalpy, ∆H, is measured instead. In an exothermic reaction at constant
pressure, the change in enthalpy equals the energy released from the
system. Similarly, in endothermic reactions, ∆H equals the energy absorbed
by the system. If the system is kept under constant pressure and constant
volume, the change in enthalpy equals the amount of heat that is released from
or absorbed by the system.

Gibbs Free Energy. The Gibbs free energy (which is frequently referred to
simply as free energy) is defined as:

\[ G = U + PV - TS = H - TS = \sum_{i=1} \mu_i N_i \tag{1.7} \]

Where: U is the internal energy; P is pressure; V is volume; T is the
temperature; S is the entropy; µi is the chemical potential of the i-th chemical
component; Ni is the number of particles (or number of moles) composing
the i-th chemical component. It can be shown that

∆G = ∆H − T∆S (1.8)

Where ∆S is the change in the internal entropy of the system. The value

of ∆G from equation (1.8) is used to determine whether a chemical reaction

is favorable or not: reactions with ∆G < 0 will occur spontaneously, while

those with ∆G ≥ 0 will not.
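As a small numerical illustration of equation (1.8), the following Python sketch evaluates ∆G for two hypothetical processes that share the same enthalpy change but differ in the sign of the entropy change. The numbers and the function name are assumptions chosen only to show the sign logic; they are not taken from any experiment.

```python
def gibbs_change(delta_h, delta_s, temperature):
    """Return dG = dH - T*dS (dH in kcal/mol, dS in kcal/(mol*K), T in K)."""
    return delta_h - temperature * delta_s


T = 300.0  # assumed temperature in K

# Hypothetical process 1: favorable enthalpy and favorable entropy.
dg_favorable = gibbs_change(delta_h=-5.0, delta_s=0.010, temperature=T)

# Hypothetical process 2: the same enthalpy gain, outweighed by an entropy loss.
dg_unfavorable = gibbs_change(delta_h=-5.0, delta_s=-0.020, temperature=T)

for label, dg in (("process 1", dg_favorable), ("process 2", dg_unfavorable)):
    verdict = "spontaneous" if dg < 0 else "not spontaneous"
    print(f"{label}: dG = {dg:+.1f} kcal/mol ({verdict})")
```

The second case shows how an entropic penalty can cancel a favorable enthalpy, which is the compensation effect referred to in the discussion of flexibility.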

Binding Affinity. Non-covalent receptor-ligand interactions may be written
in the following general form[72]:

\[ RL \underset{k_a}{\overset{k_d}{\rightleftharpoons}} L + R \]

Where: R, L and RL are the receptor, ligand and receptor-ligand complex,
respectively; kd and ka are kinetic constants of dissociation and
association, respectively. This reaction describes dissociation of a receptor-ligand
complex. The thermodynamic equilibrium constant of this reaction in ideal
conditions is defined as:

\[ K_d = \frac{[R][L]}{[RL]} \tag{1.9} \]

Where [X] denotes the molar concentration of the component X. The
equilibrium constant can be related to the change in the Gibbs free energy (eq.
(1.8)) of binding, that is, of the reverse (association) direction of the above
reaction:

\[ \Delta G = \Delta G^0 = RT \ln K_d \tag{1.10} \]

Here, R is the universal gas constant, and T is the absolute temperature.
∆G^0 is the free energy change at equilibrium under standard conditions (all
the chemical components are at 1 M concentration, T = 273.15 K, pressure =
1 atm).
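To give a feeling for the magnitudes involved in equation (1.10), the following Python sketch evaluates RT ln Kd for typical dissociation constants. The temperature of 298.15 K, the kcal/mol unit system and the function names are assumptions made for this illustration; Kd is taken in mol/L relative to the 1 M standard state.

```python
import math

R_KCAL = 1.987204e-3  # universal gas constant in kcal/(mol*K)


def free_energy_from_kd(kd_molar, temperature):
    """Evaluate RT ln(Kd) of equation (1.10); Kd in mol/L, T in kelvin."""
    return R_KCAL * temperature * math.log(kd_molar)


def kd_from_free_energy(delta_g, temperature):
    """Invert equation (1.10): Kd = exp(dG / RT)."""
    return math.exp(delta_g / (R_KCAL * temperature))


T = 298.15  # assumed temperature in K

dg_nanomolar = free_energy_from_kd(1e-9, T)    # a tight binder
dg_micromolar = free_energy_from_kd(1e-6, T)   # a weaker binder
per_decade = free_energy_from_kd(10.0, T) - free_energy_from_kd(1.0, T)

print(f"Kd = 1 nM -> {dg_nanomolar:.2f} kcal/mol")
print(f"Kd = 1 uM -> {dg_micromolar:.2f} kcal/mol")
print(f"each tenfold change in Kd shifts dG by {per_decade:.2f} kcal/mol")
```

Because the dependence is logarithmic, an error of a little over 1 kcal/mol in a computed free energy already corresponds to an order of magnitude in the predicted dissociation constant.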

In attempts to calculate the change in free energy upon binding (free

energy of binding), it is customary to separate the overall energy into distinct

components. These components usually include entropy loss due to

association, entropy gain of water due to binding of the ligand (hydrophobic

effect), entropy loss in the receptor and the ligand due to constraints of

internal degrees of freedom, interaction between the ligand and the receptor,

and changes in the conformational (internal) energy of the molecules upon

binding.

The basic assumption of most of the works on experimental or computa-

tional determination of binding energy is that different contributions to the

binding energy are independent and additive. Thus binding energy may be

written as a sum of its components[72]:

\[ \Delta G_{bind} = \Delta G_{solvent} + \Delta G_{conf}^{receptor} + \Delta G_{conf}^{ligand} + \Delta G_{int} + \Delta G_{motion} \tag{1.11} \]

One should note that, based on the principles of independence and
additivity of energy components, many other variants of this equation may be

written. Furthermore, the same assumption of additivity and independence

allows the creation of statistical functions that approximate the binding free

energy without direct connection to the underlying physical and thermody-

namic processes.

1.4 Common energy components

Based on equation (1.11), energy calculations are divided into distinct

components. In this section I will describe the most commonly used terms

of energy functions. This list is by no means complete, but rather serves as

a brief introduction.

Physically based potentials are mainly divided between bonding and non-

bonding expressions. Supplementary expressions for solvation or entropy loss

due to restricted rotations are sometimes added.

Non-bonding expressions

It is common to model pairwise interactions between atoms that are separated
by at least 4 covalent bonds in terms of electrostatic (Coulomb) and Van der
Waals interactions.

Coulomb potential. We use the Coulomb potential to estimate the enthalpy
contribution of any two charged particles to the overall potential energy:

\[ E_{el} = \frac{Q_1 Q_2}{\varepsilon r} \tag{1.12} \]

Where Q1 and Q2 are the partial charges of the two particles, r is the distance
between them, and ε is the dielectric constant of the separating
medium. In vacuum, ε equals 1. Figure 1.2 shows a typical shape of the
electrostatic potential of charged particles.
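A minimal Python sketch of the Coulomb term follows, assuming the standard form in which the dielectric constant divides the charge product. The conversion factor of about 332.06 kcal·Å/(mol·e²), which gives energies in kcal/mol for charges in units of the electron charge and distances in Å, is a common unit convention assumed here, as are the function name and the example values.

```python
COULOMB_KCAL = 332.0637  # e^2 / (4*pi*eps0) in kcal*Angstrom/(mol*e^2)


def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Pairwise electrostatic energy in kcal/mol (charges in e, r in Angstrom)."""
    if r <= 0.0:
        raise ValueError("distance must be positive")
    return COULOMB_KCAL * q1 * q2 / (dielectric * r)


# Unit charges 3 Angstrom apart, in vacuum and in a water-like medium.
attract = coulomb_energy(+1.0, -1.0, 3.0)                    # opposite charges
repel = coulomb_energy(+1.0, +1.0, 3.0)                      # identical charges
screened = coulomb_energy(+1.0, -1.0, 3.0, dielectric=80.0)  # approximate value for water

print(f"vacuum, opposite charges:  {attract:8.2f} kcal/mol")
print(f"vacuum, identical charges: {repel:8.2f} kcal/mol")
print(f"dielectric 80, opposite:   {screened:8.2f} kcal/mol")
```

The strong damping in a high dielectric medium is one reason why the choice of ε has a large influence on computed electrostatic contributions.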

Figure 1.2: Typical shapes of electrostatic interaction energy. The energies of two identical (full line) and opposite (dashed line) charges in vacuum are shown

Hydrogen bonds. The hydrogen bond (H-bond) effect is closely related to
electrostatic interactions. This effect is caused by the interaction of
electronegative atoms with hydrogen connected to other electronegative atoms. The
nature of H-bonds allows charge transfer along the bond. The strongest
H-bond effect is achieved when the three interacting atoms (hydrogen donor,
hydrogen atom and hydrogen acceptor) and the mediating lone electron pair
lie on a single line. To account for this directionality, many force fields
contain explicit terms for the angle of the H-bond. For example, following is
the H-bond component of the MM3 force field[58] that demonstrates an explicit
term for the angle θ between the interacting atoms:

\[ E_{HB} = \varepsilon^{\#}\left[1.84\times10^{5}\,e^{-12.0/P} - 2.25\,\frac{P^{6}}{D}\left(\frac{l}{l_0}\right)\cos\theta\right] \tag{1.13} \]

Where l and l0 denote the actual and the reference H-bond lengths,
respectively, ε# is the depth of the energy potential well, P is the sum
of the van der Waals radii of the atoms divided by the effective
interatomic distance between them and D is the dielectric constant. The
dependence of energetics on the angular relations of H-bonds plays an
important role in the specificity of molecular interactions. Figure 1.3 shows
examples of typical inter- and intra-molecular hydrogen bonds.

Figure 1.3: Examples of inter- (left) and intra- (right) molecular H-bonds

The majority of existing scoring functions do not include explicit terms

for hydrogen bonds[54], but rather rely on Van der Waals or electrostatic

interactions.

Van der Waals interactions. Van der Waals (VdW) forces account for both
attraction and repulsion of non-bonded atoms. Usually, the Van der Waals
enthalpy contribution of atoms is estimated using the Lennard-Jones (LJ)
potential:

\[ E_{VdW} = \sum_{i}^{N-1} \sum_{j=i+1}^{N} 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r}\right)^{12} - \left(\frac{\sigma_{ij}}{r}\right)^{6}\right] \tag{1.14} \]

Where εij is the depth of the potential well between the atoms i and j, r is
the distance between the two atoms, σij is the distance at which the inter-particle
potential is zero and N is the number of atoms.

Equation (1.14) is sometimes referred to as the 6-12 LJ potential, as
opposed to the 4-10 potential, a more “smoothed” estimate with a weaker repulsion
effect. Figure 1.4 presents the shape of the Van der Waals potential of two
identical atoms. Although equation (1.14) is the most frequently encountered
one, there are other ways to estimate Van der Waals energy (for example
Hill’s equation[38]).
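The 6-12 form can be sketched in a few lines of Python. The sketch assumes the usual sign convention, in which the repulsive r^-12 term is positive and the attractive r^-6 term is negative, and it uses a single ε and σ for all pairs (identical atoms) rather than pair-specific εij and σij. The argon-like parameters are approximate textbook-style values used only for illustration, and the function names are hypothetical.

```python
import itertools
import math


def lennard_jones(r, epsilon, sigma):
    """6-12 Lennard-Jones energy of one atom pair at distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def total_lj_energy(coords, epsilon, sigma):
    """Sum the pair energy over all unique atom pairs (i < j)."""
    return sum(lennard_jones(math.dist(a, b), epsilon, sigma)
               for a, b in itertools.combinations(coords, 2))


EPS_AR = 0.238    # approximate well depth for argon, kcal/mol (illustrative)
SIGMA_AR = 3.40   # approximate zero-crossing distance for argon, Angstrom (illustrative)

r_min = 2.0 ** (1.0 / 6.0) * SIGMA_AR   # distance of the energy minimum
dimer = [(0.0, 0.0, 0.0), (r_min, 0.0, 0.0)]
e_dimer = total_lj_energy(dimer, EPS_AR, SIGMA_AR)

print(f"E(sigma) = {lennard_jones(SIGMA_AR, EPS_AR, SIGMA_AR):.3f} kcal/mol")
print(f"E(r_min) = {e_dimer:.3f} kcal/mol at r = {r_min:.2f} Angstrom")
```

The energy crosses zero at r = σ and reaches its minimum of −ε at r = 2^(1/6) σ, which is the shape described for two identical atoms.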

Figure 1.4: Van der Waals interaction energy of the argon dimer. Taken from Wikipedia[113] under the GNU Free Documentation License

Bonding expressions

The three most common terms that describe the contribution of bonding

interactions to the overall energy are bond stretching, angle bending and

bond rotation (torsion).

Bond stretching One of the equations that describe the potential energy

for a covalent bond is:

$$E_{stretch} = D_e \left( 1 - e^{-\alpha (r - r_0)} \right)^2 \tag{1.15}$$

In this equation (often referred to as the Morse potential), De is the depth of the energy minimum, r0 is the reference bond length, and α = ω√(µ/(2De)), where µ is the reduced mass and ω is the bond vibration frequency.

To simplify the energy calculations, a harmonic potential is often applied to bond stretching (Hooke’s law). Although less accurate, the harmonic potential is faster to calculate and is accurate enough near the bottom of the potential well.

$$E_{stretch} = \frac{1}{2} k (r - r_0)^2 \tag{1.16}$$

Figure 1.5 presents the shapes of the Morse and Hooke’s potentials around the minimum.

Figure 1.5: Comparison of Morse (dashed line) and Hooke’s harmonic (full line) potentials of bond stretching energy around the minimum. To construct this graph, all the parameters in equations (1.15) and (1.16) were assigned the value of 1.
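The relation between the two bond stretching potentials can be checked numerically. The sketch below is only an illustration: unlike Figure 1.5, it does not set every parameter to 1 but assumes a harmonic force constant matched to the curvature of the Morse well at r0, k = 2·De·α², which follows from a second-order Taylor expansion of the Morse potential. All names and values are hypothetical.

```python
import math

def morse(r, d_e, alpha, r0):
    """Morse bond stretching energy."""
    return d_e * (1.0 - math.exp(-alpha * (r - r0))) ** 2

def harmonic(r, k, r0):
    """Hooke's law bond stretching energy."""
    return 0.5 * k * (r - r0) ** 2

d_e, alpha, r0 = 1.0, 1.0, 1.0
k_matched = 2.0 * d_e * alpha ** 2   # curvature of the Morse well at r0 (assumption of this sketch)

for r in (0.9, 1.0, 1.1, 2.0, 4.0):
    print(f"r={r:.1f}  Morse={morse(r, d_e, alpha, r0):.4f}  "
          f"harmonic={harmonic(r, k_matched, r0):.4f}")
```

Near r0 the two curves are nearly identical, while at large separations the Morse energy levels off at De (bond dissociation) and the harmonic energy keeps growing.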

Angle bending The angle bending contribution to the potential energy may

be estimated using the following equation:

$$E_{bending} = \frac{1}{2} (\theta - \theta_0)^2 \left[ 1 - k_1 (\theta - \theta_0) - k_2 (\theta - \theta_0)^2 - k_3 (\theta - \theta_0)^3 \ldots \right] \tag{1.17}$$

Where θ is the angle, θ0 is the reference angle and k1, k2, . . . are force constants specific to the bonds that form the angle. A good approximation of this general form is Hooke’s harmonic potential:

$$E_{bending} = \frac{k}{2} (\theta_0 - \theta)^2 \tag{1.18}$$

Bond torsion One of the possible equations that describe the contribution

of torsions around chemical bonds is

$$E_{torsion} = \sum_{n=0}^{N} C_n \cos^n(\omega) \tag{1.19}$$

Where Cn are force constants, ω is the torsion angle, and N is the number of rotating bonds. Although many force fields use the above equation for bond torsion, more accurate estimates are sometimes needed. On the other hand, many force fields do not contain explicit terms for torsions[54]. In these cases non-bonding terms for Van der Waals and electrostatic interactions are used to achieve the desired potential profile.

Entropy estimation and solvation terms

A solute molecule that leaves the solution in favor of a complex with another

molecule produces two main effects on the system’s entropy. First, it changes

the micro-structure of the water bulk that surrounds the two solute molecules.

This change results in more water molecules that are capable of creating

hydrogen bonds between themselves. The second effect is the change in the

internal degrees of freedom.

Entropy change estimation is one of the most challenging problems in computational research of biological systems. The reason for the complexity of this task may be demonstrated by the Gibbs entropy formula:

$$S = -k_B \sum_{i=1}^{N} p_i \log p_i \tag{1.20}$$

Where kB is the Boltzmann constant, N is the number of possible discrete states of a system, and pi is the probability of a certain state. Evaluating equation (1.20) directly is extremely demanding. The large number of possible states of a system leads to very small values of pi, which in turn requires extensive sampling and may lead to a large accumulation of errors. Several additional ways to exactly evaluate the entropy exist, but they do not change the complex nature of the calculations. For a review on entropy calculations in biological systems see ref.[3].
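For a system small enough to enumerate, the Gibbs formula is straightforward to evaluate, as the following minimal Python sketch shows. The probability lists are made-up examples and the function name is an assumption of this illustration; the difficulty described above arises only when the number of states is too large for the probabilities to be sampled reliably.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def gibbs_entropy(probabilities, k_b=K_B):
    """S = -k_B * sum(p_i * ln p_i); states with p_i = 0 contribute nothing."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -k_b * sum(p * math.log(p) for p in probabilities if p > 0.0)

uniform = [0.25] * 4                 # four equally likely states
peaked = [0.97, 0.01, 0.01, 0.01]    # one dominant state

print(gibbs_entropy(uniform, k_b=1.0))  # equals ln(4) in units of k_B
print(gibbs_entropy(peaked, k_b=1.0))   # lower: the system is more ordered
```

A uniform distribution over N states gives the maximal value kB·ln N, and any preference for particular states lowers the entropy.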

1.5 Force fields and scoring functions

During the process of docking, many conformations are searched. The program needs to choose between the different conformations, thus each conformation is given a numerical value which, in most cases, is supposed to represent its relative stability. Computational functions that estimate the energy of the system can be based on the principles of classical physics (force field based functions). Another class of functions combines statistical physics equations with many approximations that are based on known macro-structures. This class of methods is often called approximate or knowledge based functions[82]. In addition, purely statistical scoring functions exist. Such functions are based on statistical analysis of various patterns, such as the distribution of contacts between different types of atoms[69]. Another approach to estimating the “fitness” of docking structures is to use shape complementarity.

1.5.1 Force field based energy functions

Force field based scoring functions are based on the equations that were

mentioned in Section 1.4. Two major such energy functions are AMBER

[14] and CHARMM [65]. These functions differ in atom typing, parameters

for the various terms and in the basic equations that build them up. The

main equation of the AMBER force field reveals the complexity that is

common to all the energy functions in this class:

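$$\begin{aligned} E_{total} = {} & \sum_{bonds} K_r (r - r_0)^2 + \sum_{angles} K_\theta (\theta - \theta_0)^2 \\ & + \sum_{dihedrals} \frac{V_n}{2} \left[ 1 + \cos(n\Phi - \gamma) \right] \\ & + \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} + \frac{q_i q_j}{\varepsilon r_{ij}} \right] \\ & + \sum_{i<j} \left[ \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right] \end{aligned} \tag{1.21}$$

In this equation, γ is the phase of the dihedral term and the last term is the estimation of hydrogen bond energy. The rest of the terms have already been discussed. A review of CHARMM, AMBER and other common force fields has been recently published[64].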

Due to their complexity, force field based scoring functions impose a relatively heavy computational load, which results in relatively low calculation speed. Thus, in the case of the docking problem, the full forms of these functions are mostly suitable for structure preparation before docking or for post-docking processing.

1.5.2 Approximate energy functions

As stated before, one of the major drawbacks of force field based scoring functions is their high computational cost due to the large number of energy terms and their complexity. Moreover, several terms, such as the solvation effect, the contribution of flexibility to the overall system energy and others, require sampling of multiple conformations in the solution space. To overcome this obstacle, several knowledge based potentials have been proposed. In this class of functions, the number of energy terms and the number of supported atom or bond types are reduced. The general form of the remaining terms resembles that of the force field based functions. The parametrization is done using statistical analysis of known structures of macromolecules. The structures are chosen according to the problem and may include folded proteins, proteins bound to other proteins, small molecules, DNA, etc. It is possible to calibrate the parameters using focused sets of structures (target tailored functions). Studies show that such a strategy improves the accuracy of scoring functions[11, 92]. Because the parametrization of knowledge based scoring functions is done using known macro-structures, they implicitly account for entropic effects such as solvation and changes in internal degrees of freedom. Estimation of entropic and solvation contributions to the overall binding affinity is usually done using one or more of the following terms[109, 49, 70]: hydrophobic match, solvent accessible surface (divided into atom types according to the extent of hydrophobicity/hydrophilicity), and the number of internal degrees of freedom (usually, the count of rotatable bonds). This support of entropic terms is gained without costly computations.

On the other hand, the calibration process does not account for non-native

structures. This might lead to meaningless results when one attempts to

quantitatively evaluate poses that reside far away from an energy minimum.

Most existing docking programs (for example AutoDock [33, 71, 70], FlexX [49], FlexE [12], Glide [23], GOLD [42] and others) use approximate scoring functions. It is possible to compensate for the relative lack of accuracy of this class of functions by further re-scoring docking candidates with or without an additional simulation step (such as minimization or molecular dynamics). This multistage approach was successfully adopted by several research groups[8, 108]. For example, in one work[108], molecular dynamics combined with MM-PBSA (molecular mechanics Poisson-Boltzmann/surface area) was used to re-rank the solutions suggested by DOCK 4.0. In that work, a conformation within 1.1 Å RMSD of an HIV-1 RT inhibitor was predicted before the 3D structure was published.

1.5.3 Statistical potentials

Another approach that simplifies energy calculations even further is to use purely statistical potentials. One such potential was proposed by Miyazawa and Jernigan[69]. In that work, inter-residue contacts in folded proteins were examined. It was found that different residue types occur near one another with different propensities. These findings were used to compare proposed folded structures in terms of probability. A similar statistical approach was also used to analyze the distribution of various intermolecular contacts in protein-protein[29] and protein-ligand[79] complexes. Like semi-empirical scoring functions, statistical potentials account implicitly for solvation and other entropic effects but are of limited validity when analyzing non-native structures.

Generally, statistical potentials provide high calculation speed which, unfortunately, comes at the expense of accuracy. Preliminary tests during the early stages of my research, using the probability tables provided in the work of Glasser et al.[29], produced unacceptable results. (These results are neither shown nor discussed in this work.)

1.5.4 Geometric and chemical complementarity functions

When two molecules bind to each other, a certain degree of shape comple-

mentarity has to exist[74, 43]. This notion serves as a rationale behind shape

complementarity or geometry complementarity scoring functions. Geometry

complementarity was the exclusive scoring scheme in many early docking

programs[74, 20, 25, 7]. Current scoring functions use additional criteria in order to improve accuracy. For example, in work by Bohacek and McMartin[68], the accessible protein surface was divided into hydrophobic, hydrogen-bond donating, or H-bond accepting zones. Other criteria for assessing chemical complementarity are based on partial charges, hydrophobicity/hydrophilicity, atom types, etc.

1.6 Energy funnels

As stated earlier, energy is a complex function that depends on an enormous

number of variables. The multidimensional hypersurface that describes the

energy as a function of all the relevant variables is known as the “energy land-

scape”. According to eq. (1.8), at equilibrium, any thermodynamic system is

supposed to reside in a minimum (local or global) of such a landscape; otherwise, the system would spontaneously move until it reaches one. One should note that the multidimensional energy hypersurface most probably contains many local minima, as opposed to a single global one.

The existence of funnels in the energy landscape has been proposed

for protein folding[16, 56, 105, 100, 52] and has been further expanded for

protein-protein[99, 115] and protein-ligand recognition[109, 96]. It has been suggested that the shape of the energy landscape correlates with the nature of protein folding or of binding between the molecules.

The funnel-shaped energy landscape theory suggests that a single-minimum energy landscape may represent an extremely stable folded structure or the “lock and key” binding mechanism. Several minima at the bottom of the energy landscape with small barriers between them may be a result of induced fit or non-specific interactions. Finally, a rugged landscape with multiple minima separated by relatively high energy barriers may indicate domain swapping or the existence of multiple binding modes.

1.7 Multiple binding modes

As was presented previously, biomolecules are flexible and mobile entities.

The molecular thermal motion results in a reality that is dramatically dif-

ferent from the static picture that is seen in structures solved with X-rays

or even in the multiple structures obtained by NMR. Although eq. (1.8)

implies that any system at thermodynamic equilibrium resides in a single

energy minimum, the real-life situation is quite different. The constant thermal motion and ever-changing environmental conditions prevent thermodynamic equilibrium, and energy barriers may rule out the transition from one conformation to another, potentially more stable one.

At a non-zero temperature, thermodynamic systems are able to occupy non-minimal regions of the landscape according to the distribution:

$$\frac{N_i}{N} = \frac{e^{-E_i / kT}}{\sum_j e^{-E_j / kT}} \tag{1.22}$$

In this equation (also known as the Boltzmann or Maxwell-Boltzmann distribution), Ni is the number of molecules, at equilibrium temperature T, in a state i that has energy Ei; N is the total number of molecules in the system and k is the Boltzmann constant, which plays the same role as the universal gas constant (R) from eq. (1.8) when energies are expressed per mole (R = NA k, where NA is Avogadro's number). If the energy barrier between two minima is low enough, and the temperature is high enough, then the molecules in a system can alternate between multiple states. If the differences between the binding energies (i.e. ∆(∆Gbind)) of two or more conformations are too small to compensate for the energy barriers that separate them, these conformations may exist in the system simultaneously, presenting a phenomenon known as multiple or alternative binding modes.
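To give a feeling for the numbers involved, the equilibrium populations of eq. (1.22) can be computed for a few states. The sketch below is a minimal illustration under stated assumptions: energies are molar (kcal/mol), so the gas constant R replaces k, the two binding-mode energies are hypothetical, and energy barriers between the states are ignored entirely.

```python
import math

R_KCAL = 1.987204e-3  # universal gas constant in kcal/(mol*K)

def boltzmann_populations(energies_kcal, temperature=298.15):
    """Fractional populations N_i/N for states with the given molar energies."""
    beta = 1.0 / (R_KCAL * temperature)
    e_min = min(energies_kcal)            # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies_kcal]
    partition = sum(weights)
    return [w / partition for w in weights]

# Two hypothetical binding modes that differ by 1 kcal/mol
populations = boltzmann_populations([-9.0, -8.0])
print(populations)
```

At room temperature a difference of 1 kcal/mol leaves the less stable mode with roughly 16% of the population, so both modes would be present in appreciable amounts.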

A growing body of data supports the existence of multiple binding modes

of ligands to receptors. These may manifest in the form of a ligand that binds

the same (or similar) protein in different distinct modes, or alternatively,

ligand molecules that share structural similarity may be observed in different

binding modes when bound to the same protein[18, 9, 44, 91, 57]. It is clear

that individual conformations of multiple binding modes, if they exist, may

have a unique contribution to the binding energies or specificity. The program presented in this work, ISE-dock, is capable of producing arbitrarily large populations of near-optimal docking solutions, resulting in efficient sampling of the energy hyperspace and increasing the chances of detecting alternative binding modes.

1.8 Docking techniques

1.8.1 Flexibility in docking programs

The structural and energy considerations that were presented above imply that accounting for flexibility in docking programs is a necessary task. Theoretically, accounting for molecular flexibility in a system that contains N atoms results in 3N degrees of freedom (3 translational degrees of freedom for each atom). This number of degrees of freedom results in a colossal rise in the computational complexity of docking calculations and cannot be treated directly. In order to reduce the size of the solution space, several approaches are taken, either alone or (more frequently) in various combinations. These approaches include explicit flexibility of only small parts of the system, “soft” potentials and low resolution docking, and the use of multiple conformations.

Selective flexibility Among all the internal degrees of freedom that the-

oretically exist in the system, only dihedral torsions are usually taken into

consideration. This is due to the substantially lower energy barriers that

are needed for this type of movement, compared to bond stretching and an-

gle bending[54]. In addition, internal flexibility is usually limited to certain

portions of the interacting molecules. Treating ligand flexibility alone, while keeping the protein rigid, dramatically reduces the combinatorial complexity of a protein-ligand docking program. This approach is very popular. In fact, most modern protein-ligand docking programs are capable of dealing with full ligand flexibility but not with the conformational changes of a protein[95]. The rigidity of the protein is a reasonable approximation in many cases, and it has led to several successes. Nevertheless, accounting for receptor flexibility is a very important step toward improving the process of docking[4, 46, 49, 12]. Najmanovich et al. have shown that in many cases only a few side chains in the active site of a receptor change their conformations during ligand binding[73]. In other cases, hinge-like movements of large portions of the protein occur[89, 90], while the remaining parts of the system stay relatively rigid. These findings allow the user to partially “unfreeze” the protein, while keeping a feasible combinatorial size of the problem. Version 4.0.1 of the program AutoDock takes this approach, by allowing the user to specify the flexible parts of the receptor (side chains only). The ISE-dock program that is presented in this work (and was developed before the publication of AutoDock 4.0.1) takes a similar approach. Hinge-based docking studies have also been reported[89, 90, 78].

Soft potentials Allowing partial inter-penetration of molecules by lowering

the repulsion penalties of VdW interactions is a way to implicitly account

for molecule flexibility in docking simulations. For example, in a work by

Ferrari et al.[19], a modified, softer, Lennard-Jones potential was used in

order to screen large libraries of molecules against T4 lysozyme, a protein

that undergoes small conformational changes when binding different ligands.

Another way to implicitly handle protein flexibility by allowing intermolecular penetration is to use only the protein’s Cα atoms in the first stages of the docking (low resolution docking)[103, 102].

Multiple conformations Flexibility of the interacting molecules may be

simulated by using multiple structures. These structures can be obtained from multiple X-ray structures, ensembles of structures obtained by NMR techniques, and the results of molecular dynamics or other simulations. Three major ways exist to use multiple molecular

structures in protein-ligand docking studies: separate docking of a ligand

into each individual protein structure[19, 94], identifying the conformational

changes and considering their combinations[12], and using energy functions

that consider an energy-weighted or geometry-weighted average of the mul-

tiple structures[46, 70].

One of the important advantages of using multiple conformations is that,

unlike the rest of the previously mentioned methods, it easily allows the

movement of side chains and the backbone to be considered. In addition,

point mutations or even completely different proteins may be considered in

a single docking study.

1.8.2 Search algorithms

Docking algorithms can be roughly divided into two categories: those explor-

ing the energy landscape of the system and those (re)constructing the ligands

in the binding pockets of the macromolecule. The first class is represented by

various implementations and combinations of Simulated Annealing (SA), Genetic Algorithm (GA), molecular dynamics, geometric complementarity matching, etc. Examples of implementations of this approach are Dock3.5 [53], AutoDock [33, 71, 70], and GOLD [42].

The second class of algorithms (represented by FlexX [49]) involves placing one or more base fragments of the ligand into the binding pockets of the protein and reconstructing the remainder of the molecule according to predefined criteria. This approach is much faster and gives good results in cases where the binding site has a deeply buried pocket with the ability to make hydrogen bonds. However, if the binding pocket is shallow, or the main contribution to binding comes from hydrophobic interactions, the placement of the base fragments and the further reconstruction of the ligand become unreliable[107].

Genetic algorithms Genetic Algorithms (GA’s) are a general-purpose fam-

ily of optimization techniques that mimic the process of evolution[45]. During

the optimization process, an instance of the problem is encoded using linear

representation (chromosome). In the first step multiple random configura-

tions (individuals) are generated, and a fitness function is calculated for each

of them. During the subsequent steps, several operators may be applied

to some of the individuals, such as point mutation in the chromosome or

cross-over exchange of the information encoded in chromosomes between two

individuals. The fitness function is used to decide which individuals are allowed to survive to the next iteration and to produce offspring. GOLD (Genetic Optimization for Ligand Docking)[42] was the first docking program to use GA. GOLD performs automated docking with full acyclic ligand flexibility, partial cyclic ligand flexibility and partial protein flexibility in the neighborhood of the binding site. The location of the site must be provided by the user (with a possibility of using other software). Another GA-based docking program, AutoDock [70], uses local search techniques to modify the encoding chromosome, and to propagate the optimized “genetic information” to the next generations. For a detailed description of AutoDock and its algorithm, see Section 2.2.

Monte Carlo simulated annealing Monte Carlo Simulated Annealing (SA) techniques involve random alteration of the system that undergoes optimization. If the change creates a conformation with a lower (better) value of the scoring function, then the new structure is accepted for the next steps. If, on the other hand, the energy increases, the new structure is accepted with a temperature-dependent probability $P = e^{-(E_t - E_{t-1})/(k_B T)}$, where Et−1 and Et are the energy values before and after the random change, kB is the Boltzmann constant, and T is the temperature. During the SA process, the temperature T is reduced according to a predefined scheme (cooling schedule), resulting in less permissive acceptance criteria. The MCDOCK [1] program uses SA to solve the docking problem. The conformations are generated using geometry-based docking and then energy-based docking is performed.
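The acceptance rule and the cooling schedule described above can be sketched in a few lines of Python. This is a generic illustration on a hypothetical one-dimensional energy function, not the procedure used by MCDOCK or AutoDock; the geometric cooling factor, step size, and all function names are assumptions of this sketch, and kB is set to 1 (reduced units).

```python
import math
import random

def accept_move(e_old, e_new, temperature, k_b=1.0, rng=random):
    """Metropolis criterion: downhill moves always pass, uphill ones with Boltzmann probability."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / (k_b * temperature))

def simulated_annealing(energy, x0, step=0.5, t_start=5.0, t_end=1e-3,
                        cooling=0.95, moves_per_t=50, seed=0):
    """Minimise a 1D energy function; returns the best state seen and its energy."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            x_new = x + rng.uniform(-step, step)   # random alteration of the system
            e_new = energy(x_new)
            if accept_move(e, e_new, t, rng=rng):
                x, e = x_new, e_new
                if e < best_e:
                    best_x, best_e = x, e
        t *= cooling                               # geometric cooling schedule
    return best_x, best_e

def toy_energy(x):
    """Hypothetical rugged landscape with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)

best_x, best_e = simulated_annealing(toy_energy, x0=4.0)
print(best_x, best_e)
```

At high temperature almost every move is accepted, which lets the search cross barriers; as T decreases, uphill moves are accepted less and less often and the search settles into a minimum.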

FlexX FlexX is an incremental docking program[85]. It binds flexible lig-

ands into the binding pockets of a rigid receptor. FlexX involves three

steps: selection of base fragments in the ligand molecule, placement of these

fragments in the active site of the receptor, and incremental reconstruction

of the whole ligand. The reconstruction is made fragment by fragment so

that the energy of the complex is locally minimal. For a better sampling,

the algorithm is allowed to diverge to various energetically favorable regions.

This algorithm saves only a limited number of the best scoring partial solutions to continue to the next round of ligand reconstruction. Because of this greedy selection, flexible docking is likely to be more demanding on the quality of the scoring function used to evaluate (partial) docking solutions[47]. FlexX-Ensemble (formerly known as FlexE)[12] introduces a new feature to the FlexX algorithm. FlexX-Ensemble takes into account the flexibility of the receptor by using a predefined ensemble of receptor conformations. The ensemble may be derived from multiple X-ray structures or homology modeling, or generated by molecular dynamics simulations. The protein is dissected into a constant (rigid) part and several flexible parts. The flexible parts may be combined to create conformations that are not observed in the original ensemble of structures.

Internal Coordinate Mechanics Internal Coordinate Mechanics (ICM)

performs global optimization of a flexible ligand in the receptor field[2]. This

algorithm is based on a large number of random moves with gradient local

minimization. A history mechanism is used to escape local minima.

Computer vision techniques Image recognition techniques were described

in a review by Nussinov and Wolfson[77]. These methods have been applied to rigid and flexible docking. While the shape complementarity of molecules

that were crystallized together (bound docking) is expected to be good, the

docking of unbound molecules is less trivial.

1.8.3 Evaluating docking programs

Comparing docking programs is not a trivial task. Many criteria to perform

this task have been proposed and used in the literature[34, 13, 110, 44, 25].

The most common criterion to assess the “correctness” of a docked complex, relative to the experimentally determined structure, is to compare the Cartesian coordinates of the solution and the reference structure. This comparison is reported as the Root Mean Squared Deviation (RMSD) of atoms:

$$RMSD = \sqrt{\frac{\sum_{i,j}^{N} \left[ (\Delta x_{ij})^2 + (\Delta y_{ij})^2 + (\Delta z_{ij})^2 \right]}{N}} \tag{1.23}$$

Lack of specificity, inability to differentiate between more and less important regions in a complex, and the need for a reference structure are several pitfalls of this measure[13]. Nevertheless, RMSD is the measure of choice in the vast majority of docking studies. It is widely accepted to treat solutions with RMSD values below 2.0 Å as successful ones. Other evaluation methods include modified deviation functions[1] and accounting for correct positioning of intermolecular contacts[50]. An additional approach is to screen a large library of compounds of which only a few are known to bind efficiently to a molecular target. In this type of test, the enrichment factor of correctly recognized binding molecules is checked[84, 24, 26].
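The RMSD criterion of eq. (1.23) translates directly into code. The following minimal Python sketch assumes that the docked pose and the reference list the same atoms in the same order and already share the receptor's coordinate frame, so no superposition is performed; it also ignores ligand symmetry. The coordinates and function names are invented for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered lists of (x, y, z) coordinates, without superposition."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

def is_successful(pose, reference, cutoff=2.0):
    """Apply the customary 2.0 Å success threshold."""
    return rmsd(pose, reference) < cutoff

reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
pose = [(x + 1.0, y, z) for x, y, z in reference]   # rigid shift of 1 Å along x

print(rmsd(pose, reference), is_successful(pose, reference))
```

Because every atom contributes equally, a rigid shift of 1 Å yields an RMSD of exactly 1 Å regardless of which part of the ligand matters most for binding, which illustrates the lack of specificity noted above.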

Theoretically, only the lowest (best) scored docking pose of a ligand needs

to be examined. But there are several factors that require treating multiple

docking solutions. Among them are mobility of the molecules, inaccuracy of

scoring functions, and the fact that molecules are not always found in their

global minimum. Due to these reasons, it is customary to check the best

available deviation from the experimentally known structure among several

solutions that were provided by a docking program. The comparison of

multiple docking solutions is thought to downscale the dependency of the

results on a scoring function, and to better reflect the ability of the docking

algorithm[34].

1.9 Open problems and issues

Protein-ligand docking is a valuable tool in the processes of drug discovery

and lead optimization and during the basic study of intermolecular inter-

actions. Protein-ligand docking was successfully used in a wide range of

problems, but despite the plethora of existing solutions, the docking problem

is far from being solved. Many of the existing programs tend to converge

around a certain local minimum or not to converge at all. Sometimes, bio-

logically irrelevant solutions are produced.

Most of the existing programs are capable of proposing multiple docking

poses, but the time needed by many of them to do so increases with the size

of the output population.

Protein flexibility is another and, perhaps, the most difficult and urgent

challenge in the protein-ligand docking field. Other degrees of freedom that

are very important, but are hardly addressed by the existing docking pro-

grams are the position of mediating water molecules[85], co-factor position-

ing, electron transfer and protonation states of the interacting molecules.

Any docking program is tightly connected to at least one scoring function.

Although the development of a scoring function is beyond the scope of this

work, one should remember that the choice of such a function has a direct

impact on the docking program performance.

Chapter 2

Methods

2.1 Energy function

An ideal scoring function in a protein-ligand docking program would combine

speed and the ability to distinguish quantitatively between native and non-

native poses. Developing a scoring function is beyond the scope of this work.

ISE-dock uses AutoDock’s grid-based scoring function[70]. AutoGrid, a part of the AutoDock suite, pre-calculates grids of Van der Waals, electrostatic, and solvation interactions of a biomolecular target, based on atom types. The terms of the scoring function used in AutoDock and, consequently, in ISE-dock are:

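$$\begin{aligned} \Delta G = {} & \Delta G_{VdW} \sum_{i,j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right) \\ & + \Delta G_{hbond} \sum_{i,j} E(t) \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} + E_{hbond} \right) \\ & + \Delta G_{elect} \sum_{i,j} \frac{q_i q_j}{\varepsilon(r_{ij}) r_{ij}} \\ & + \Delta G_{sol} \sum_{i_C, j} S_i V_j e^{-r_{ij}^2 / 2\sigma^2} \\ & + \Delta G_{tor} N_{tor} \end{aligned} \tag{2.1}$$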

The five ∆G coefficients in this equation are empirically determined using linear regression analysis of a set of 30 protein-ligand complexes with known binding constants and solved 3D structures. The first and the third terms of the above equation are standard expressions for VdW and electrostatic interactions, respectively. In the second (H-bond) term, E(t) is a directional weight based on the H-bond angle t, and Ehbond is the estimated average energy of hydrogen bonding between water molecules and a polar atom. The unfavorable entropy effect of ligand binding (the fifth term) is a function of the number of sp3 bonds, Ntor. The solvation term of eq. (2.1) considers fragmental volumes of only carbon atoms in the ligand (i) and all atom types in the receptor (j). Parametrization of the carbon atoms distinguishes between aliphatic and aromatic atom types. The constant coefficients in equation (2.1) (Aij, Bij, Cij and Dij) are specific for each pair of atom types.

During the docking process, the program evaluates any position of the

ligand by interpolating over those grids for the protein-ligand interaction

of each atom of the ligand according to its current position and adding the

internal conformational energy of the ligand. By default, the docking box has the dimensions 22.5 Å × 22.5 Å × 22.5 Å with a spacing of 0.375 Å between grid points. The version of AutoDock used in this work (3.0.5) supports eight atom types: C (aliphatic carbon), A (aromatic carbon), N (nitrogen), O (oxygen), S (sulfur), H (hydrogen), X and M (“spare” types for additional atoms such as metal, halogen, phosphorus etc.). It is customary[106, 98, 17] to replace the original AutoDock parameters for zinc. We used the following parameters, which lead to more accurate energy calculations[39]: radius: 0.87 Å; well depth: 0.35 kcal/mol; and charge: +0.95e.

The use of grid-based scoring functions has two important consequences: first, the simulation speed is increased significantly and second, the protein structure cannot vary during the docking process.
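The grid lookup described in this section can be illustrated with a short Python sketch. It assumes trilinear interpolation between the eight grid points surrounding each ligand atom, one pre-calculated grid per atom type stored as nested lists, and it omits the scaling of the electrostatic grid by atomic charge and the ligand's internal energy. The data layout and function names are hypothetical and do not reproduce AutoDock's source code.

```python
import math

def trilinear(grid, origin, spacing, point):
    """Interpolate a value at `point` from a regular 3D grid indexed as grid[ix][iy][iz]."""
    idx, frac = [], []
    for p, o in zip(point, origin):
        u = (p - o) / spacing
        i = int(math.floor(u))
        idx.append(i)
        frac.append(u - i)
    (ix, iy, iz), (fx, fy, fz) = idx, frac
    if (min(ix, iy, iz) < 0 or ix + 1 >= len(grid)
            or iy + 1 >= len(grid[0]) or iz + 1 >= len(grid[0][0])):
        raise ValueError("point lies outside the interpolable region of the grid")
    value = 0.0
    for dx, wx in ((0, 1.0 - fx), (1, fx)):
        for dy, wy in ((0, 1.0 - fy), (1, fy)):
            for dz, wz in ((0, 1.0 - fz), (1, fz)):
                value += wx * wy * wz * grid[ix + dx][iy + dy][iz + dz]
    return value

def ligand_grid_energy(atoms, grids, origin, spacing=0.375):
    """Sum interpolated grid values; `atoms` is a list of (atom_type, (x, y, z))."""
    return sum(trilinear(grids[atom_type], origin, spacing, xyz)
               for atom_type, xyz in atoms)
```

Each energy evaluation then costs a fixed number of array lookups per ligand atom, independent of the size of the protein, which is the source of the speed-up; the price is that the grids describe one fixed receptor structure.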

2.2 AutoDock docking program

The AutoDock program[32, 33, 70] served as a source code base and a

reference point for ISE-dock performance. The source code of AutoDock

(version 3.0.5) was obtained from the authors. AutoDock performs flexible

ligand – rigid protein docking using one of the following algorithms: Simulated Annealing (SA), Genetic Algorithm (GA) and Lamarckian Genetic Algorithm (LGA). LGA is a hybrid optimization algorithm that deviates from GA and has been shown[33] to give the best performance of the three. AutoDock was the most cited docking program in the scientific literature during the years 2001 – 2005[95].

2.2.1 Lamarckian Genetic Algorithm

Genetic Algorithm (GA) is a general class of optimization algorithms that exists in several variants. The version of GA that is used in AutoDock is described as Algorithm 2.1.

Algorithm 2.1 Genetic Algorithm used in AutoDock [70]

Require: string representation of a problem (chromosome)
 1: create random population P
 2: repeat
 3:     mate random pairs of individuals (crossover)
 4:     perform random mutations
 5:     for all i ∈ P do
 6:         evaluate i
 7:     end for
 8:     sort P according to the scoring function
 9:     select best individuals to survive to the next iteration
10: until stopping criteria are met
11: return best individual

In order to be optimized by GA, an instance of a problem is encoded into

a flat string (chromosome), which may be subjected to several GA operators

and then scored. The GA operators include crossover of two chromosomes

and point mutations. These operators (lines 3 and 4 in Algorithm 2.1)

are applied randomly with user-defined probability. The optimization

terminates if no improvement in scoring function is achieved over a

number of generations or after a specified number of generations.

The LGA differs from the canonical GA by an additional step of local

optimization. The addition of local optimization ensures that an acquired

adaptation of an individual promotes changes in its chromosomes that in

turn pass to the next generations.

The LGA is described using pseudocode as Algorithm 2.2.

Algorithm 2.2 Lamarckian Genetic Algorithm used in AutoDock [70]

Require: string representation of a problem (chromosome)
 1: create random population P
 2: repeat
 3:   mate random pairs of individuals (crossover)
 4:   perform random mutations
 5:
 6:   select sub population S ⊆ P to undergo local optimization
 7:   for all individual i ∈ S do
 8:     perform local optimization of i
 9:     modify i’s chromosome to reflect the optimized state
10:   end for
11:
12:   for all i ∈ P do
13:     evaluate i
14:   end for
15:   sort P according to the scoring function
16:   select best individuals to survive to the next iteration
17: until stopping criteria are met
18: return best individual

Here, an instance is encoded in a chromosome (genotype), which in turn is

translated to a phenotype. At the initial stage, the phenotype is identical to

the genotype. As in the basic GA, several operators are applied randomly on

the population with the predefined probabilities (Algorithm 2.2, lines 3 and

4). At the next stage, several individuals are randomly selected (Algorithm

2.2, line 6). These individuals undergo local optimization of the phenotype.

The optimized phenotype is translated back to a new genotype, which then

propagates to the next generations.
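The loop of Algorithm 2.2 can be illustrated by a minimal Python sketch. It is not the AutoDock implementation: the toy objective `energy`, the hill-climbing `local_search` (a stand-in for the local optimizer), the real-valued chromosome, the elitist survivor selection and all numeric defaults are assumptions made only for illustration. The point of interest is the Lamarckian step, in which the locally optimized individual replaces its own chromosome.

```python
import random

def energy(x):
    # Hypothetical toy scoring function (lower is better); stands in for the docking energy.
    return sum((xi - 1.0) ** 2 for xi in x)

def local_search(x, rng, steps=20, sigma=0.1):
    # Simple stochastic hill climbing; a placeholder for the real local optimizer.
    best, best_e = list(x), energy(x)
    for _ in range(steps):
        cand = [xi + rng.gauss(0.0, sigma) for xi in best]
        e = energy(cand)
        if e < best_e:
            best, best_e = cand, e
    return best

def lga(n_genes=4, pop_size=30, generations=50, p_cross=0.8, p_mut=0.02,
        p_local=0.06, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size // 2):
            a, b = (list(p) for p in rng.sample(pop, 2))
            if rng.random() < p_cross:            # one-point crossover
                cut = rng.randrange(1, n_genes)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children += [a, b]
        for c in children:                         # point mutations
            for k in range(n_genes):
                if rng.random() < p_mut:
                    c[k] += rng.gauss(0.0, 1.0)
        for idx, c in enumerate(children):         # Lamarckian step
            if rng.random() < p_local:
                # The optimized state is written back into the chromosome.
                children[idx] = local_search(c, rng)
        # Evaluate, sort and keep the best individuals (elitist selection).
        pop = sorted(pop + children, key=energy)[:pop_size]
    return pop[0]
```

Because the survivors are chosen from parents and children together, the best energy in this sketch can never get worse from one generation to the next; the selection scheme of the actual program may differ.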

AutoDock uses Solis Wets [93] local optimization. The Solis Wets (SW) local

optimization algorithm is a greedy local search heuristic. During

the SW search, random moves along all the axes of the solution space are

performed until an improvement is found. The variance of the random moves

is influenced by the frequency with which improving moves are found. The

pseudocode of SW is provided as Algorithm 2.3.

Algorithm 2.3 Pseudocode of Solis Wets local search algorithm

initialize variance
numberOfSuccesses = numberOfFailures = 0
repeat
  perform random move using variance
  calculate currentEnergy
  if currentEnergy < previousEnergy then
    numberOfFailures = 0
    increase numberOfSuccesses
    if numberOfSuccesses > threshold then
      numberOfSuccesses = 0
      expand variance
    end if
  else
    numberOfSuccesses = 0
    increase numberOfFailures
    if numberOfFailures > threshold then
      numberOfFailures = 0
      contract variance
    end if
  end if
until stopping criteria are met

There are no robust stopping rules for the SW algorithm, since convergence

is not guaranteed. In AutoDock, the SW search step stops after the

variance of the random moves drops below a threshold or when a specified

number of optimization steps is reached. The original SW algorithm uses

random steps with equal variances for each degree of freedom. In an attempt

to improve the optimization results, AutoDock allows a separate variance

of the random moves to be kept for each degree of freedom. This variant

of the SW method is referred to as Pseudo Solis Wets (PSW) local search.
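The adaptive-variance scheme of Algorithm 2.3 can be written as a short Python sketch. It follows only the steps listed in the pseudocode; the function name `solis_wets`, the Gaussian moves, the expansion and contraction factors (2.0 and 0.5) and the default thresholds are assumptions for illustration and are not taken from AutoDock.

```python
import random

def solis_wets(f, x0, rho=1.0, max_steps=300, succ_limit=4, fail_limit=4,
               rho_min=1e-2, seed=0):
    """Minimize f starting from x0; rho is the spread of the random moves."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    successes = failures = 0
    for _ in range(max_steps):          # stop after a fixed number of steps ...
        if rho < rho_min:               # ... or when the spread becomes too small
            break
        cand = [xi + rng.gauss(0.0, rho) for xi in x]
        fc = f(cand)
        if fc < fx:                     # improving move: accept it
            x, fx = cand, fc
            failures = 0
            successes += 1
            if successes > succ_limit:
                successes = 0
                rho *= 2.0              # expand variance
        else:
            successes = 0
            failures += 1
            if failures > fail_limit:
                failures = 0
                rho *= 0.5              # contract variance
    return x, fx
```

In this sketch a single `rho` is shared by all degrees of freedom, as in the original SW method; a PSW-like variant would keep a list of spreads, one per coordinate.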

2.2.2 Problem representation

In AutoDock, the ligand’s pose relative to the protein and its internal

conformation are encoded using a vector of real values. The first three values

in the vector define the translation of the ligand. The rotation is encoded

using quaternion notation. This notation represents rigid body rotation using

a unit vector (represented by three numbers) and an angle of rotation around

this vector. Thus, the three degrees of freedom of rigid body rotation are

represented by four degrees of freedom in AutoDock. The rest of the

values in the chromosome vector represent the dihedral angles of the ligand’s

rotatable bonds.
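As an illustration of this encoding, the following Python sketch decodes such a vector into a translation, a rotation matrix and a list of torsions. The gene order (three translation values, three axis components, one angle in radians, then the torsions) and the function names are assumptions made for the example; the rotation matrix is built with the standard Rodrigues formula.

```python
import math

def axis_angle_to_matrix(axis, angle):
    # Normalize the axis, then apply the Rodrigues rotation formula.
    ax, ay, az = axis
    n = math.sqrt(ax * ax + ay * ay + az * az)
    ax, ay, az = ax / n, ay / n, az / n
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [t * ax * ax + c,      t * ax * ay - s * az, t * ax * az + s * ay],
        [t * ax * ay + s * az, t * ay * ay + c,      t * ay * az - s * ax],
        [t * ax * az - s * ay, t * ay * az + s * ax, t * az * az + c],
    ]

def decode(chromosome):
    # Assumed layout: [tx, ty, tz, axis_x, axis_y, axis_z, angle, torsion_1, ...]
    translation = chromosome[0:3]
    rotation = axis_angle_to_matrix(chromosome[3:6], chromosome[6])
    torsions = chromosome[7:]
    return translation, rotation, torsions

def place(point, translation, rotation):
    # Rotate a ligand atom about the origin, then translate it.
    rotated = [sum(rotation[r][k] * point[k] for k in range(3)) for r in range(3)]
    return [rotated[r] + translation[r] for r in range(3)]
```

For example, a quarter turn about the Z axis followed by a translation of (1, 2, 3) moves the point (1, 0, 0) to (1, 3, 3).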

Since AutoDock uses a grid-based energy function, the receptor is rep-

resented by a set of pre-calculated grid maps.

During the docking process, the encoding vector (genotype) is gener-

ated and translated to molecule coordinates (phenotype). A local search

is randomly applied with user-defined probability. The applied local search

changes the coordinates. This change is translated to respective changes of

the genotype.

All the algorithms used by AutoDock result in a single optimized pop-

ulation. Frequently, more than one solution is desired (as explained in Chap-

ter 1). In this case, the program may be configured to perform the docking

procedure several times. The total time needed to produce multiple docking

solutions is proportional to the number of desired structures.

2.3 ISE-dock program

The ISE-dock program was implemented as a set of added and modified

classes in AutoDock source code, and uses its energy function. As in the

original AutoDock application, molecular flexibility is treated by allow-

ing changes in dihedral angles. Our representation of rigid body rotation

differs slightly from the one that is implemented in the original program:

while AutoDock encodes rotations using quaternions, ISE-dock uses sub-

sequent rotations around the X, Y and Z axes, as this option provided better

results in preliminary experiments with ISE. ISE simulations produce large

populations of docking poses, which is one of the standard results of applying

this algorithm to any problem. The number of energy evaluations performed

by the program is not affected by the size of the docking population. Since

energy evaluations account for more than 85% of the CPU time, the time

needed to complete the docking is practically independent of the number of

docking solutions that the program produces. The number of stored poses

was limited to 4096 (2^12), a limit dictated mainly by the available space

on our hard disks.

2.3.1 Iterative Stochastic Elimination algorithm

The Iterative Stochastic Elimination algorithm (ISE) is described as pseu-

docode in Algorithm 2.4.

This is a general optimization algorithm that can be applied to any prob-

lem described by independent variables and a set of discrete values for each

variable. In the case of ISE-dock, the variables are: translation (3 vari-

ables), rotation (3 variables) and bond torsion angles (one for each rotatable

bond). The algorithm begins by constructing a matrix that contains, for

each degree of freedom, a set of all possible values. This matrix is referred to

as the possibilities pool. Two terms are required with respect to the possibilities

pool: problem size (PS) and pool depth (PD). Problem size is defined as

the total number of all possible combinations that can be generated from the

pool. Pool depth is the maximum number of remaining values among all the

variables that define the problem.

\[ PS = \prod_{i=1}^{N_{variables}} n_i \tag{2.2} \]

\[ PD = \max_{1 \le i \le N_{variables}} n_i \tag{2.3} \]

where $N_{variables}$ is the number of variables and $n_i$ is the number of possible

values for the $i$th variable.
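The two quantities are easily computed for a concrete pool. In the following Python sketch the pool is represented as a dictionary that maps each variable to its list of remaining values; this representation and the variable names are assumptions made for the example.

```python
import math

def problem_size(pool):
    # Eq. 2.2: product of the numbers of remaining values over all variables.
    return math.prod(len(values) for values in pool.values())

def pool_depth(pool):
    # Eq. 2.3: the largest number of remaining values of any single variable.
    return max(len(values) for values in pool.values())

# Hypothetical pool with three variables.
pool = {
    "translation_x": [0.0, 0.5, 1.0, 1.5],
    "rotation_x": [0, 60, 120, 180, 240, 300],
    "torsion_1": [-60, 60, 180],
}
```

For this pool the problem size is 4 × 6 × 3 = 72 and the pool depth is 6.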

During the first phase (referred to as the elimination phase), a large number of

conformations is generated. The conformations are generated by randomly

picking a single value from the pool, and assigning it to the respective vari-

able.

Algorithm 2.4 Iterative Stochastic Elimination Algorithm

Require: problem represented as a set of variables and possible discrete values
 1: generate pool
 2: initialize population
 3: while size(pool) > threshold do
 4:   generate sample S of s random configurations
 5:   for all i ∈ S do
 6:     perform local optimization with probability P
 7:     score_i = evaluate(i)
 8:     if (size(population) < outputSize) or (score_i < score_max) then
 9:       add i to population
10:       truncate population to outputSize
11:     end if
12:   end for
13:   sort S
14:   L = low energy part of S
15:   H = high energy part of S
16:   for all variable var ∈ pool do
17:     for all value val ∈ pool_var do
18:       observedLow = number of occurrences of (var, val) ∈ L
19:       ratio = expectedLow/observedLow
20:       if ratio > threshold then
21:         rank = ratio/threshold
22:         mark (var, val) for elimination with rank
23:       end if
24:       observedHigh = number of occurrences of (var, val) ∈ H
25:       ratio = observedHigh/expectedHigh
26:       if ratio > threshold then
27:         rank = threshold/ratio
28:         mark (var, val) for elimination with rank
29:       end if
30:     end for
31:     eliminate up to e% values with highest rank from pool_var
32:   end for
33: end while
34:
35: perform exhaustive search of pool, add best scored configurations to population
36: return population

The randomly generated conformations have a certain probability (0.06

by default) to undergo local optimization. The main purpose of the local

optimization step is to solve clashes and unfavorable conformations that are

caused by the discrete nature of the variable values (translation, rotation

and torsions). Unlike local optimization in the Lamarckian Genetic Algorithm

of AutoDock, local optimization in ISE-dock does not affect the variables in the

possibilities matrix, only the energy values that are associated with them.

The sample is evaluated and sorted. The sorted sample is divided into three

uneven parts: subsets of lowest, highest and intermediate energy conforma-

tions. The intermediate subset is not used in the analysis. A particular

value of a variable may be discarded from the pool of values if one of the

two following criteria is met. The first criterion is the occurrence of a value

in the higher energy subset with significantly higher frequency than is expected

under the random distribution assumption. Alternatively, a value

may be eliminated if it appears in the lower energy subset with lower than

random frequency. Not more than a user-specified portion of values may

be discarded at each iteration (the default value is 10%). The elimination

process is performed iteratively until the number of possible conformations

enables exhaustive search in a feasible time. During the exhaustive phase,

the solution candidates have a probability of P = 0.6 (default value) to un-

dergo local optimization. Note that local optimization probability in this

exhaustive phase is ten times larger than the probability for local optimization

during the elimination phase. During the whole process, a list of the

best seen conformations is kept and updated.
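The elimination and exhaustive phases described above can be sketched in Python on a toy problem. This is a simplified illustration and not the ISE-dock code: local optimization is omitted, the low and high energy subsets are taken as equal fractions of the sorted sample, a single combined rank is used for both elimination criteria, at least one value per variable may be removed per iteration even when 10% rounds down to zero, and the loop stops if nothing can be eliminated. The function name `ise`, its parameters and their defaults are assumptions.

```python
import itertools
import math
import random

def ise(pool, score, sample_size=200, frac=0.2, ratio_threshold=1.5,
        max_elim=0.10, exhaustive_limit=500, keep=10, seed=0):
    rng = random.Random(seed)
    pool = {var: list(vals) for var, vals in pool.items()}   # work on a copy
    names = list(pool)
    best = {}                                                # configuration -> score

    def record(cfg, s):
        # Keep the list of best seen configurations, truncated to `keep`.
        best[cfg] = s
        if len(best) > keep:
            del best[max(best, key=best.get)]

    def size():
        return math.prod(len(pool[v]) for v in names)

    while size() > exhaustive_limit:                         # elimination phase
        sample = [tuple(rng.choice(pool[v]) for v in names)
                  for _ in range(sample_size)]
        scored = sorted(((score(c), c) for c in sample), key=lambda sc: sc[0])
        for s, c in scored:
            record(c, s)
        k = max(1, int(frac * sample_size))
        low, high = scored[:k], scored[-k:]
        eliminated = 0
        for j, var in enumerate(names):
            expected = k / len(pool[var])                    # random expectation
            ranked = []
            for val in pool[var]:
                obs_low = sum(1 for _, c in low if c[j] == val)
                obs_high = sum(1 for _, c in high if c[j] == val)
                r_low = expected / obs_low if obs_low else float("inf")
                r_high = obs_high / expected
                r = max(r_low, r_high)
                if r > ratio_threshold:
                    ranked.append((r, val))
            ranked.sort(key=lambda rv: rv[0], reverse=True)
            # Never remove the last value; cap removals per iteration.
            n_max = min(len(pool[var]) - 1,
                        max(1, int(max_elim * len(pool[var]))))
            for _, val in ranked[:n_max]:
                pool[var].remove(val)
                eliminated += 1
        if eliminated == 0:
            break                                            # safeguard of this sketch

    if size() <= exhaustive_limit:                           # exhaustive phase
        for cfg in itertools.product(*(pool[v] for v in names)):
            record(cfg, score(cfg))
    return sorted(best.items(), key=lambda kv: kv[1])
```

A possible call is `ise({f"x{i}": list(range(10)) for i in range(4)}, lambda cfg: sum((v - 3) ** 2 for v in cfg))`, which returns up to ten (configuration, score) pairs sorted by score. Since the elimination is stochastic, the global optimum is not guaranteed to survive in such a small run.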

The local optimization steps, the limit of discarded values per iteration

and the fact that the best seen conformations are collected during the elimi-

nation phase are new to this implementation of ISE and were not present in

previously published ones[30, 31].

The sample size and the sizes of lower- and higher-energy subsets depend

on the current pool depth (eq. 2.3) and are user configurable, as is the

required ratio between the expected and the observed occurrences of a value

(Algorithm 2.4, lines 20 and 26). The maximal fraction of eliminated values

for each variable and the probability of local search during the elimination

and exhaustive phases are also determined by the user.

2.3.2 Problem representation

As in AutoDock, the ligand’s configuration is encoded by real values that

define its position in space (translation), its orientation (rotations about

axes), and the internal rotations around single bonds. Unlike in AutoDock,

we have decided to use three degrees of freedom to describe the rotations of

the ligand around the principal axes. In our implementation, the rotation is

defined by sequential rotations of the molecule around the X, Y and Z axes

(in this order).
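The composition of the three rotations can be made explicit with a short Python sketch. It assumes rotations about fixed (laboratory) axes acting on column vectors, so that rotating first about X, then Y, then Z corresponds to the matrix product Rz·Ry·Rx; whether ISE-dock uses fixed or body axes is not stated here, and the function names are hypothetical.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation_xyz(ax, ay, az):
    # X is applied first, so it is the rightmost factor.
    return matmul(rot_z(az), matmul(rot_y(ay), rot_x(ax)))

def apply(rotation, point):
    return [sum(rotation[r][k] * point[k] for k in range(3)) for r in range(3)]
```

The order matters: with quarter turns about X and Z, the point (0, 1, 0) is carried to (0, 0, 1), whereas applying the Z rotation first would carry it to (-1, 0, 0).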

2.3.3 Protein flexibility

Accounting for protein flexibility is a very important task, which, until recently,

was ignored by the majority of protein-ligand docking programs[34,

97]. Proper inclusion of flexibility (as a set of rotations around side-chain

Figure 2.1: “Tearing off” atoms to represent side chain flexibility using phenylalanine as an example. Dummy atoms are marked by the letter “D” in their names. The N, Cα and Cβ atoms on the receptor molecule overlap with their respective dummy counterparts.

and main-chain bonds) requires extensive changes to the current source code

of ISE-dock and is thus beyond the scope of this work. Nevertheless, before

further work is done, it is important to assess the ability of ISE to cope with

this problem. Docking experiments that account for protein flexibility that

are presented in this work serve as a demonstration of ISE capabilities.

Side chain flexibility in ISE-dock

As was previously described in Section 2.1 (page 35), the grid-based energy

function implies that the entire protein remains frozen during the docking

simulation. To overcome this limitation I have decided to “transfer” selected

atoms from the protein to the ligand, as the ligand may be treated with flex-

ibility. This is a technical choice to overcome that limitation of the original

program. Figure 2.1 describes the process. First, a set of flexible residues

is identified using previous knowledge. Then, for each flexible residue, all

the side-chain atoms, except for Cα, Cβ and the adjacent hydrogen atoms,

are deleted. The resulting structure (the constant part of the protein) is

used in all further calculations, mainly for calculating the interactions on the

grids. In order to include the flexible part of the receptor in the docking cal-

culations, the original coordinates of the side chains of the flexible residues

are copied to the ligand molecule. In addition, the backbone nitrogen atom is

also copied. A dummy bond connects the residue’s N atom to an

atom from the ligand. We now have three atoms that are common to the

ligand and the protein. These overlapping atoms serve as reference points

in defining the side chain’s torsions: the atoms N, Cα, Cβ and Cγ define χ1;

Cα, Cβ, Cγ and Cδ define χ2, and so on. In order to prevent a clash penalty

due to the overlap, the common atoms on the ligand’s side are marked

as dummy atoms. Dummy atoms are ignored during energy calculations.

All the atoms that originally belonged to the receptor molecule are excluded

from the operations of translation and rotation, thus only the dihedral angles

change during the ISE search.
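Since the side-chain torsions are defined by four consecutive atoms, a χ angle can be measured from their coordinates. The following Python sketch uses the standard atan2 formula for a dihedral angle; the function names and the example coordinates are hypothetical and serve only as an illustration.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p1, p2, p3, p4):
    """Dihedral angle in degrees defined by four points, e.g. N, CA, CB, CG for chi1."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

With four points placed at (1, 0, 0), (0, 0, 0), (0, 0, 1) and (0, 1, 1), the function returns 90 degrees; an eclipsed arrangement gives 0 and an anti arrangement gives 180 in absolute value.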

The transfer of atoms from the receptor to the ligand breaks a cova-

lent bond between Cβ and Cγ. After the transfer, Cγ is considered a part

of another molecule. This means that Cβ—Cγ interactions are interpreted

as intermolecular ones. Nevertheless, the distance between the two atoms

remains the distance of covalent C—C bond. To prevent the large energy

penalty that would have been caused by this misperception, Cγ atom is also

marked as dummy. This measure means that Cγ atom is not included in

any energy calculation. The transfer of atoms and the exclusion of Cγ from energy

calculations inevitably lead to some loss of accuracy. To test the validity of the

“tearing off” approach, I have “docked” only the side chains, with a ligand

molecule fixed in its crystallographic position. In these experiments (data

not shown), the RMSD of the side chain atoms with respect to their observed

position was below 0.3 Å.

Backbone flexibility in ISE-dock

The “tearing off” approach that was undertaken to include side chain flexi-

bility of proteins isn’t suitable for flexibility of the backbone due to various

technical limitations that are posed by the original code of AutoDock. In

this work, multiple protein conformations were used as a “target bank” for

the docking process. The multiple conformations of the protein were gener-

ated using the Iterative Stochastic Elimination algorithm[76, 86]. The ligand

is docked separately to each of the generated protein conformations, which

is kept frozen as usual. The results are combined according to the energy

values.

2.4 Rigid protein docking

2.4.1 LGA docking

LGA docking has been proposed to be superior to other methods in Auto-

Dock [70]. We have used the original (unmodified) AutoDock program to

obtain the results for LGA. As already mentioned, we replaced the default

Pseudo Solis Wets local optimization with the original Solis Wets algorithm.

We have also changed the default solution size from 10 to 35 in order to allow

AutoDock to perform as many energy evaluations (≈ 8.8 × 10^6) as were

performed on average by ISE (≈ 8.6 × 10^6).

2.4.2 The data set

We used the public portion of the test set used by Perola et al.[84] in their

comparison of docking algorithms. The original test set consisted of 150

pharmaceutically relevant protein-ligand structures, of which 100 are pub-

licly available. The preparation process of these structures was performed by

the Perola group[84]. We converted these files to mol2 format. Protein struc-

tures were kept in their bound conformation and were assigned charges from

the Kollman (United Atoms) forcefield [111, 112]. In this forcefield, heavy

atoms and the non-polar hydrogen atoms adjacent to them are treated as

single (united) spheres and the only hydrogen atoms that are accounted for

individually are the polar ones. Ligands, co-factors and metal ions were

assigned charges using the Gasteiger-Hückel method [28], which, unlike the

former, treats all the atoms separately. Charge assignments were performed

using Sybyl® 7.1. Ligand rotatable bonds were marked by visual examination.

After the preparation, any existing co-factors were merged with the

protein and treated as part of the appropriate protein model. Atom types

were assigned automatically by the appropriate utilities in AutoDock suite.

Of the 100 complexes, 19 were excluded for the following reasons:

• 1 complex (PDB code: 830c) containing both zinc and calcium

• 6 complexes with a co-factor that contains Phosphorus atoms (due to

lack of validated parameters): 1aoe, 1dib, 1dlr, 1frb, 1syn, 7dfr

• 8 complexes with ligands that contain more than 8 atom types (this

limitation is imposed by AutoDock): 1qwx, 1ls, 1mq5, 1mq6, 1gl9,

1ydt, 2csn.

Table 2.1: PDB codes of the 81 complexes in the rigid protein test set.

13gs 1cim 1f0r 1h1s 1k1j 1nhu 1ydr 5std

1a42 1d3p 1f0t 1h9u 1k22 1nhv 1yds 5tln

1a4k 1d4p 1f4e 1hdq 1k7e 1o86 2cgr 7est

1a8t 1d6v 1fcx 1hfc 1k7f 1ppc 2pcp 966c

1afq 1efy 1fcz 1hpv 1kv1 1pph 2qwi

1atl 1ela 1fjs 1htf 1kv2 1qbu 3cpa

1azm 1etr 1fkg 1i7z 1l8g 1qhi 3erk

1bnw 1ett 1fm6 1i8z 1lqd 1qpe 3ert

1bqo 1eve 1fm9 1if7 1m48 1r09 3std

1br6 1exa 1g4o 1iy7 1mmb 1thl 3tmn

1cet 1ezq 1h1p 1jsv 1mnc 1uvt 4dfr

• 4 structures with incomplete protein structure in proximity to the ligand

(cutoff: 10 Å): 1f4f, 1f4g, 1ohr, 1uvs

The remaining 81 complexes are listed in Table 2.1.

2.4.3 Comparisons and their analysis

In this work I compare the performance of ISE-dock to that of AutoDock,

Glide and GOLD. AutoDock was chosen due to the fact that it allows

direct comparison of ISE and LGA search algorithms, without any bias from

the scoring function. The latter two programs showed the best performance

in a previous extensive analysis by Perola et al.[84]. ISE and LGA results

reported are average values of three independent simulations with different

seed numbers of the random number generator. Glide and GOLD results

were kindly provided by Dr. E. Perola.

The ISE algorithm is compared to LGA using the same energy function.

ISE-dock differs from GOLD and Glide in both the search strategy

and the scoring. Such differences, in search and in scoring, characterize

most comparisons of docking programs. In all the tested programs, ligand

flexibility (torsion angles only) is accounted for, while the proteins

are kept rigid. Several protocols for comparing docking algorithms have been

proposed[34, 13, 110, 44, 25]. The choice of a particular comparison pro-

cedure frequently depends on the particular problem, the data set and the

programs under investigation. In order to be able to compare our results

to those obtained by Perola et al. with Glide and GOLD, we followed

their criteria[84] and used the RMSD of the top ranking solution versus the

corresponding crystal structure, and the best RMSD within the top 20 so-

lutions. We have also used the best RMSD within the entire docked set of

ISE and LGA as an additional criterion. This latter criterion indicates the

ability of the algorithm to cover the solution space, and is less dependent on

the scoring function. RMSD is calculated using heavy atoms of the ligands.

To examine the statistical significance of those criteria, we added the paired

t-test (PTT).
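The three RMSD criteria can be summarized in a short Python sketch. It assumes that each pose is given as a pair of its energy and a list of heavy-atom coordinates in the same atom order as the crystal structure, and it does not handle symmetry-equivalent atoms; the function names are hypothetical.

```python
import math

def rmsd(coords, reference):
    # Root mean square deviation over matched atoms (no superposition, no symmetry).
    total = sum((x - y) ** 2
                for p, q in zip(coords, reference)
                for x, y in zip(p, q))
    return math.sqrt(total / len(reference))

def rmsd_criteria(poses, reference, top_n=20):
    # poses: list of (energy, coords); lower energy means better rank.
    ranked = sorted(poses, key=lambda pose: pose[0])
    values = [rmsd(coords, reference) for _, coords in ranked]
    return {
        "top": values[0],                       # top ranking solution
        "best_in_top_n": min(values[:top_n]),   # best RMSD within the top 20
        "best_overall": min(values),            # best RMSD in the entire docked set
    }
```

The last value does not depend on the ranking at all, which is why it reflects the coverage of the solution space rather than the quality of the scoring function.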

2.4.4 Paired t-test

The need to apply statistical methods for comparing docking algorithms has

been recently suggested[13]. We have the RMSD results for each docking

experiment available for each of the algorithms to be compared (either ob-

tained by us or by Perola et al.[84]), therefore we can compare results of

ISE-dock to those obtained by the others by using a paired t-test (PTT).

We compare the paired RMSD differences (for all protein complexes docked

by two algorithms – ISE and another) under the assumption that the paired

differences are independent and identically normally distributed.
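The test statistic of the paired t-test is simple to compute from the per-complex RMSD values. The Python sketch below returns the t statistic and the number of degrees of freedom; the function name and the example numbers are made up for illustration, and the p-value would be read from Student's t distribution with the returned degrees of freedom.

```python
import math
import statistics

def paired_t(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Hypothetical RMSD values (in Å) of two programs on the same four complexes.
t_stat, dof = paired_t([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 2.5, 3.0])
```

For these made-up numbers the differences are 0.5, 1.0, 0.5 and 1.0, giving t ≈ 5.196 with 3 degrees of freedom.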

2.4.5 Comparing CPU time

Variable computation times are the result of differences in CPU, in algorithm

implementation and in other program specific issues. As both LGA and

ISE are parts of the same program, and most (>85%) computation time is

spent on energy evaluations, we use the number of energy evaluations as

an independent estimate of time performance. In order to enable a common

basis for comparing performance we changed the default output size for LGA

from 10 poses to 35. This size was chosen so that the average number of

energy evaluations using LGA (≈ 8.8 × 10^6) would approximately equal the

average number of energy evaluations performed during ISE optimizations

(≈ 8.6 × 10^6).

2.4.6 Energy funnels

The existence of funnels in the energy landscape has been proposed for pro-

tein folding[109, 62, 100, 105] and has been further expanded for protein-

protein[99, 115] and protein-ligand recognition[96, 109]. It has been suggested

that the shape of such funnels correlates with the nature of binding

between the molecules[62]. In the part of this work that deals with flexible

ligand - rigid protein docking, I utilize the ability of ISE to generate large

populations of near-optimal solutions to estimate the energy landscape in the

vicinity of the minimum. For each docked complex, I construct an energy vs

RMSD plot.

2.5 Flexible protein docking

As stated above, protein flexibility is introduced into this work as a series

of experiments that serve as a proof of concept that ISE can successfully

represent protein flexibility. Therefore, the experimental design is limited

to several typical cases and no statistical analysis of the results is performed.

The test cases were chosen so that the flexible regions in the proteins are

small and in proximity to the bound ligand.

The ability of ISE-dock to represent changes in the protein backbone

was tested using two structures of collagenase with inhibitors. In our repre-

sentation of backbone flexibility, we follow other studies that produce mul-

tiple backbone conformations and dock a ligand to each of them, in order

to identify the protein conformation to which the ligand would preferentially

dock. However, ISE has been shown to produce higher quality backbone conformations

that are close to the experimental ones. Docking to a protein with flexible

side chains was tested on two systems: acetylcholinesterase (single side chain)

and trypsin (several side chains). In all cases, the structures chosen are from

results of X-ray crystal structure determination in the PDB and represent

real modifications of the protein structures. All the selected complexes are

pharmaceutically relevant.

Docking process

Applying ISE-dock to flexible backbone docking requires initial separation

of ligand from the receptor-ligand complex. In the next stage our loop pre-

dicting program (ISE-based) predicts conformations of flexible protein frag-

ments. This program was developed in our group and has been successfully

applied [86, 76]. During the search for optimal backbone conformations, ISE

samples flexible fragments by probing dipeptide conformations. Dipeptide

selection is performed according to the given sequence. The conformations

are evaluated by an energy function that combines penalties for deviations

from peptide geometry and interactions between the fragment and the rest

of the protein. This process results in a set of conformations sorted by the

value of the scoring function. Side chains are represented as centered on Cα

in the evaluation of interactions. The side chains are added to each backbone

conformation in a subsequent step using the program SCAP[114].

Main chain Following the generation of backbone conformations of a loop

or protein fragment, the ligand is docked into each of a selected set of pro-

tein conformations. This set is limited in number due to computational

restrictions and also due to the energy gap from the lowest energy (global

minimum) conformation. It is reasonable to assume that an energy loss of

5 kcal/mol may be compensated by interactions with a ligand. Thus, we

used a threshold of 5 kcal/mol for backbone conformations above the global

minimum in order to pick a small set out of a much larger one, produced

by ISE. In each docking experiment, 4096 conformations were generated as

a result of applying ISE to the flexible ligand positions, with each protein

conformation. The sets for all protein conformations are merged and sorted,

and only the best 4096 conformations remain for final examination.
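The selection of backbone conformations and the merging of the docking results can be sketched in Python as follows. The data layout (lists of energy and identifier pairs, and a dictionary of pose lists per backbone) and the function names are assumptions for illustration; only the 5 kcal/mol window and the limit of 4096 poses come from the text.

```python
import heapq

def select_backbones(conformations, window=5.0):
    # conformations: list of (energy, backbone_id); keep those within
    # `window` kcal/mol of the lowest energy conformation.
    e_min = min(energy for energy, _ in conformations)
    return [c for c in conformations if c[0] - e_min <= window]

def merge_poses(per_backbone, keep=4096):
    # per_backbone: dict backbone_id -> list of (docking_energy, pose).
    merged = ((energy, backbone_id, pose)
              for backbone_id, poses in per_backbone.items()
              for energy, pose in poses)
    # Keep the `keep` lowest energy poses, sorted by energy.
    return heapq.nsmallest(keep, merged, key=lambda item: item[0])
```

Each merged entry keeps the identifier of the backbone conformation it was docked to, so that the preferred protein conformation can be read off the top of the list.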

Side chains To perform ligand docking that includes flexible side chains,

an initial decision must be made as to which side chains will be treated as flexible.

Those specific side chains are then “combined” with the ligand, thus

becoming flexible as the ligand is. Preparation of structures for computations

follows the procedure described in Section 2.4.2.

2.5.1 Protein Backbone Flexibility –

Test Case of Collagenase

General

The protein family of Matrix Metalloproteinases (MMPs) is responsible for

metabolizing the macromolecular components of extracellular matrix. The

collagenase subfamily (MMP–1, –8 and –13) enzymes are responsible for

cleaving fibrillar collagen. This cleavage is a key process in rheumatoid and

osteoarthritis[59]. The crystal structure of collagenase-3 (MMP–13) with

RS-130830 (diphenyl-ether sulphone based hydroxamic acid) has been solved

with a resolution of 2.4 Å (PDB code: 456c)[59]. Fibroblast collagenase-1

(MMP–1) in complex with RS-104966 (N-hydroxy-2-[4-(4-phenoxy-benzenesulfonyl)-

tetrahydropyran-4-yl]-acetamide) has been solved with

1.9 Å resolution (PDB code: 966c)[59]. The ligands RS-130830 and RS-104966

are chemically similar, differing only by an additional substitution on one of

Table 2.2: Affinities to collagenase

                 RS-130830    RS-104966
PDB complex      456c         966c
Ki (nM)
  MMP–1          590          23
  MMP–13         0.52         0.13

the two phenyl rings. The molecules have different specificity profiles towards

MMP–1 and –13 (Table 2.2).

The two proteins share 59% sequence identity, and have very similar 3D

structures. The major difference between the structures of these two proteins

is in a few characteristics (orientation, amino acid contents and length) of a

single loop: residues 243–255 (13 amino acids) for PDB structure 456c and

residues 239–249 (11 amino acids) for structure 966c. This, together with

the residue at position 218 (according to SWISS-PROT numbering of 456c)

form the specificity pocket – the sub-site that is responsible for collagenase

specificity as well as the specificity of quite a few other MMPs. Figure 2.2

presents a structural alignment of the two collagenases. As one may see, the

backbone traces of the two differ mainly in fragments Gly 248 — Met 253 in

456c (6 amino acids) and Ser 244 — Leu 247 in 966c (4 amino acids). These

fragments belong to the S1′–specificity pocket.

Figure 2.2: Structural alignment of 456c and 966c. Backbone traces of the proteins are color coded according to the distance (in Å) between the aligned backbone atoms. RS-130830 (red) and RS-104966 (green) are shown as stick models.

Comparisons and their analysis

Our flexible backbone docking involves initial prediction of loop positions,

rigid docking of the ligand to these multiple loops and then combining the

results into a single set. The computational effort that is involved in this

multistep methodology is much greater than the computational cost of rigid-

protein docking. Due to the need to apply a few programs in order to obtain a

set of final results, the effect of the additional investment of CPU time cannot

be assessed nor isolated. Therefore we do not compare flexible-backbone

docking to rigid protein docking.

Protein backbone conformations of fragments or loops are produced by

applying ISE to the structure of the protein in the protein-ligand complex,

without the presence of the ligand. To evaluate the results, we compare the fragment

conformations to the original loop/fragment conformation in the complex

by measuring backbone atom deviations (using RMSD).

For the ligand, its predicted position is compared to the one observed crystal-

lographically using RMSD of heavy atoms. Ligand RMSD of the top scored

conformation, best RMSD in top 20 and in all available solutions are re-

ported. Ideally, RMSD of all movable atoms (protein backbone, side chain

atoms and the ligand) needs to be calculated. To calculate RMSD over this

set of atoms, one needs to take into account the numerous local axes of sym-

metry present in any protein-ligand complex. Phenyl rings, carboxylate and

guanidine groups are examples of substructures that contain such axes. Cor-

rect accounting for symmetry axes is a complex combinatorial problem with

an exponential complexity. Due to the preliminary nature of flexible protein

docking experiments and in order to simplify the process of evaluation, I de-

cided to use two values simultaneously: RMSD of the ligand and RMSD of

protein backbone atoms.

2.5.2 Flexibility of a single side chain –

Test case of acetylcholinesterase

General

Acetylcholinesterase (AChE) plays an important role in regulating the func-

tions of the central and peripheral nervous systems. This enzyme cleaves

acetylcholine, which is secreted by neuron vesicles into the synapse that sep-

arates the vesicle and the membrane of the next cell in line. Acetylcholine

encounters receptors on that membrane and activates the continuation of the

Figure 2.3: Cross section of AChE complexed with acetylcholine (PDB code: 2ace), colored (A) by partial charge of the atoms and (B) by the residue type (colored by PyMol): hydrophobic (GILMPV) – white, aromatic (FWY) – magenta, semipolar (C) – yellow, polar (HNQST) – cyan, positive (KR) – blue, negative (DE) – red. Acetylcholine is colored blue in both panels.

neuronal transmission. AChE cleaves acetylcholine in a two step reaction into

choline and acetate, thus terminating the signal. The catalysis occurs in a

very deep, electron-rich, binding pocket, which is also called the gorge (see

Figure 2.3). The structures of AChE complexed with Huperzine

A (PDB code: 1vot) and with Aricept (PDB code: 1eve) differ mainly in

the position of the side chain of one residue, Phe 330 (Figure 2.4)[97]. When

AChE is complexed with Huperzine A (1vot), Phe 330 adopts the confor-

mation that keeps the binding gorge closed. When, on the other hand, the

bulkier Aricept molecule is present in the complex (1eve), Phe 330 adopts

a conformation that allows the entry of this bigger ligand to the binding

pocket. The two conformations differ in the χ1 angle (1eve

– 105.3°; 1vot – 58.9°).

Comparisons and their analysis

To assess the performance of ISE-dock, results of rigid-protein docking and

cross-docking to AChE (1eve and 1vot) are compared to those obtained by

Figure 2.4: AChE complexed with Huperzine A (PDB code: 1vot, light gray) and with Aricept (PDB code: 1eve, dark gray). The ligands and Phe 330 side chains from both the complexes are highlighted using sticks.

flexible docking. A total of 4 cross docking experiments are performed with

each method. The comparison is done using RMSD of the ligand only (heavy

atoms) due to the very strong similarity between the backbones of 1eve and

1vot, which differ by an RMSD of only ≈0.2 Å. In addition, the RMSD of all movable

heavy atoms is calculated (including side chains). This allows an evaluation

of our docking by the commonly accepted RMSD criteria, but does not allow a comparison of

rigid and flexible docking.

As with the rigid docking, we use the three criteria of (1) top ranked

solution, (2) best out of top 20 poses, and (3) best available pose to compare

to the crystallographic structure.
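The three criteria can be sketched in Python as follows. This is only an illustration under stated assumptions: each pose is represented as a hypothetical (score, RMSD) tuple, a lower score means a better rank, and the function name is not taken from ISE-dock.

```python
def summarize_poses(poses, top_n=20):
    """Return the three evaluation criteria for one docking experiment.

    poses: list of (score, rmsd) tuples; lower score = better rank
    (an assumption that matches an energy-like scoring function).
    """
    if not poses:
        raise ValueError("no poses supplied")
    ranked = sorted(poses, key=lambda pose: pose[0])
    rmsds = [rmsd for _, rmsd in ranked]
    return {
        "top1": rmsds[0],                     # (1) top ranked solution
        "best_in_top_n": min(rmsds[:top_n]),  # (2) best out of top 20 poses
        "best_available": min(rmsds),         # (3) best available pose
    }
```

By construction the three values can only decrease from criterion (1) to criterion (3), which is the pattern seen in the result tables.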

2.5.3 Flexibility of several side chains –

Test case of trypsin

General

Trypsin is a serine protease in the gastrointestinal tract, where it is respon-

sible for protein hydrolysis. It is a very well studied protein with numerous

available 3D structures in the PDB. Due to its role in the digestive system,

trypsin is not very selective as it is supposed to bind and cleave a very broad

range of proteins and peptides. Due to this nonspecific binding, many struc-

turally diverse small molecules bind to trypsin. A set of 10 protein-ligand

structures was chosen as a data set for this study. Their PDB codes are:

1ppc, 1pph, 1tng, 1tnh, 1tni, 1tnj, 1tnk, 1tnl, 1tpp and 3ptb. This data set

is similar (but not identical) to that used by Kramer et al. in their evaluation

of the FlexX program[49]. In the data set used here, three residues

(Leu 99, Gln 192, and Gln 221) demonstrate conformational changes of their

side chains over this data set. These residues were identified using visual

examination of the binding pockets of all the proteins in the set. Figure 2.5

summarizes the trypsin set.

On average, the trypsin data set contains 4.1 rotatable bonds per

complex; the number varies with the ligand. The addition of

three flexible side chains results in more than a three fold rise in the number

of rotatable bonds (average of 12.4 bonds per complex). For each rotatable

bond, ISE-dock has to consider 60 possible angles, one for each 6°. This

leads to a dramatic exponential increase in the problem size ISE-dock has

to consider: ≈ 10^16 combinations in rigid-protein docking vs ≈ 10^31 combinations

Figure 2.5: Trypsin data set. 10 superimposed trypsin structures: 1ppc, 1pph, 1tng, 1tnh, 1tni, 1tnj, 1tnk, 1tnl, 1tpp and 3ptb. The ligand molecules and the residues that are treated as flexible are shown as sticks. The remaining parts of the proteins are shown as backbone trace.

after the inclusion of protein flexibility. Due to this increase in problem

size, a fair comparison of the results of flexible docking to those obtained by

rigid docking is problematic. Thus, in this proof of concept examination, the

results are not compared to those obtained by rigid docking, but evaluated

as described below.
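The growth of the search space quoted above can be checked with a short calculation. The sketch below uses only the figures given in the text (60 angle values per rotatable bond, 4.1 and 12.4 rotatable bonds on average); the function name is hypothetical, and the translational and orientational variables are ignored because they are the same in both cases.

```python
import math

ANGLES_PER_BOND = 60     # one value per 6 degrees, as stated in the text
RIGID_BONDS = 4.1        # average rotatable bonds, rigid-protein docking
FLEXIBLE_BONDS = 12.4    # average after adding three flexible side chains


def log10_growth(extra_bonds, angles_per_bond=ANGLES_PER_BOND):
    """Orders of magnitude added to the search space by extra rotatable bonds."""
    return extra_bonds * math.log10(angles_per_bond)


growth = log10_growth(FLEXIBLE_BONDS - RIGID_BONDS)
print(f"search space grows by about 10^{growth:.1f}")  # about 10^14.8
```

The result, close to 15 orders of magnitude, agrees with the step from roughly 10^16 to roughly 10^31 combinations.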

2.5.4 Comparisons and their analysis

Having 10 trypsin-ligand complexes, it is possible to construct a 10x10 cross

docking matrix. The values of RMSD (as calculated over all the movable

heavy atoms) of the top scored solution, the best RMSD over the top 20

poses and the best available RMSDs are reported and analyzed.
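The layout of the cross docking matrix can be sketched as below. The PDB codes are those listed for the trypsin data set; the function name is hypothetical, and each pair is written as (ligand, receptor).

```python
from itertools import product

TRYPSIN_SET = ["1ppc", "1pph", "1tng", "1tnh", "1tni",
               "1tnj", "1tnk", "1tnl", "1tpp", "3ptb"]


def cross_docking_pairs(codes):
    """Return every (ligand, receptor) pair, bound pairs included."""
    return list(product(codes, repeat=2))


pairs = cross_docking_pairs(TRYPSIN_SET)
bound = [(lig, rec) for lig, rec in pairs if lig == rec]   # matrix diagonal
cross = [(lig, rec) for lig, rec in pairs if lig != rec]   # off-diagonal cells
print(len(pairs), len(bound), len(cross))  # 100 10 90
```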

Chapter 3

Flexible ligand – rigid protein

docking

Flexible ligand - rigid protein docking was compared between ISE and other

algorithms by assigning the results to different RMSD threshold bins. Docking

results are summarized in Table 3.1. In this table, three criteria for

comparing the methods are presented: RMSD of top scoring poses, best

RMSD in top 20 poses, best RMSD in all available poses. The first criterion

assumes that the scoring function, which is related to energies, is an exact

measure of stability, therefore concentrating on the best scored results. The

second criterion assumes that the scoring function may not be able to dis-

tinguish between the best RMSD and some other poses, limiting those to

the best 20, by energy. The third criterion extends this criterion to a much

larger number of poses. The table presents the minimal RMSD for the set of

81 proteins, the maximal RMSD, its mean, median and standard deviation

and, finally, its t-test for ISE with respect to any of the other algorithms.

Table 3.1: Summary of docking results by ISE, LGA, Glide and GOLD.

          RMSD of top pose               Best RMSD in top 20 poses      Best RMSD in all available posesα
          ISE    LGA    Glide   GOLD     ISE    LGA    Glide   GOLD     ISE    LGA
Minimum   0.52   0.39   0.3     0.41     0.25   0.31   0.3     0.34     0.2    0.31
Maximum   5.99   5.95   10.63   10.19    2.46   3.65   10.36   6.35     1.64   2.74
Mean      1.73   1.9    2.57    3        0.98   0.99   1.49    1.56     0.73   0.89
Median    1.33   1.55   1.63    2.17     0.84   0.81   1.11    1.1      0.69   0.72
SD        1.14   1.39   2.58    2.44     0.51   0.62   1.44    1.26     0.37   0.5
P(PTT)    0.09   0      0       —        0.46   0      0       —        0.01   0.006

α ISE-dock: 4096 poses; AutoDock: 35 poses.

The detailed results for each of the complexes are presented as Table C.1 in

Appendix C. In the analysis of Figure 3.1 and additional figures presented

below, I demonstrate that the performance of ISE-dock is in many aspects

better than the performance of several well established docking programs.

One should note that the results for Glide and GOLD reported here, which

were obtained by Perola et al.[84], differ slightly from those published in the

original work, because the published values were computed over

100 publicly available and 50 internal company structures[84], as opposed to

the subset of 81 publicly available structures used in this report.


3.1 Top scoring poses

Figure 3.1 presents the fraction of top scored poses in the full set of docking

experiments, in a given RMSD threshold from the crystal structures. It

may be seen that ISE-dock achieves better results than the other three

programs when considering 50% of the complexes or more. ISE did not

dock any complex with a top scored solution below 0.5 Å. For the remaining

threshold values, ISE and LGA outperform (to various degrees) Glide and

GOLD with respect to the number of structures with top scored solutions

below the corresponding threshold. For thresholds above 1.0 Å, there is a

slight advantage of ISE over LGA, which increases for larger threshold values.

About 70% (65% for LGA) of the top scoring structures are found by ISE-dock

to be under 2.0 Å RMSD from experiment and nearly 85% (76% for

LGA) are found under 3.0 Å.
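The threshold analysis used in this section can be illustrated by the following minimal sketch. The thresholds are the ones mentioned in the text (0.5, 1.0, 2.0 and 3.0 Å); whether the figures use additional bins is not assumed here, and the function name is hypothetical.

```python
def fraction_below(rmsds, thresholds=(0.5, 1.0, 2.0, 3.0)):
    """Fraction of complexes whose RMSD lies below each threshold (in Å)."""
    if not rmsds:
        raise ValueError("no RMSD values supplied")
    n = len(rmsds)
    return {t: sum(1 for r in rmsds if r < t) / n for t in thresholds}
```

Applied to the top scored RMSD of each complex, such a function yields the cumulative fractions that are compared between the programs.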

The mean and median RMSD values for the top scored poses, as well as

the standard deviations, are better with ISE than LGA, Glide or GOLD.

The PTT values for ISE results vs the others are: LGA: P=0.09, Glide: P=0.002

and GOLD: P< 0.001. P is the probability that the difference between the

algorithms is random, as calculated by PTT.

Top scoring poses are complexes of best interaction energy, and are expected

to show the lowest RMSD from experiment. However, they are

frequently found to have larger RMSD values due to (1) limited inclusion

of flexibility and (2) limitations of the scoring functions, which compromise

between speed and quality. Still, these scoring functions are expected to be

good enough to identify the best answers among the top results for a docking


Figure 3.1: Top single docking poses at different RMSD bins with respect to crystal structures, 4 different programs. Results for Glide and GOLD were obtained by Perola et al.[84].

experiment, and the number 20 was chosen[84] to probe for such best RMSD

results.


3.2 Top 20 poses

Comparison of top 20 poses demonstrates that ISE-dock outperforms both

Glide and GOLD and shows better or similar performance, compared to

AutoDock's LGA. The mean and the median RMSD values of the best out

of the top 20 poses obtained by ISE are similar to those obtained by LGA

and are better than those obtained by the other two algorithms. Pairwise

comparison shows that the performances of ISE and LGA on the top 20 poses

are statistically indistinguishable (P=0.46). Examination of the best 20 docking poses shows that

ISE is clearly better than Glide and GOLD, with a probability P≤0.001

with respect to either of these two (see Table 3.1). Figure 3.2 demonstrates

that LGA and ISE have an advantage over Glide and GOLD for the top 20

poses in all RMSD ranges. ISE results for the 0.5 Å, 2.0 Å and 3.0 Å thresholds are

better than those of LGA. ISE alone produced at least one 3.0 Å or better

solution among the top 20 poses for the entire test set (100.0% compared to

97.5%, 90.1% and 87.6% for LGA, Glide and GOLD, respectively). In 98%

of the examined molecules, ISE produced solutions that are closer than 2.0 Å

to experiment. Examination of the top 20 poses is most meaningful for

comparing between the programs, as it appears to indicate that the sampling

conducted by ISE-dock is indeed more thorough than the sampling of the

other programs.


Figure 3.2: Top 20 docking poses, RMSD to corresponding crystal structures. Results for Glide and GOLD were obtained by Perola et al.[84].

3.3 Solution space coverage

ISE’s ability to generate very large populations of near-optimal solutions re-

sults in much better coverage of solution space near the (global) minimum.

This is borne out by comparing best RMSD in the full set of solutions by

ISE and LGA in similar CPU time (4096 and 35 solutions, respectively). The

population obtained in standard runs of ISE is larger than that obtained by

LGA by more than 100-fold. This significantly increases the chance of

finding docking poses with lower RMSD values. It is reasonable to compare

populations that differ that much in size, as we show in the discussion of

alternative binding modes in the results section. I could not compare ex-

tended docking populations for Glide and GOLD, as no such data were

reported. It should be emphasized that ISE’s 4096 solutions in this case, and

any number of solutions in other cases, are not merely poses encountered


during the random search, but are the best ones following the probing of the

whole space. The PTT probability value for comparison of the two docking

sets is P=0.006. ISE results are better with respect to all five

summary statistics (minimum, maximum, average, median and standard

deviation) of the best RMSD in the entire solution set (Table 3.1). When

examining the percentage of complexes with at least one solution below a

certain threshold, as depicted in Figure 3.3, the most prominent difference

between the algorithms is at 0.5 Å: 32.0% vs 17.3% in favor of ISE. This difference

drops to 3.7% in favor of ISE at 2.0 Å. All 81 complexes

were docked by ISE with at least one solution below 2.0 Å. LGA succeeded

in docking all the complexes with at least one solution below 3.0 Å. These findings

suggest that populations docked by ISE, combined with a more accurate

scoring technique, may lead to better detection and identification of relevant

docking results.

The ISE docking population (comparing by CPU time, 4096 top solu-

tions of ISE vs 35 of LGA) is much more diverse in its poses than that

produced by LGA. We clustered the poses using the Sequential Leader Clustering

algorithm[36], with a default distance criterion of 1.0 Å. The average

number of clusters for the 81 molecules is ≈1870 for ISE and ≈14 for LGA.
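For readers unfamiliar with leader-type clustering, a minimal sketch of the general idea follows. It is not the implementation used in this work: the first-match assignment rule, the function name and the pluggable distance function are assumptions made for illustration (in the docking context the distance would be the RMSD between two poses, and the threshold 1.0 Å).

```python
def leader_clustering(items, distance, threshold=1.0):
    """Sequential leader clustering in a single pass over the items.

    Each item joins the first cluster whose leader lies within the
    threshold; otherwise it becomes the leader of a new cluster.
    """
    leaders = []
    clusters = []
    for item in items:
        for index, leader in enumerate(leaders):
            if distance(item, leader) <= threshold:
                clusters[index].append(item)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            leaders.append(item)
            clusters.append([item])
    return clusters
```

If the poses are supplied in order of increasing energy, every leader is the best scored member of its cluster, and the number of clusters gives a simple measure of the diversity of the population.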


Figure 3.3: Top available docking poses produced in equal CPU times, RMSD to corresponding crystal structures. The numbers of poses are 4096 (ISE) and 35 (LGA).

3.4 Time performance

We used the time performance of ISE and LGA in order to choose approxi-

mately equal processing times and analyze the number of solutions obtained

in that span of time. The average time needed to obtain 4096 docking solutions

on an Intel® Xeon™ 3 GHz computer, using ISE with the current

settings, was about 7.5 minutes. The average time needed to obtain 35 so-

lutions using LGA was about 8.3 minutes. As mentioned above, the time

required by LGA is linear with the number of solutions. Thus, it is expected

that more than 16 hours are required to obtain 4096 docking solutions with

LGA. For AutoDock, it has been recently suggested to increase the reli-

ability of results by obtaining more solutions and by increasing the number

of evaluations[66]. Such an increase has a substantial toll in computer time,

which is absent in ISE. We could not compare the time performance of ISE-


Figure 3.4: Number of iterations before switching to exhaustive search as a function of initial combinatorial size (number of initial combinations).

dock to those of Glide and GOLD. Results for the quality of the solutions

with these programs are reported here as they appear in Perola et al.[84].
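The extrapolated LGA run time quoted above follows from simple proportionality, as the short sketch below shows. It uses only the timings given in the text and assumes, as stated there, that LGA time is linear in the number of solutions; the function name is hypothetical.

```python
LGA_MINUTES = 8.3       # average time for 35 LGA solutions
LGA_SOLUTIONS = 35
ISE_SOLUTIONS = 4096    # population size of a standard ISE run


def extrapolated_hours(target, minutes=LGA_MINUTES, solutions=LGA_SOLUTIONS):
    """Hours needed for `target` solutions if time scales linearly."""
    return target * (minutes / solutions) / 60.0


hours = extrapolated_hours(ISE_SOLUTIONS)
ratio = ISE_SOLUTIONS / LGA_SOLUTIONS
print(f"{hours:.1f} hours, population ratio {ratio:.0f}")
```

The estimate is slightly above 16 hours, and the ratio of the population sizes is about 117.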

The initial number of total possible combinations for ISE docking ranged

from 10^12 to 10^34, depending on the number of ligand rotatable bonds,

which ranged between 2 and 14. The number of iterations (between 50 and 76

for different molecules) needed to reduce the size of the problem below the

threshold (10^5 combinations for switching to exhaustive computations) is approximately

linear with respect to the logarithm of the initial problem size.

The graph that describes this relationship is shown in Figure 3.4. Based on

that linearity, it should be possible to extend the number of variables and

values to include protein side chains, main chain angles as well as additional

degrees of freedom.


3.5 Multiple binding modes

A growing body of data supports the existence of multiple binding modes of

ligands to receptors[18, 9, 44, 91, 57, 27, 35, 81]. In order to learn about mul-

tiple binding modes from ISE-dock, the shape of energy landscapes around

minima in energy vs RMSD graphs of ISE results is examined. These plots

may be roughly divided by visual examination into three groups: those with

one distinct funnel, those with multiple funnels and those with no distinct

funnel. It has been suggested[62] that existence of a single “canyon” at the

bottom of the energy landscape corresponds to a stable structure, multiple

minima might indicate the existence of multiple binding modes, and rugged

and unshaped energy vs RMSD plots may be the result of a looser or non-

specific binding, induced fit phenomena or domain swapping.

Figure 3.5A shows a representative of a few complexes that appear on

energy vs RMSD plots with a single funnel-like region (PDB code 1yds).

As expected, in this case, the docking solutions are structurally close to

the crystallographic pose and to one another (Figure 3.5C). Figure 3.6A

demonstrates an energy vs RMSD plot with two funnels (PDB code 1bqo),

while Figure 3.7A shows such a plot with no distinct funnel (docking results

of 1hpv). As one may see from Figure 3.6C, there are at least two predicted

binding modes for the 1bqo complex, which is in agreement with our previous

suggestion. In Figure 3.7C, the ligand positions are spread over a large

conformational variation. Energy vs RMSD plots of the entire data set of

81 complexes after a single docking run are presented in Figures C.1 – C.7

(Appendix C.2).


Figure 3.5: A: Energy vs RMSD plot for docking populations of the complex 1yds obtained with ISE, showing a single distinct funnel. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The first 35 solutions (dark lines) docked by ISE vs the ligand in the crystal (gray sticks). Receptor residues with at least one atom within 5.5 Å of the ligand are shown as light gray cartoon. All structures in this work were visualized using PyMol[15].

In 27 cases (34%), the span of energy for 4096 solutions between the

global minimum (GM) and docking solution of highest energy is less than 5

kcal/mol. In 50 cases (61%), all 4096 solutions are within 5 – 15 kcal/mol

from the GM, and in only 4 cases (5%), the energy spread is larger than 15

kcal/mol. Figure 3.8 shows the cumulative percentage of solutions (for 81

complexes, each with 4096 poses) with increasing energy gaps from the GM,

thus clarifying that most conformations are close to the GM. These 4 plots


Figure 3.6: A: Energy vs RMSD plot for docking populations of the complex 1bqo obtained with ISE, showing two distinct funnels. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The crystal structure of the ligand (gray sticks) and the first 35 solutions (dark lines) docked by ISE.

with high energy minima (1fm9, 1hpv, 1qbu, 3std) have, as in Figure 3.7A, no distinct

funnel. The docking poses of these 4 complexes have no single binding mode,

but are dispersed. The main feature of these complexes is the deeply buried

ligands in binding pockets (data shown for 1hpv, Figure 3.7).


Figure 3.7: A: Energy vs RMSD plot for docking populations of the complex 1hpv obtained with ISE, showing a scatter of the results. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The crystal structure of the ligand and the first 35 solutions docked by ISE.


Figure 3.8: Cumulative fractions (Y-axis) of 81 ISE docking complexes with an energy span between the global minimum of each (pose number 1) and the other 4095 poses, below the given threshold (X-axis).


3.6 PDB data supports distinct funnels

Twenty four plots with multiple distinct funnels are found in our test set

(1azm, 1bqo, 1cim, 1eve, 1f4e, 1fm6, 1h1p, 1h9u, 1hdq, 1if7, 1iy7, 1jsv, 1k7e,

1kv1, 1qhi, 1qpe, 1r09, 1uvt, 1ydr, 3cpa, 3std, 4dfr and 5std). Ligands of

two of the twenty four complexes are present in the PDB in complexes with

other proteins (5-acetamido-1,3,4-thiadiazole-2-sulfonamide from 1azm in 9

complexes; 6-O-cyclohexylmethyl guanine from 1h1p in 2 complexes) but

display similar binding modes in all of them. One complex (3cpa) contains

glycyl-tyrosine as a ligand, which is not searchable in the PDB as it is not rec-

ognized as a hetero compound. Two complexes contain “related structures”

– same or similar proteins with different ligands (1f4e, 1kv1). Of these two,

I would like to concentrate on p38 MAP kinase that was crystallized with an

inhibitor (PDB code: 1kv1; ligand HET ID: BMU)[80]. Another structure

of the same protein exists in the PDB bound to a structurally different lig-

and (PDB code: 1kv2, ligand HET ID: B96)[80]. Figure 3.9 demonstrates

that those ligands bind in two different modes. The ligand in 1kv2 is much

larger (527 g/mol) than the ligand in 1kv1 (306 g/mol). An additional no-

ticeable difference between the two ligands is that the toluyl group of 1kv2

is positioned in the place of the CH2 pyrrole group of the ligand in 1kv1.

The energy vs RMSD plot for the 1kv1 complex (Figure 3.10) displays

three distinct funnels with solutions ranked 1, 222 and 270 at their bottom

(marked d1, d222 and d270). These three poses are summarized in Table 3.2.

As may be seen in Figure 3.11, the top scored pose is close to the crystal

structure position (RMSD of 1.37 Å). In the d222 solution (Figure 3.12) the


Figure 3.9: Complexes 1kv1 (light gray) and 1kv2 (dark gray) superimposed using backbone atoms. The ligands are shown as sticks and backbone of closest (within 5.5 Å) residues to the ligand are shown as PyMol cartoons.

ligand is positioned in reverse to d1, while in d270 it is positioned so that

chlorophenyl is in the position of toluyl in 1kv2 (Figure 3.13). Generally,

LGA is capable of producing cluster-like structures when plotting the calcu-

lated solution energy vs RMSD from a single structure even when configured

to predict a relatively small number of docking solutions (see for example docking

solutions for complexes 2cgr, 3cpa or 4dfr in section C.2 of the Appendix).

Nevertheless, in the case of 1kv1, the points on Figure 3.10B, representing

35 LGA solutions, are all clustered around a small well defined region in the

E vs RMSD plot and do not suggest any alternative binding modes.

It has been proposed that thyroxine binds to Transthyretin in two antiparallel

modes[27, 35, 81]. ISE-dock and AutoDock's LGA were applied

to re-dock the thyroxine ligand from its crystal structure complex with


Figure 3.10: Energy vs RMSD plot for docking populations obtained by ISE (A) and LGA (B) of the complex 1kv1. The plots are shown using the same scale. The best single ISE solutions at each of the three funnels have ranks 1, 222 and 270 and are marked with arrows.

Figure 3.11: The best ISE-dock solution for 1kv1 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). 1kv1 is colored according to: C – cyan, N – blue, Cl – green. 1kv2 is colored according to: C – yellow, N – blue, O – red.

Human Transthyretin (PDB: 2rox). The energy vs RMSD plot for the pop-

ulation obtained by ISE shows at least two distinct funnels with docking

solutions ranked 1st and 2nd (marked d1 and d2, respectively) at the bottom

of the energy funnels. Figure 3.14C shows that the two solutions are

indeed antiparallel. The solutions by LGA do not suggest an alternative


Figure 3.12: ISE-dock solution for 1kv1, ranked 222 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical to that of Figure 3.11.

Table 3.2: Binding modes of 1-(5-tert-butyl-2-methyl-2H-pyrazol-3-yl)-3-(4-chloro-phenyl)-urea (from 1kv1)

Pose   E (kcal/mol)   RMSD (Å)
1      -10.51         1.37
222    -9.13          3.95
270    -8.84          4.69

binding mode. Figures 3.14A and 3.14B show the energy vs RMSD plots for ISE and

LGA docking solutions of 2rox.


Figure 3.13: ISE-dock solution for 1kv1, ranked 270 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical to that of Figure 3.11.

In AutoDock, the lower number of solutions supplied by LGA compared

to ISE in similar CPU time provides fewer suggestions for ligand binding

modes. This is further emphasized by the smaller number of clusters of LGA

docking compared to ISE-dock, which covers solution space better than

LGA in a similar CPU time. Large ISE populations may thus help compensate for

the imperfections in the energy functions.


Figure 3.14: Energy vs. RMSD plot for docking populations of the complex 2rox, obtained by ISE (A) and LGA (B). The best single ISE solutions at each of the two funnels have ranks 1 and 2 and are marked with arrows. C: Antiparallel docking solutions ranked 1 and 2 for 2rox (green and magenta sticks respectively). The carbons in the crystal structure of thyroxine are shown as thin sticks colored cyan. The backbone of closest (within 5.5 Å) residues to the ligand are shown in PyMol cartoon representation colored cyan.

Chapter 4

Flexible Ligand – Flexible Protein

Docking

4.1 Protein backbone flexibility – test case of

collagenase

The coordinates of MMP–13 and MMP–1 (456c and 966c) were obtained

from the PDB. All the water molecules and metal ions, except for the catalytic

zinc, were removed. The ligands were separated from the protein and

saved in a separate file. As the 456c structure contains two identical chains,

only one of them (A) was used. Alternate positions of the conformation-

ally flexible loops (residues 248 – 253 for 456c and residues 244 – 247 for

966c) were produced by ISE. As any ISE implementation produces multi-

ple near-optimal solutions, only the conformations that differ from the best

scored one (global minimum) by not more than 5 kcal/mol were chosen for

the next step. For 966c, there were 31 such solutions and RMSD of backbone

atoms with respect to the crystallographic structure (over the flexible region

only) ranged between 0.09 Å and 0.33 Å. In the case of 456c, only 5 solutions

with energy values within 5 kcal/mol of the global minimum were generated.

Their RMSD values were slightly higher than those of 966c and ranged between

0.59 Å and 0.61 Å. These RMSD values describe only the backbone of

the protein flexible fragment (loop).
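The selection of loop conformations for the next step can be sketched as a simple energy window filter. The 5 kcal/mol window is the one used in this work; the data layout (a list of hypothetical (energy, conformation id) tuples) and the function name are assumptions made for illustration.

```python
def within_energy_window(solutions, window=5.0):
    """Keep solutions whose energy is within `window` kcal/mol of the best one.

    solutions: list of (energy, conformation_id) tuples; the lowest
    energy is taken as the global minimum of the population.
    """
    if not solutions:
        return []
    global_minimum = min(energy for energy, _ in solutions)
    return [(energy, conf_id) for energy, conf_id in solutions
            if energy - global_minimum <= window]
```

Each retained conformation then serves as a separate rigid target for the subsequent docking of the ligand.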

Loop generation is conducted without the presence of the ligand, and side

chain conformations are generated by SCAP[114] using the optimized back-

bone conformations (section 2.5.1, page 58). Although the geometry of the

backbone in both the proteins is very close to that observed in the PDB struc-

tures, the prediction of side chains positions has been also performed with

no ligand presence. This might be the reason for the relatively high RMSD

values observed in this data set for the positions of the ligands (Table 4.1).

In the top 20 docking solutions the best RMSD values ranged between 1.59 Å

(456c-456c)ᵃ and 2.20 Å (966c-456c). The top scored solutions had RMSDs

between 2.25 Å and 3.49 Å. Nevertheless, the docking results indicate that

ISE-dock has successfully included good docking poses in the final docking

sets of all four docking experiments. This conclusion follows from the best

RMSD values in the entire docking populations of 4096 structures: the RMSD

values are below 2 Å in all cases. If the ligand and the protein originate from

the same complex, the predictions of the ligands’ poses are even better: 1.33 Å for

456c and 1.18 Å for 966c. The fact that no solution with RMSD <1 Å was

ᵃ In this work, the names of cross docking experiments follow the [ligand name]-[receptor name] template.


Table 4.1: Collagenase data set, best ligands’ RMSD (Å) in top 1, top 20 and all available (4096) solutions. RMSD of the backbone from the crystal position of the corresponding solution is also reported.

Ligand   Receptor   Top 1                Top 20               Top 4096
                    Ligand   Backbone    Ligand   Backbone    Ligand   Backbone
456c     456c       2.25     0.61        1.59     0.61        1.33     0.61
966c     966c       2.92     0.28        2.09     0.20        1.18     0.21
456c     966c       3.49     0.13        2.14     0.20        1.61     0.27
966c     456c       2.75     0.61        2.20     0.61        1.76     0.61

found among the top 20 docking solutions can be explained by two factors:

(1) the scoring function used during the docking process is not capable

of distinguishing between changes in the protein 3D structure and (2) the loop

structure was optimized with no ligand present in the binding site, while

the subsequent docking process did not allow protein accommodation to the

presence of the ligand.

Loop conformations were successfully predicted by the ISE algorithm, in

terms of the backbone structure. Nevertheless, due to the small ranges in

backbone RMSD values, no conclusion about the ability of ISE-dock on its

own to discriminate between backbone positions could be drawn.


4.2 Flexibility of a single side chain –

Test case of acetylcholinesterase

The two AChE structures in this study differ in the side chain positions of

the residue Phe 330 (Figure 2.4)[97]. The values of χ1 angles of 1eve and

1vot are 105.3° and 58.9°, respectively. The results of docking experiments

with AChE test set are presented in Table 4.2.

Rigid protein docking

Rigid bound docking resulted in good accuracy:

RMSD values of top scoring solutions were 1.85 Å for 1eve and 0.86 Å for 1vot.

The best RMSD values among the top 20 solutions were 0.63 Å and 0.81 Å

for 1eve and 1vot, respectively. When no protein flexibility was allowed,

cross docking experiments, as expected, gave worse results than the “native”

(bound) docking. A decrease in the quality of the results was observed when

Aricept (1eve), the larger of the two ligands, was cross-docked into the protein

structure that was solved in complex with Huperzine A (1vot). The RMSD

value for the top ranked solution in that case was 2.91 Å. However, the closest

ligand pose to the experimental structure (pose #813 out of 4096 poses) had

an RMSD value of 1.43 Å.

Flexible protein docking – Cross docking

When protein side chain (Phe

330) flexibility was allowed, cross docking of Aricept resulted in minor im-

provements of RMSD values in the three tested parameters. On the other

hand, in the cross docking of Huperzine A, the top 1 and the top 20 solu-

tions had worse RMSD values, compared to those obtained by rigid cross


Table 4.2: Results of acetylcholinesterase cross docking experiments (RMSD [Å]). The results are reported for the best scored solution (Top 1) and the best RMSD values out of the top 20 and out of all the available solutions (Top 4096). The ligand structures are listed in rows and the protein structures are listed in columns.

                     Rigid docking     Flexible docking
                                       Ligand’s position    All movable atoms
           Ligand    1eve    1vot      1eve    1vot         1eve    1vot
Top 1      1eve      1.85    2.91      2.17    2.12         1.95    1.85
           1vot      1.09    0.86      2.60    0.72         2.28    0.70
Top 20     1eve      0.63    1.97      1.87    1.59         1.55    1.19
           1vot      1.03    0.81      2.47    0.70         2.14    0.68
Top 4096   1eve      0.39    1.43      1.29    1.40         0.48    0.85
           1vot      0.65    0.54      0.45    0.24         0.52    0.37

docking. However, a ligand pose much closer to the experimental one was found for

Huperzine A in the entire docking solution set, with an RMSD of 0.45 Å,

compared to 0.65 Å that was obtained without protein flexibility. Examining

the predicted positions of all the movable atoms, one may find that high

quality results were included in the final docking sets of all four protein-ligand

combinations. This conclusion emerges from the RMSD values of the

closest solution out of the 4096 available ones: 0.85 Å for 1eve-1vot and 0.52 Å for

1vot-1eve cross-docking. On the other hand, the top solution and the top

20 solutions in the cross-docking cases are of relatively high RMSD. Figure 4.1

demonstrates the results of unbound docking for the AChE data set.


Figure 4.1: The best available docking solution for (A) 1eve-1vot and (B) 1vot-1eve in unbound (cross-) docking experiments. The docking solutions for all the movable atoms are shown as lines and the crystal structures are shown as sticks. The protein structures are shown as backbone trace.

Flexible protein docking – Bound docking

When flexibility of Phe 330

was included, the quality of bound docking results for Aricept (1eve-1eve)

was worse, compared to that obtained without protein flexibility. The ligand’s

RMSD values for the top scored solution, the best out of top 20 and the

best available solution were 2.17 Å, 1.87 Å and 1.29 Å, respectively. In the case

of Huperzine A bound docking (1vot-1vot), there was a slight improvement

in the prediction of ligand position: 0.72 Å vs 0.86 Å for best scored pose,

0.70 Å vs 0.81 Å for best out of top 20 solutions and 0.24 Å vs 0.54 Å for best

available solution. The decrease in quality of bound docking results upon the

introduction of flexibility (as was observed in the case of 1eve-1eve) can be

related to the increase in problem complexity. On the other hand, Phe 330

flexibility during docking of Huperzine A into a closed pocket (1vot-1vot) may

have resolved minor clashes and, as a result, gave better results. Figure 4.2

illustrates the results of bound docking for the AChE data set.


Figure 4.2: The best available docking solution for (A) 1eve-1eve and (B) 1vot-1vot in bound docking experiments. The docking solutions for all the movable atoms are shown as lines and the crystal structures are shown as sticks. The protein structures are shown as backbone trace.

4.3 Flexibility of several side chains – Test case

of trypsin

The RMSD values of χ torsional angles of the three residues that were treated

as flexible in this work are listed in Table 4.3. The structural differences

between the proteins along the data set (in terms of torsional RMSD values)

range from 2.7° (1tng – 1tnh) to 62.1° (1ppc – 3ptb).
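A torsional RMSD of this kind can be sketched as follows. The exact definition is not restated in this section, so the snippet assumes the common form: the root mean square of per-angle differences, with each difference wrapped into [-180°, 180°) so that, for example, 179° and -179° differ by 2°. The function name and the sample angles are illustrative and are not taken from this work.

```python
import math


def torsion_rmsd(chi_a, chi_b):
    """RMSD (degrees) between two equal-length lists of torsion angles.

    Assumption: each difference is wrapped into [-180, 180) before squaring.
    """
    if len(chi_a) != len(chi_b) or not chi_a:
        raise ValueError("angle lists must be non-empty and of equal length")
    total = 0.0
    for a, b in zip(chi_a, chi_b):
        diff = (a - b + 180.0) % 360.0 - 180.0  # periodic difference
        total += diff * diff
    return math.sqrt(total / len(chi_a))


# Hypothetical chi angles (degrees) of the flexible residues in two structures
structure_a = [179.0, -60.0, 65.0]
structure_b = [-179.0, -62.0, 68.0]
print(round(torsion_rmsd(structure_a, structure_b), 2))
```

Without the wrapping step, the first pair of angles would contribute a difference of 358° instead of 2° and dominate the result.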

Cross docking of the 10 PDB structures resulted in 100 different docking

experiments. The detailed results of all the experiments are listed in Ap-

pendix D. RMSD of top scoring poses, the best RMSD in top 20 poses and

the best RMSD of all the available poses are reported and analyzed in Ta-

ble 4.4 and Figure 4.3. These results are assigned to RMSD threshold bins.

The bins are identical to the ones that were used in the rigid protein docking

experiments (Section 4.2, page 87).
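The three quantities reported for every docking experiment (RMSD of the top scoring pose, best RMSD among the top 20 poses, and best RMSD among all available poses) can be derived from a scored population as in the sketch below. It assumes that each pose is available as a (score, RMSD) pair and that a lower score means a better rank; the function name and the sample population are made up for illustration.

```python
def summarize_population(poses, top_n=20):
    """Summarize a docking population given as (score, rmsd) tuples.

    Assumption: lower score = better rank.
    """
    if not poses:
        raise ValueError("empty docking population")
    ranked = sorted(poses, key=lambda p: p[0])
    return {
        "top1": ranked[0][1],
        "best_in_top_n": min(rmsd for _, rmsd in ranked[:top_n]),
        "best_available": min(rmsd for _, rmsd in ranked),
    }


# Made-up population of (score, RMSD) pairs
population = [(-9.1, 2.4), (-8.7, 0.9), (-8.5, 3.1), (-7.9, 0.4)]
summary = summarize_population(population, top_n=2)
print(summary)
```

A large gap between "top1" and "best_available" indicates that a near-native pose was sampled but not ranked first by the scoring function.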

The overall results of cross docking over the trypsin data set are good.

Table 4.3: Torsion RMSD (in degrees) of flexible residues in the trypsin data set

       1ppc  1pph  1tng  1tnh  1tni  1tnj  1tnk  1tnl  1tpp  3ptb
1ppc    0.0  43.4  42.6  42.1  43.0  40.7  40.3  41.0  61.5  62.1
1pph   43.4   0.0  35.5  34.5  36.7  34.0  33.3  34.3  58.3  61.0
1tng   42.6  35.5   0.0   2.7   4.3   4.8   5.0   4.6  48.1  37.6
1tnh   42.1  34.5   2.7   0.0   3.5   4.5   3.7   3.3  48.6  37.9
1tni   43.0  36.7   4.3   3.5   0.0   6.9   6.1   4.2  48.0  36.9
1tnj   40.7  34.0   4.8   4.5   6.9   0.0   2.8   4.4  50.6  40.9
1tnk   40.3  33.3   5.0   3.7   6.1   2.8   0.0   3.3  48.8  39.6
1tnl   41.0  34.3   4.6   3.3   4.2   4.4   3.3   0.0  49.2  39.8
1tpp   61.5  58.3  48.1  48.6  48.0  50.6  48.8  49.2   0.0  31.7
3ptb   62.1  61.0  37.6  37.9  36.9  40.9  39.6  39.8  31.7   0.0

Color map: 0 6 12 18 24 30 36 42 48 54 60≤

Figure 4.3: Top docking poses at different RMSD bins with respect to crystal structures

Contrary to the intuitive expectation, the RMSD values over the diagonals

of Table 4.4 (bound docking) are frequently not the minimum ones. The

ligand from the 1tng complex is docked to all the protein structures with


lower RMSD values, compared to the remaining ligands. On the other hand,

the ligand from 1tpp has the highest RMSD values. The detailed docking

results for the trypsin data set are listed in Table D.1 in the Appendix.

No protein-ligand combination could be docked with a top scored solution below an RMSD of 0.5A. In 5 cases, the entire docking set contained at least one such pose. In 17 cases, the top scored docking solution had an RMSD below 2.0A; in 74 cases, the 20 top scored solutions contained at least one pose with RMSD<2.0A. Solutions with RMSD<3.0A were present in all the 100 protein-ligand combinations, and 92 of them contained at least one such conformation among the top 20 docking solutions.
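Counts of this kind amount to tallying RMSD values below each threshold. The sketch below does so for a subset of the data only: the top scored ligand RMSD values of ligands 1ppc and 1tng against all ten receptors, copied from Table D.1. The thresholds are the three quoted in the paragraph above; the function name is illustrative.

```python
def count_below(rmsds, thresholds=(0.5, 2.0, 3.0)):
    """Return {threshold: number of RMSD values strictly below it}."""
    return {t: sum(1 for r in rmsds if r < t) for t in thresholds}


# Top scored ligand RMSD values (A) for ligands 1ppc and 1tng, from Table D.1
top1_1ppc = [1.72, 3.38, 2.84, 3.02, 2.59, 1.99, 2.73, 3.05, 3.44, 2.49]
top1_1tng = [0.97, 1.12, 0.53, 0.64, 0.99, 0.77, 0.90, 0.62, 1.04, 1.00]

counts = count_below(top1_1ppc + top1_1tng)
print(counts)
```

Applying the same function to the top 20 and top 4096 columns gives the corresponding tallies for the larger pose sets.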


Table 4.4: Trypsin data set, RMSD values of top single docking poses and best docking poses in top 20 and top 4096 solutions (A), color coded

                              Receptor
Ligand  1ppc  1pph  1tng  1tnh  1tni  1tnj  1tnk  1tnl  1tpp  3ptb

Top 1

1ppc     1.7   3.4   2.8   3.0   2.6   2.0   2.7   3.0   3.4   2.5
1pph     3.9   4.7   4.6   4.3   4.5   3.9   4.3   2.8   4.5   3.6
1tng     1.0   1.1   0.5   0.6   1.0   0.8   0.9   0.6   1.0   1.0
1tnh     3.4   4.4   2.8   3.4   2.6   2.1   3.5   2.1   2.1   2.8
1tni     3.0   3.1   2.8   2.3   2.5   2.3   4.1   3.8   4.1   2.7
1tnj     4.0   3.2   3.8   2.5   3.2   2.3   2.8   2.5   2.2   3.5
1tnk     4.7   4.3   2.9   3.6   2.7   2.6   3.5   4.4   3.7   3.1
1tnl     4.5   2.1   3.0   2.6   2.7   1.8   2.7   3.3   2.1   3.4
1tpp     4.8   5.6   5.2   4.6   4.6   4.5   5.5   5.2   4.2   6.0
3ptb     3.0   3.3   3.2   3.7   3.2   2.7   3.5   3.1   3.1   2.8

Top 20


Top 4096


Color map: 0 0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4 2.7 3≤


4.4 Discussion on protein flexibility

Accounting for protein flexibility introduces additional degrees of freedom,

but is a more realistic representation of biological systems. Until recently,

most major docking programs have been ignoring conformational variations

of side chains and backbone of the receptors[97]. Nevertheless, due to the

advances in the docking algorithms and in computational power, four out of

the five most cited docking programs for year 2005[95] allow some extent of

protein flexibility (Table 4.5). Therefore, any new proposed protein-ligand docking program is expected to address protein flexibility. Due to time constraints, handling protein flexibility by ISE-dock was implemented only partially as a preliminary step before further development.

Table 4.5: Current status of protein flexibility handling in ISE-dock and in five popular docking programs (sorted according to the number of citations in 2005[95])

ISE-dock   Explicit flexibility of several side chains specified by the user. Implicit handling of changes in the backbone using pregenerated populations.

AutoDock   No protein flexibility in AutoDock ver. 3. The recently released ver. 4 allows side chain flexibility of selected residues.

DOCK       Protein flexibility is not implemented.

FlexX      FlexX-Ensemble (formerly known as FlexE), an extension of FlexX. The flexibility of the protein is represented by an ensemble of structures, combined into a so-called united protein description. It is possible to recombine elements from different ensemble structures.

GOLD       Partial protein flexibility, including protein side chains and backbone flexibility for up to ten user-defined residues.

ICM        Partial protein flexibility, including protein side chains and selected loops.

In order to include protein flexibility, the scoring function of AutoDock (and thus of

ISE-dock) was extended and applied to conditions that were not accounted

for during its construction and calibration. Applying the scoring function to cases that differ dramatically from those used for its construction and calibration was a trade-off between accuracy and speed of development in the proof-of-concept phase, and it has a direct impact on the quality of the results. Although limited to small regions, protein flexibility handling in ISE-dock is successful and is another demonstration of the ability of ISE to deal with multiple degrees of freedom in protein-ligand docking problems. Indeed, docking experiments in all three test sets succeeded in producing high quality docking poses. The solutions in the collagenase set contained ligand poses with ligand RMSD values as low as 1.18A (966c-966c), the AChE set contained solutions with RMSD for all movable atoms of 0.48A (1eve-1eve) and the trypsin set contained docking solutions with even lower RMSD: 0.3A (1tng-1tng).

The main pitfall of the flexible ligand – flexible protein docking using

ISE-dock is the scoring function. The original energy function does not

account for changes in the 3D structure of the protein. Implicit protein

flexibility (collagenase data set) involves combining solutions of docking a

ligand into different protein structures. Explicit handling of changes in the

protein 3D structure during the docking process involves transferring atoms

from the protein to the ligand and exclusion of Cγ atoms of the flexible

residues from the scoring scheme. The quality of top scored solutions is

heavily biased by the scoring function. As one may see from the results


for the collagenase data set (Section 4.1) and, to a greater extent, from the

AChE and trypsin sets (Sections 4.2 and 4.3), poses that are very close to

those of the crystal structures are always sampled in the final docking sets,

but are not scored well. Rescoring the docking solutions with or without a

post-docking processing step may hopefully improve the predictive ability of

ISE-dock.

Protein flexibility is an important aspect of a protein-ligand docking pro-

gram. Other degrees of freedom that were not accounted for in this work,

but that can be introduced into ISE-dock with relative ease are modelling

of structurally important water molecules, as well as tautomeric and proto-

nation states.

Chapter 5

Conclusions

Iterative Stochastic Elimination is a generic optimization algorithm that aims

to solve highly complex combinatorial optimization problems in an efficient

and fast manner. We find that it is able to solve the docking problem, as many

others, in polynomial time. Another advantage of ISE is its ability to pro-

duce arbitrarily large numbers of near-optimal solutions without substantial

penalty in terms of CPU time. ISE was first implemented in our lab in 2000

to solve the problem of positioning polar protons in protein structures[30]

and is under constant development. Since then it was successfully imple-

mented for solving side chain positioning[31], structure prediction of cyclic

peptides[87], flexible fragments in protein backbone[86, 76] and others.

ISE-dock is a new docking program based on the Iterative Stochastic Elimination algorithm. The program’s performance in flexible ligand – rigid protein docking was compared to that of AutoDock, Glide and

GOLD on 81 complexes which are part of a set of complexes previously

chosen to compare docking programs. The ability to handle conformational


changes in the backbone and the side chains of the protein was assessed by

three independent data sets: collagenase (backbone flexibility, 2 structures),

acetylcholinesterase (single side chain flexibility, 2 structures) and trypsin

(flexibility of several side chains, 10 structures).

In flexible ligand – rigid protein docking, ISE-dock performs better than

the three docking programs with these complexes. ISE-dock succeeds in

docking all the 81 complexes with at least one solution of RMSD <3.0A

among the top 20 scored poses (LGA of AutoDock finds 97.5%, Glide finds

90.1% and GOLD finds 87.7%), and with at least one RMSD<2.0A within

the entire docking population (LGA finds 96.3%, no information is available

on Glide and GOLD). PTT of top 20 solutions and all the available solu-

tions, applied to the results of ISE-dock and to the other algorithms, shows

a clear advantage for ISE-dock.
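As a quick consistency check, the percentages quoted above can be converted back into whole numbers of complexes out of 81. The counts produced by the sketch below are inferred from the rounded percentages and are not stated in the text; the dictionary labels and the helper name are illustrative.

```python
N_COMPLEXES = 81  # size of the flexible ligand - rigid protein set

# Success rates (%) quoted in the text
reported = {
    "LGA, top 20 < 3.0 A": 97.5,
    "Glide, top 20 < 3.0 A": 90.1,
    "GOLD, top 20 < 3.0 A": 87.7,
    "LGA, all poses < 2.0 A": 96.3,
}


def implied_count(percent, total=N_COMPLEXES):
    """Nearest whole number of complexes consistent with a rounded percentage."""
    return round(percent / 100.0 * total)


counts = {name: implied_count(p) for name, p in reported.items()}
for name, n in counts.items():
    # Recompute the percentage from the inferred count as a round-trip check
    print(f"{name}: {n}/{N_COMPLEXES} = {100.0 * n / N_COMPLEXES:.1f}%")
```

Each inferred count reproduces the quoted percentage when rounded to one decimal place, so the figures are mutually consistent with a set of 81 complexes.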

The more significant results of the flexible ligand - rigid protein docking

experiments are provided by the ability of ISE to achieve large near-optimal

populations of solutions without a significant additional CPU effort. These

populations improve the coverage of solution space and may be used to es-

timate the shape of energy landscapes near minima and to suggest multiple

binding modes, as was demonstrated in two cases (p38 MAP kinase – 1kv1

and Human Transthyretin – 2rox). The ability to analyze energy landscapes

accessible to ligands in a pocket has thus been shown to be useful. However,

the accuracy of that analysis can not be fully assessed yet due to the lack

of experimental data. Although, theoretically, such an analysis of very large

docking populations is possible with other docking programs, to the best of

our knowledge, the energy (score) vs RMSD plots of docking solutions, although known previously, were not used to visualize and estimate the energy

landscape of a protein – ligand complex.

Accounting for protein flexibility introduces additional degrees of free-

dom, but gives a more realistic representation of biological systems. Handling

of protein flexibility was introduced into ISE-dock in a partial way. Even in

this premature implementation, ISE-dock was shown to successfully dock

flexible ligands into partially flexible protein structures, which include a few

side chains and consider backbone flexibility. In all the cases, the docking

populations obtained by ISE-dock contained good to excellent solutions.

In the collagenase data set (Section 4.1), flexible ligands were successfully

docked into protein structures with partially flexible loops. The accuracy in

predicting the structure of the backbone is very high with RMSD of backbone

atoms as low as 0.13A from the crystal structure. Although the top ranked

solutions for ligand positions were of high RMSD from the experimental

structure (2.25A– 2.49A), the docking populations contained high quality

solutions (RMSD of 1.18A– 1.76A).

Docking experiments with side chain flexibility (AChE, Section 4.2 and

trypsin, Section 4.3) were even more accurate: in the AChE case, the docking

populations contained solutions with RMSD values as low as 0.37A and in the

case of trypsin, the best population contained a solution with RMSD=0.30A.

The experiments presented in this work show that ISE is capable of solving very complex problems. In addition to molecular flexibility, such problems may target protonation and tautomerization states of both the protein and the ligand, explicit simulation of water molecules etc. The latter task is of great importance, as it is known (see, for example, [85, 104]) that including

water molecules improves the quality of docking results. In order to equip

ISE-dock with all these important features, one has to overcome two major

obstacles: (1) adaptation of the grid based scoring function to correctly treat

conformational changes in the protein and (2) docking several molecules (or

any independent entities) simultaneously.

Appendix A

Results published in a peer-reviewed journal

Following is the letter from the editor of “PROTEINS: Structure, Function, and Bioinformatics” journal, notifying that an article based on this work has been accepted for publication.

Return-path: <[email protected]>

Envelope-to: [email protected]

...

Message-ID:

<439655644.1187888215280.JavaMail.wladmin@mcv3-wl18>

Date: Thu, 23 Aug 2007 12:56:55 -0400 (EDT)

From: [email protected]

To: [email protected]

Subject: PROTEINS: Manuscript Prot-00274-2007.R1 Accepted

Cc: [email protected]

Errors-To: [email protected], [email protected]

PROTEINS: Structure, Function, and Bioinformatics

23-Aug-2007

Dear Mr. Boris Gorelik:

Your manuscript entitled "High quality binding modes in docking

ligands to proteins" has passed all required peer review and has

been recommended to me by the Editorial Board. I am pleased

to accept the paper for publication in the next available issue of

PROTEINS.


You will receive an e-mail immediately following with instructions

for production of your article. I look forward to seeing it in press.

Congratulations on submitting such an excellent study.

Sincerely,

Eaton E. Lattman

Editor-in-Chief

PROTEINS: Structure, Function, and Bioinformatics

The Johns Hopkins University

Department of Biophysics

Baltimore, MD 21218 U.S.A.

Appendix B

ISE-dock and AutoDock parameters and their values

B.1 AutoDock parameters and their default values

Following are the default parameters of AutoDock v 3.0.5 and their short description. For more details see the manual published by the AutoDock authors.

seed time pid                   # for random number generator
types CANOSH                    # atom type names
fld [PROTEIN_NAME].maps.fld     # grid data file
map [PROTEIN_NAME].C.map        # C-atomic affinity map file
map [PROTEIN_NAME].A.map        # A-atomic affinity map file
map [PROTEIN_NAME].N.map        # N-atomic affinity map file
map [PROTEIN_NAME].O.map        # O-atomic affinity map file
map [PROTEIN_NAME].S.map        # S-atomic affinity map file
map [PROTEIN_NAME].H.map        # H-atomic affinity map file
map [PROTEIN_NAME].e.map        # electrostatics map file

move [LIGAND_NAME].pdbq         # small molecule file
about [X],[Y],[Z]               # small molecule center

# Initial Translation, Quaternion and Torsions
tran0 random                    # initial coordinates/A or "random"
quat0 random                    # initial quaternion or "random"
ndihe 10                        # number of initial torsions
dihe0 random                    # initial torsions
torsdof 0 0.3113                # num. non-Hydrogen torsional DOF & coeff.

# Initial Translation, Quaternion and Torsion Step Sizes
# and Reduction Factors
tstep 2.0                       # translation step/A
qstep 50.0                      # quaternion step/deg
dstep 50.0                      # torsion step/deg
trnrf 1.                        # trans reduction factor/per cycle
quarf 1.                        # quat reduction factor/per cycle
dihrf 1.                        # tors reduction factor/per cycle

# Internal Non-Bonded Parameters
intnbp_r_eps 4.00 0.0222750 12 6    #C-C lj
[LENNARD JONES PARAMETERS FOR EACH PAIR OF ATOM TYPES]
intnbp_r_eps 2.00 0.0029700 12 6    #H-H lj

outlev 1                        # diagnostic output level

# Docked Conformation Clustering Parameters for
# "analysis" command
rmstol 1.0                      # cluster tolerance (Angstroms)
rmsref [LIGAND_NAME].pdbq       # reference structure
                                # file for RMS calc.
write_all                       # write all conformations in a cluster

extnrg 1000.                    # external grid energy
e0max 0. 10000                  # max. allowable initial energy,
                                # max. num. retries

# Genetic Algorithm (GA) and Lamarckian
# Genetic Algorithm (LGA) Parameters
ga_pop_size 50                  # number of individuals in population
ga_num_evals 250000             # maximum number of
                                # energy evaluations
ga_num_generations 27000        # maximum number
                                # of generations
ga_elitism 1                    # num. of top individuals that
                                # automatically survive
ga_mutation_rate 0.02           # rate of gene mutation
ga_crossover_rate 0.80          # rate of crossover
ga_window_size 10               # num. of generations for
                                # picking worst individual
ga_cauchy_alpha 0               # ~mean of Cauchy distribution
                                # for gene mutation
ga_cauchy_beta 1                # ~variance of Cauchy distribution
                                # for gene mutation
set_ga                          # set the above parameters for GA or LGA

# Local Search (Solis & Wets) Parameters
# (for LS alone and for LGA)
sw_max_its 300                  # number of iterations of
                                # Solis & Wets local search
sw_max_succ 4                   # number of consecutive successes
                                # before changing rho
sw_max_fail 4                   # number of consecutive failures before
                                # changing rho
sw_rho 1.0                      # size of local search space to sample
sw_lb_rho 0.01                  # lower bound on rho
ls_search_freq 0.06             # probability of performing local
                                # search on an indiv.
set_psw1                        # set the above pseudo-Solis & Wets parameters

# Perform Dockings
ga_run 10                       # do this many GA or LGA runs

# Perform Cluster Analysis
analysis                        # do cluster analysis on results
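Parameter lines of this form (a keyword, optional values, and an optional trailing "#" comment) are easy to read programmatically. The sketch below is a minimal illustrative reader, not the parser used by AutoDock or ISE-dock; the function name is hypothetical and the sample lines are copied from the listing above.

```python
def parse_dpf(text):
    """Parse 'keyword value ... # comment' lines into (keyword, values) pairs.

    A list is returned rather than a dict because some keywords
    (for example 'map') occur several times in one file.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue  # blank or comment-only line
        keyword, *values = line.split()
        entries.append((keyword, values))
    return entries


sample = """\
ga_pop_size 50        # number of individuals in population
ga_num_evals 250000   # maximum number of energy evaluations
# Perform Dockings
ga_run 10             # do this many GA or LGA runs
"""
params = parse_dpf(sample)
print(params)
```

Values are kept as strings; converting them to numbers is left to the caller, since the expected type depends on the keyword.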

B.2 ISE-dock parameters and their default values

Following are the default parameters of ISE-dock. Parameters that are common to AutoDock are not listed here.

# ISE docking parameters
ise_sample_size -50             # sample size. negative values mean that
                                # the size will be the product of current pool depth and
                                # the absolute value of this parameter

ise_conf_in_h_l -2              # number of conformations in the
                                # highest- and lowest- energy subsets. negative values
                                # mean that the size will be the product of current pool
                                # depth and the absolute value of this parameter

ise_output_size 40              # number of solutions in the final
                                # docking set

ise_z_value 3.84                # statistical value that determines
                                # the rigidity of the elimination process

ise_elimination_fraction 0.1    # limit the number of values
                                # that can be eliminated from any given gene

ise_threshold 1e5               # threshold to switch from the
                                # stochastic to the exhaustive search

ise_method stochastic           # one of the following:
                                # stochastic exhaustive

ise_pool_file <use_dpf>         # if file name is specified,
                                # read the initial pool from it if "<use_dpf>", then
                                # use the *grid parameters listed below to initialize
                                # the possibilities pool

ise_t_grid 1.5                  # translation grid
ise_r_grid 6                    # rotation grid
ise_d_grid 6                    # dihedral torsions grid

ise_optimize_solution FALSE     # perform local
                                # optimization on the final docking solution

ise_optimize_on_elimination TRUE    # perform local
                                # optimization during the elimination phase. use the
                                # value of ls_search_freq parameter for probability
                                # of performing local search

ise_optimize_on_exhaustive_freq 0.6    # probability
                                # of local search during the exhaustive phase

set_ise                         # set the above parameters

# Perform ISE docking
ise_run

# Perform Cluster Analysis
analysis                        # do cluster analysis on results
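The sign convention described for ise_sample_size and ise_conf_in_h_l can be illustrated with a short sketch. It only restates the rule given in the comments above (a negative value is multiplied, in absolute value, by the current pool depth); the function name is hypothetical, the pool depth is treated as a plain input, and the value 12 is an arbitrary example rather than a figure from this work.

```python
def effective_size(value, pool_depth):
    """Resolve a size parameter following the sign convention in the listing.

    Positive values are used as they are; negative values are replaced by
    abs(value) * pool_depth.
    """
    if value == 0:
        raise ValueError("size parameter must be non-zero")
    return value if value > 0 else abs(value) * pool_depth


# Default values from the listing; pool_depth=12 is an arbitrary illustration
sample_size = effective_size(-50, pool_depth=12)
subset_size = effective_size(-2, pool_depth=12)
output_size = effective_size(40, pool_depth=12)
print(sample_size, subset_size, output_size)
```

With this convention the sample and subset sizes shrink automatically as values are eliminated and the pool depth decreases, while a positive setting stays fixed.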

Appendix C

Detailed Results

C.1 Flexible Ligand – Rigid Protein docking results

Table C.1.

       Top scoring pose           Best RMSD, top 20          Best RMSD, all available
CODE   ISE   LGA   Glide  GOLD    ISE   LGA   Glide  GOLD    ISE   LGA

13gs 1.86 2.30 2.81 1.52 0.46 0.72 2.69 1.09 0.25 0.58

1a42 1.65 3.30 1.47 5.28 0.47 0.97 1.47 2.26 0.47 0.79

1a4k 1.88 1.91 2.29 2.33 1.50 1.54 1.38 1.81 0.76 1.46

1a8t 2.27 3.51 1.11 4.69 0.86 0.80 1.11 2.07 0.85 0.71

1afq 2.07 2.93 1.12 1.35 1.06 1.01 0.53 1.35 1.06 1.01

1atl 3.21 3.04 2.10 1.55 0.95 1.22 1.46 1.55 0.92 1.04

1azm 2.33 2.81 2.04 2.60 1.97 2.17 1.24 0.66 0.54 1.97

1bnw 3.93 4.21 4.36 4.88 1.03 3.02 1.35 4.30 0.61 1.12

1bqo 0.92 0.61 1.60 1.55 0.72 0.51 1.60 1.35 0.72 0.48

1br6 1.85 1.85 3.51 1.82 1.64 1.83 1.69 0.63 0.44 1.82

1cet 2.05 4.21 3.05 8.52 1.71 1.88 2.80 5.30 0.75 1.81

1cim 1.16 1.16 1.54 1.30 0.66 0.65 1.34 1.03 0.23 0.58


1d3p 1.32 3.91 2.40 4.03 1.03 0.86 1.61 1.57 0.91 0.85

1d4p 0.98 1.56 2.35 2.69 0.74 0.86 0.74 0.99 0.50 0.79

1d6v 2.31 2.50 4.06 4.08 1.79 2.36 2.01 1.68 0.97 2.19

1efy 2.53 4.45 1.95 2.88 1.98 2.03 0.38 0.69 0.52 1.95

1ela 1.15 1.55 0.75 1.25 1.14 0.87 0.75 1.06 1.08 0.87

1etr 1.71 0.66 1.49 2.60 1.19 0.66 1.15 2.18 1.01 0.66

1ett 2.55 4.59 0.92 4.37 0.85 0.72 0.65 1.29 0.85 0.72

1eve 1.52 2.58 1.94 2.39 0.58 0.59 1.15 1.03 0.51 0.52

1exa 0.52 0.46 0.43 0.41 0.36 0.44 0.43 0.41 0.23 0.41

1ezq 2.65 2.19 10.63 2.25 1.68 1.06 4.30 1.10 1.58 1.02

1f0r 1.53 1.66 8.72 3.19 0.80 0.62 1.90 1.23 0.80 0.62

1f0t 1.24 4.84 2.26 2.12 0.84 0.89 1.60 2.06 0.84 0.89

1f4e 3.92 3.92 1.23 1.75 2.46 1.73 1 1.55 0.56 1.36

1fcx 0.58 0.58 0.48 0.74 0.50 0.55 0.48 0.49 0.20 0.53

1fcz 0.57 0.59 0.77 0.91 0.45 0.54 0.52 0.50 0.24 0.49

1fjs 1.49 1.59 5.04 2.12 1.31 0.73 3.44 1.44 1.31 0.73

1fkg 1.07 1.20 1.75 4.18 0.93 0.93 1.67 4.05 0.93 0.93

1fm6 2.84 0.40 0.64 0.68 0.69 0.35 0.64 0.65 0.69 0.35

1fm9 1.72 1.60 1.74 3.38 1.21 0.85 1.74 1.49 1.17 0.85

1g4o 3.70 3.99 2.15 4.59 2.21 2.92 1.62 0.81 0.58 2.44

1h1p 4.08 3.72 0.65 1.21 1.35 1.35 0.65 0.52 0.38 1.31

1h1s 0.80 0.62 0.97 1.16 0.61 0.42 0.97 1.16 0.58 0.36

1h9u 0.59 0.53 0.82 1.12 0.33 0.47 0.48 1.03 0.33 0.35

1hdq 1.07 1.88 2.16 3.67 0.55 0.84 0.62 0.84 0.37 0.77

1hfc 1.55 4.47 2.37 2.34 1.40 0.98 1 0.61 1.34 0.98

1hpv 1.11 1.73 1.20 9.47 1.01 0.88 1.19 1.38 1.01 0.88







1htf 2.55 1.64 10.12 10.19 1.53 0.59 1.99 3.13 1.49 0.59

1i7z 0.87 1.02 0.60 0.86 0.45 0.82 0.44 0.82 0.45 0.38

1i8z 0.72 1.92 3.82 3.66 0.55 0.74 2.55 2.69 0.39 0.63

1if7 3.65 4.40 1.43 5.42 1.64 3.65 1.34 1.65 0.87 2.74

1iy7 0.96 1.04 1.16 0.91 0.75 0.99 0.99 0.59 0.75 0.77

1jsv 0.88 1.25 5.45 6.94 0.74 0.71 3.40 5.36 0.69 0.71

1k1j 4.11 1.47 5.88 6.54 1.59 1.23 4.48 3.24 1.57 1.23

1k22 1.69 0.55 0.74 1.03 1.06 0.42 0.74 0.72 1.06 0.41

1k7e 0.88 0.74 0.72 0.96 0.56 0.53 0.68 0.53 0.21 0.31

1k7f 0.79 0.77 2.02 0.84 0.69 0.68 0.51 0.76 0.69 0.66

1kv1 1.21 1.21 0.66 0.81 0.70 1.14 0.59 0.56 0.27 0.66

1kv2 0.73 0.78 1.63 0.80 0.58 0.69 0.91 0.74 0.52 0.63

1l8g 1.33 1.60 2.90 2.17 0.74 1.50 1.57 2.17 0.70 1.16

1lqd 0.89 0.39 1.93 0.65 0.74 0.31 1.93 0.45 0.74 0.31

1m48 1.89 1.12 0.68 1.64 1.10 0.55 0.68 1.12 1.10 0.55

1mmb 2.11 2.12 3.18 6.11 1.79 1.32 1.16 1.37 1.64 1.32

1mnc 3.96 0.69 0.36 1.95 1.53 0.60 0.36 1.38 1.21 0.60

1nhu 3.38 3.51 6.07 5.17 1.02 1.07 3.16 3.75 0.69 1.07

1nhv 3.26 4.68 6.57 8.95 1.35 1.76 5.96 4.45 1.04 1.76

1o86 3.46 1.25 1.06 1.85 1.80 1.25 0.97 0.99 1.54 1.25

1ppc 1.60 1.59 1.69 1.76 1.37 1.20 1.62 1.76 1.30 1.20

1pph 3.39 2.38 5.09 4.95 1.36 1.42 1.09 0.88 1.02 1.42

1qbu 0.97 0.72 10.36 2.59 0.86 0.66 10.36 2.59 0.86 0.66

1qhi 0.66 0.69 0.30 0.66 0.51 0.58 0.30 0.41 0.31 0.55

1qpe 0.63 0.67 1.50 0.52 0.44 0.47 0.52 0.34 0.25 0.45

1r09 5.99 5.95 0.82 1.81 1.85 1.50 0.82 0.53 0.49 1.21







1thl 2.88 2.12 8.54 10.08 1.72 1.15 1.78 2.12 1.11 1.15

1uvt 0.85 0.60 0.44 1.47 0.66 0.49 0.44 0.54 0.66 0.49

1ydr 1.51 0.65 1.56 2.52 0.53 0.62 0.67 2.52 0.32 0.57

1yds 0.69 0.66 0.50 0.55 0.54 0.60 0.50 0.55 0.49 0.60

2cgr 0.79 0.85 0.85 6.54 0.62 0.73 0.67 6.35 0.62 0.66

2pcp 1 0.99 0.64 3.89 0.30 0.96 0.62 1.08 0.30 0.95

2qwi 0.56 0.71 0.70 1.30 0.37 0.60 0.70 0.96 0.37 0.51

3cpa 0.84 0.85 0.79 0.73 0.69 0.62 0.53 0.60 0.69 0.61

3erk 0.59 0.72 0.44 1.42 0.25 0.64 0.44 0.63 0.21 0.64

3ert 1.14 1.44 4.66 4.74 0.88 1.03 2.48 2.39 0.88 0.90

3std 0.60 0.56 2.44 0.85 0.40 0.48 2.44 0.85 0.39 0.35

3tmn 0.66 3.09 8.07 7.59 0.54 0.58 3.18 3.90 0.48 0.58

4dfr 1.10 1.01 1.27 1.20 0.74 0.81 1.10 1.18 0.72 0.81

5std 0.52 0.47 0.73 0.86 0.34 0.42 0.73 0.58 0.28 0.40

5tln 1.73 3.82 9.67 6.52 1.11 0.88 1.20 1.01 1.11 0.88

7est 0.84 0.79 1.02 3.76 0.75 0.63 0.82 0.87 0.75 0.63

966c 1.05 0.70 2.44 2.42 0.81 0.55 2.21 2.34 0.81 0.55

Table C.1: Detailed docking results of the flexible ligand – rigid protein data set. RMSD [A]


C.2 Flexible ligand – rigid protein docking energy landscapes

Following are the energy vs RMSD plots for ISE-dock and AutoDock of all the 81 complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.


Figure C.1: Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex. Continued on the following figures.

Figure C.2: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.

Figure C.3: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.

Figure C.4: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.

Figure C.5: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.

Figure C.6: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.

Figure C.7: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.

Appendix D

Flexible ligand – flexible protein docking. Trypsin data set

Table D.1

Ligand only atoms All movable atoms

ligand protein top1 top20 top4096 top1 top20 top4096

1ppc 1ppc 1.72 0.87 0.87 1.84 1.26 1.23

1ppc 1pph 3.38 2.48 1.42 2.8 2.17 1.6

1ppc 1tng 2.84 1.44 1.27 2.59 1.48 1.48

1ppc 1tnh 3.02 1.91 1.59 2.56 1.83 1.6

1ppc 1tni 2.59 1.53 1.08 2.3 1.53 1.31

1ppc 1tnj 1.99 1.3 1.25 2.21 1.73 1.64

1ppc 1tnk 2.73 1.13 1.02 2.48 1.41 1.4

1ppc 1tnl 3.05 1.34 1.34 2.85 1.74 1.65

1ppc 1tpp 3.44 1.6 1.33 3.02 2.01 1.69

1ppc 3ptb 2.49 1.69 1.45 2.4 1.94 1.84

1pph 1ppc 3.86 2.38 2.04 3.86 2.38 2.04

1pph 1pph 4.66 2.14 1.7 4.66 2.14 1.7

1pph 1tng 4.56 1.97 1.69 4.56 1.97 1.69

1pph 1tnh 4.31 2.21 1.74 4.31 2.21 1.74

1pph 1tni 4.5 1.99 1.57 4.5 1.99 1.57

1pph 1tnj 3.88 2.01 1.59 3.88 2.01 1.59

1pph 1tnk 4.27 2.7 1.79 4.27 2.7 1.79





1pph 1tnl 2.77 1.96 1.42 2.77 1.96 1.42

1pph 1tpp 4.46 2.51 1.9 4.46 2.51 1.9

1pph 3ptb 3.57 2.22 1.76 3.57 2.22 1.76

1tng 1ppc 0.97 0.64 0.43 1.85 1.55 1.03

1tng 1pph 1.12 0.99 0.96 1.92 1.75 1.6

1tng 1tng 0.53 0.43 0.28 1.58 1.27 0.89

1tng 1tnh 0.64 0.54 0.4 1.69 1.32 1.09

1tng 1tni 0.99 0.63 0.63 1.9 1.47 1.25

1tng 1tnj 0.77 0.54 0.42 2.07 1.4 1.01

1tng 1tnk 0.9 0.59 0.55 1.91 1.29 1.08

1tng 1tnl 0.62 0.5 0.38 1.28 1.03 0.85

1tng 1tpp 1.04 0.61 0.53 2.34 1.75 1.58

1tng 3ptb 1 0.78 0.66 2.25 2.04 1.86

1tnh 1ppc 3.36 1.56 1.3 3.36 1.56 1.3

1tnh 1pph 4.36 1.5 1.08 4.36 1.5 1.08

1tnh 1tng 2.82 1.36 1.15 2.82 1.36 1.15

1tnh 1tnh 3.39 1.4 1.18 3.39 1.4 1.18

1tnh 1tni 2.56 1.41 1.17 2.56 1.41 1.17

1tnh 1tnj 2.08 1.31 1.11 2.08 1.31 1.11

1tnh 1tnk 3.51 1.43 1.09 3.51 1.43 1.09

1tnh 1tnl 2.07 1.41 1.23 2.07 1.41 1.23

1tnh 1tpp 2.12 1.45 1.21 2.12 1.45 1.21

1tnh 3ptb 2.78 1.82 1.27 2.78 1.82 1.27

1tni 1ppc 2.95 2 1.29 2.95 2 1.29

1tni 1pph 3.09 1.85 1.39 3.09 1.85 1.39

1tni 1tng 2.83 1.6 1.19 2.83 1.6 1.19

1tni 1tnh 2.32 1.68 1.35 2.32 1.68 1.35

1tni 1tni 2.53 1.71 1.3 2.53 1.71 1.3

1tni 1tnj 2.31 1.81 1.2 2.31 1.81 1.2

1tni 1tnk 4.12 1.64 1.2 4.12 1.64 1.2






1tni 1tnl 3.81 1.5 1.14 3.81 1.5 1.14

1tni 1tpp 4.12 1.89 1.5 4.12 1.89 1.5

1tni 3ptb 2.66 1.85 1.22 2.66 1.97 1.22

1tnj 1ppc 3.99 2.17 1.27 3.99 2.17 1.27

1tnj 1pph 3.24 1.84 1.35 3.24 1.84 1.35

1tnj 1tng 3.85 1.48 1.15 3.85 1.49 1.15

1tnj 1tnh 2.5 1.49 1.19 2.5 1.49 1.19

1tnj 1tni 3.25 1.68 1.29 3.25 1.68 1.29

1tnj 1tnj 2.34 1.44 1.14 2.34 1.44 1.14

1tnj 1tnk 2.77 1.43 1.07 2.77 1.43 1.07

1tnj 1tnl 2.46 1.49 1.04 2.46 1.49 1.04

1tnj 1tpp 2.2 1.72 1.3 2.2 1.72 1.3

1tnj 3ptb 3.53 1.67 1.22 3.53 1.67 1.22

1tnk 1ppc 4.74 1.95 1.62 4.74 1.95 1.62

1tnk 1pph 4.28 1.79 1.42 4.28 1.79 1.42

1tnk 1tng 2.9 1.66 1.39 2.9 1.66 1.39

1tnk 1tnh 3.62 1.56 1.17 3.62 1.56 1.17

1tnk 1tni 2.66 1.61 1.42 2.66 1.61 1.42

1tnk 1tnj 2.6 1.59 1.28 2.6 1.59 1.28

1tnk 1tnk 3.52 1.41 1.25 3.52 1.41 1.25

1tnk 1tnl 4.44 1.73 1.43 4.44 1.73 1.43

1tnk 1tpp 3.67 1.75 1.45 3.67 1.75 1.45

1tnk 3ptb 3.09 1.65 1.37 3.09 1.65 1.37

1tnl 1ppc 4.46 1.86 1.12 4.46 1.86 1.12

1tnl 1pph 2.1 1.5 1.21 2.1 1.5 1.21

1tnl 1tng 3.05 1.34 1.14 3.05 1.34 1.14

1tnl 1tnh 2.56 1.33 1.18 2.56 1.33 1.18

1tnl 1tni 2.67 1.34 1.07 2.67 1.34 1.07

1tnl 1tnj 1.78 1.33 1.23 1.78 1.33 1.23

1tnl 1tnk 2.72 1.38 1.1 2.72 1.38 1.1






1tnl 1tnl 3.29 1.3 1.03 3.29 1.3 1.03

1tnl 1tpp 2.06 1.37 1.37 2.06 1.37 1.37

1tnl 3ptb 3.39 1.38 1.21 3.39 1.38 1.21

1tpp 1ppc 4.85 3.09 2.06 4.84 3.09 2.06

1tpp 1pph 5.58 2.73 1.94 5.58 2.73 1.94

1tpp 1tng 5.15 4.5 2.2 5.15 4.5 2.2

1tpp 1tnh 4.56 3.44 1.77 4.56 3.44 1.77

1tpp 1tni 4.61 4.39 2.49 4.61 4.39 2.49

1tpp 1tnj 4.5 3.54 1.96 4.5 3.54 1.96

1tpp 1tnk 5.53 4.06 1.99 5.53 4.06 1.99

1tpp 1tnl 5.19 4.11 2.74 5.19 4.11 2.74

1tpp 1tpp 4.23 2.61 1.78 4.23 2.61 1.78

1tpp 3ptb 5.97 3.99 2.82 5.97 3.99 2.82

3ptb 1ppc 3.04 1.75 0.98 3.04 1.75 0.98

3ptb 1pph 3.28 1.61 1.11 3.28 1.61 1.11

3ptb 1tng 3.16 2.59 0.97 3.16 2.59 0.97

3ptb 1tnh 3.7 2.3 1.18 3.7 2.3 1.18

3ptb 1tni 3.17 2.35 0.93 3.17 2.35 0.93

3ptb 1tnj 2.7 2.32 1.37 2.7 2.32 1.37

3ptb 1tnk 3.49 1.95 1.2 3.49 1.95 1.2

3ptb 1tnl 3.09 2.38 1.09 3.09 2.38 1.09

3ptb 1tpp 3.14 1.94 0.77 3.14 1.94 0.77

3ptb 3ptb 2.8 2.22 1.24 2.8 2.22 1.24

Table D.1: RMSD [A] of all movable atoms and of ligand atoms only in the trypsin data set, 100 cross docking experiments

List of Figures

1.1 Schematic diagram of the main methods in the drug discovery process. Arrows designate process flow. Black asterisks mark steps that may involve molecular docking. Abbreviations: SAR – structure-activity relationship; QSAR – quantitative SAR; ADME-Tox – absorption, distribution, elimination, toxicity

1.2 Typical shapes of electrostatic interaction energy. The energies of two identical (full line) and opposite (dashed line) charges in vacuum are shown

1.3 Examples of inter- (left) and intra- (right) molecular H-bonds

1.4 Van der Waals interaction energy of argon dimer. Taken from the Wikipedia [113] under the GNU Free Documentation License

1.5 Comparison of Morse (dashed line) and Hooke’s harmonic (full line) potentials of bond stretching energy around the minimum. To construct this graph, all the parameters in equations (1.15) and (1.16) were assigned the value of 1

2.1 “Tearing off” atoms to represent side chain flexibility using phenylalanine as an example. Dummy atoms are marked by the letter “D” in their names. The N, Cα and Cβ atoms on the receptor molecule overlap with their respective dummy counterparts.

2.2 Structural alignment of 456c and 966c. Backbone traces of the proteins are color coded according to the distance (in Å) between the aligned backbone atoms. RS-130830 (red) and RS-104966 (green) are shown as stick models.


2.3 Cross section of AChE complexed with acetylcholine (PDB code: 2ace), colored by (A) partial charge of the atoms and (B) by the residue type (colored by PyMol): hydrophobic (GILMPV) – white, aromatic (FWY) – magenta, semipolar (C) – yellow, polar (HNQST) – cyan, positive (KR) – blue, negative (DE) – red. Acetylcholine is colored blue in both panels.

2.4 AChE complexed with Huperzine A (PDB code: 1vot, light gray) and with Aricept (PDB code: 1eve, dark gray). The ligands and Phe 330 side chains from both complexes are highlighted using sticks.

2.5 Trypsin data set. 10 superimposed trypsin structures: 1ppc, 1pph, 1tng, 1tnh, 1tni, 1tnj, 1tnk, 1tnl, 1tpp and 3ptb. The ligand molecules and the residues that are treated as flexible are shown as sticks. The remaining parts of the proteins are shown as backbone trace.

3.1 Top single docking poses at different RMSD bins with respect to crystal structures, 4 different programs. Results for Glide and GOLD were obtained by Perola et al. [84].

3.2 Top 20 docking poses, RMSD to corresponding crystal structures. Results for Glide and GOLD were obtained by Perola et al. [84].

3.3 Top available docking poses produced in equal CPU times, RMSD to corresponding crystal structures. The numbers of poses are 4096 (ISE) and 35 (LGA).

3.4 Number of iterations before switching to exhaustive search as a function of initial combinatorial size (number of initial combinations).

3.5 A: Energy vs RMSD plot for docking populations of the complex 1yds obtained with ISE, showing a single distinct funnel. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The first 35 solutions (dark lines) docked by ISE vs the ligand in the crystal (gray sticks). Receptor residues with at least one atom within 5.5 Å of the ligand are shown as light gray cartoon. All structures in this work were visualized using PyMol [15].


3.6 A: Energy vs RMSD plot for docking populations of the complex 1bqo obtained with ISE, showing two distinct funnels. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The crystal structure of the ligand (gray sticks) and the first 35 solutions (dark lines) docked by ISE.

3.7 A: Energy vs RMSD plot for docking populations of the complex 1hpv obtained with ISE, showing a scatter of the results. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The crystal structure of the ligand and the first 35 solutions docked by ISE.

3.8 Cumulative fractions (Y-axis) of 81 ISE docking complexes with an energy span between the global minimum of each (pose number 1) and the other 4095 poses, below the given threshold (X-axis).

3.9 Complexes 1kv1 (light gray) and 1kv2 (dark gray) superimposed using backbone atoms. The ligands are shown as sticks and the backbone of the residues closest to the ligand (within 5.5 Å) is shown as PyMol cartoons.

3.10 Energy vs RMSD plot for docking populations obtained by ISE (A) and LGA (B) of the complex 1kv1. The plots are shown using the same scale. The best single ISE solutions at each of the three funnels have ranks 1, 222 and 270 and are marked with arrows.

3.11 The best ISE-dock solution for 1kv1 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). 1kv1 is colored according to: C – cyan, N – blue, Cl – green. 1kv2 is colored according to: C – yellow, N – blue, O – red.

3.12 ISE-dock solution for 1kv1, ranked 222 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical to that of Figure 3.11

3.13 ISE-dock solution for 1kv1, ranked 270 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical to that of Figure 3.11


3.14 Energy vs. RMSD plot for docking populations of the complex 2rox, obtained by ISE (A) and LGA (B). The best single ISE solutions at each of the two funnels have ranks 1 and 2 and are marked with arrows. C: Antiparallel docking solutions ranked 1 and 2 for 2rox (green and magenta sticks respectively). The carbons in the crystal structure of thyroxine are shown as thin sticks colored cyan. The backbone of the residues closest to the ligand (within 5.5 Å) is shown in PyMol cartoon representation colored cyan.

4.1 The best available docking solution for (A) 1eve-1vot and (B) 1vot-1eve in unbound (cross-) docking experiments. The docking solutions for all the movable atoms are shown as lines and the crystal structures are shown as sticks. The protein structures are shown as backbone trace.

4.2 The best available docking solution for (A) 1eve-1eve and (B) 1vot-1vot in bound docking experiments. The docking solutions for all the movable atoms are shown as lines and the crystal structures are shown as sticks. The protein structures are shown as backbone trace.

4.3 Top docking poses at different RMSD bins with respect to crystal structures

C.1 Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex. Continued on the following figures.

C.2 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.







List of Tables

2.1 PDB codes of the 81 complexes in the rigid protein test set

2.2 Affinities to collagenase

3.1 Summary of docking results by ISE, LGA, Glide and GOLD

3.2 Binding modes of 1-(5-tert-butyl-2-methyl-2h-pyrazol-3-yl)-3-(4-chloro-phenyl)-urea (from 1kv1)

4.1 Collagenase data set, best ligands’ RMSD (Å) in top 1, top 20 and all available (4096) solutions. RMSD of the backbone from the crystal position of the corresponding solution is also reported.

4.2 Results of Acetylcholinesterase cross docking

4.3 Torsion RMSD of flexible residues in the trypsin data set

4.4 Trypsin data set, RMSD values of top single docking poses and best docking poses in top 20 and top 4096 solutions

4.5 Current status of protein flexibility handling in ISE-dock and in five popular docking programs (sorted according to the number of citations in 2005 [95])

C.1 Detailed docking results of the flexible ligand – rigid protein data set. RMSD [Å]

D.1 RMSD [Å] of all movable atoms and of ligand atoms only in the trypsin data set, 100 cross docking experiments


Acknowledgments

First of all, I thank Prof. Amiram Goldblum, my supervisor, for the unlimited freedom and trust and for his guidance and support.

This research was supported by the Israel Science Foundation (ISF) grant no. 608/02. I thank the Alex Grass Center for Drug Design and Synthesis for further support, Dr. Emmanuele Perola for sending his data as well as for making useful suggestions, Dr. Anwar Rayan for helpful discussions and Mrs. Efrat Noy for her ideas and for helping with the programming. Dr. Garrett M. Morris was instrumental in solving our problems with AutoDock usage and provided suggestions for improving LGA results.

This work would not have been possible without the support of my wife, Einat, who released me from all my domestic duties and supported me during the preparation of this work.


Bibliography

[1] R Abagyan and M Totrov. High-throughput docking for lead generation. Curr Opin Chem Biol, 5(4):375–82, 2001.

[2] R Abagyan, M Totrov, and D Kuznetsov. ICM – a new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformations. J Comput Chem, 15(5):488–506, 1994.

[3] LM Amzel. Calculation of entropy changes in biological processes: folding, binding, and oligomerization. Methods Enzymol, 323:167–177, 2000.

[4] AC Anderson, RH O’Neil, TS Surti, and RM Stroud. Approaches to solving the rigid receptor problem by identifying a minimal set of flexible residues during ligand docking. Chem Biol, 8(5):445–457, May 2001.

[5] FC Bernstein, TF Koetzle, GJB Williams, EF Meyer, MD Brice, JR Rodgers, O Kennard, T Shimanouchi, and M Tasumi. Protein data bank – computer-based archival file for macromolecular structures. Arch Biochem Biophys, 185(2):584–591, 1978.

[6] A Bialonska and Z Ciunik. Hydrophobic ‘lock and key’ recognition of n-4-nitrobenzoylamino acid by strychnine. Acta Crystallogr B Struct Sci, 62:1061–1070, 2006.

[7] W Cai, X Shao, and B Maigret. Protein-ligand recognition using spherical harmonic molecular surfaces: towards a fast and efficient filter for large virtual throughput screening. J Mol Graph Model, 20(4):313–328, Jan 2002.

[8] CJ Camacho, DW Gatchell, SR Kimura, and S Vajda. Scoring docked conformations generated by rigid-body protein-protein docking. Proteins, 40(3):525–537, Aug 2000.

[9] MD Cameron, B Wen, KE Allen, AG Roberts, JT Schuman, AP Campbell, KL Kunze, and SD Nelson. Cooperative binding of midazolam with testosterone and alpha-naphthoflavone within the CYP3A4 active site: a NMR T1 paramagnetic relaxation study. Biochemistry, 44(43):14143–14151, Nov 2005.


[10] HA Carlson. Protein flexibility and drug design: how to hit a moving target. Curr Opin Chem Biol, 6(4):447–452, Aug 2002.

[11] C Catana and PFW Stouten. Novel, customizable scoring functions, parameterized using n-pls, for structure-based drug discovery. J Chem Inf Model, 47(1):85–91, 2007.

[12] H Claussen, C Buning, M Rarey, and T Lengauer. Flexe: efficient molecular docking considering protein structure variations. J Mol Biol, 308(2):377–395, 2001.

[13] JC Cole, CW Murray, JW Nissink, RD Taylor, and R Taylor. Comparing protein-ligand docking programs is difficult. Proteins, 60(3):325–332, Aug 2005.

[14] WD Cornell, P Cieplak, CI Bayly, IR Gould, KM Merz, DM Ferguson, DC Spellmeyer, T Fox, JW Caldwell, and PA Kollman. Second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc, 117:5179–5197, 1995.

[15] WL DeLano. The PyMol molecular graphics system. DeLano Scientific LLC, San Carlos, CA, USA.

[16] KA Dill and HS Chan. From Levinthal to pathways to funnels. Nat Struct Biol, 4(1):10–19, Jan 1997.

[17] OA Donini and PA Kollman. Calculation and prediction of binding free energies for the matrix metalloproteinases. J Med Chem, 43(22):4180–4188, Nov 2000.

[18] M Ekroos and T Sjogren. Structural basis for ligand promiscuity in cytochrome P450 3A4. Proc Natl Acad Sci U S A, 103(37):13682–13687, Sep 2006.

[19] AM Ferrari, BQ Wei, L Costantino, and BK Shoichet. Soft docking and multiple receptor conformations in virtual screening. J Med Chem, 47(21):5076–5084, Oct 2004.

[20] D Fischer, SL Lin, HL Wolfson, and R Nussinov. A geometry-based suite of molecular docking processes. J Mol Biol, 248(2):459–477, Apr 1995.

[21] E Fischer. Einfluss der configuration auf die wirkung der enzyme. Ber Dt Chem Ges, 27:2985–2993, 1894.

[22] E Freire. The propagation of binding interactions to remote sites in proteins: Analysis of the binding of the monoclonal antibody d1.3 to lysozyme. Proc Natl Acad Sci U S A, 96(18):10118–10122, 1999.


[23] RA Friesner, JL Banks, RB Murphy, TA Halgren, JJ Klicic, DT Mainz, MP Repasky, EH Knoll, M Shelley, JK Perry, DE Shaw, P Francis, and PS Shenkin. Glide: a new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy. J Med Chem, 47(7):1739–1749, Mar 2004.

[24] RA Friesner, RB Murphy, MP Repasky, LL Frye, JR Greenwood, TA Halgren, PC Sanschagrin, and DT Mainz. Extra precision Glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes. J Med Chem, 49(21):6177–6196, Oct 2006.

[25] HA Gabb, RM Jackson, and MJ Sternberg. Modelling protein docking using shape complementarity, electrostatics and biochemical information. J Mol Biol, 272(1):106–120, Sep 1997.

[26] P Gadakar, S Phukan, P Dattatreya, and V Balaji. Pose prediction accuracy in docking studies and enrichment of actives in the active site of gsk-3beta. J Chem Inf Model, Jun 2007.

[27] L Gales, S Macedo-Ribeiro, G Arsequell, G Valencia, MJ Saraiva, and AM Damas. Human transthyretin in complex with iododiflunisal: structural features associated with a potent amyloid inhibitor. Biochem J, 388(2):615–621, Jun 2005.

[28] J Gasteiger and M Marsili. Iterative partial equalization of orbital electronegativity – a rapid access to atomic charges. Tetrahedron, 36(22):3219–3228, 1980.

[29] F Glaser, DM Steinberg, IA Vakser, and N Ben-Tal. Residue frequencies and pairing preferences at protein-protein interfaces. Proteins, 43(2):89–102, May 2001.

[30] M Glick and A Goldblum. A novel energy-based stochastic method for positioning polar protons in protein structures from x-rays. Proteins, 38(3):273–287, Feb 2000.

[31] M Glick, A Rayan, and A Goldblum. A stochastic algorithm for global optimization and for best populations: a test case of side chains in proteins. Proc Natl Acad Sci U S A, 99(2):703–708, Jan 2002.

[32] DS Goodsell, GM Morris, and AJ Olson. Automated docking of flexible ligands: applications of autodock. J Mol Recognit, 9(1):1–5, Jan-Feb 1996.

[33] DS Goodsell and AJ Olson. Automated docking of substrates to proteins by simulated annealing. Proteins, 8(3):195–202, 1990.


[34] I Halperin, BY Ma, H Wolfson, and R Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins, 47(4):409–443, 2002.

[35] JA Hamilton and MD Benson. Transthyretin: a review from a structural perspective. Cell Mol Life Sci, 58(10):1491–1521, Sep 2001.

[36] JA Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., 1975.

[37] D Herschlag. The role of induced fit and conformational-changes of enzymes in specificity and catalysis. Bioorg Chem, 16(1):62–96, 1988.

[38] TL Hill. Steric effects. I. Van der Waals potential energy curves. J Chem Phys, 16:399, 1948.

[39] X Hu and WH Shelver. Docking studies of matrix metalloproteinase inhibitors: zinc parameter optimization to improve the binding free energy prediction. J Mol Graph Model, 22(2):115–126, Nov 2003.

[40] MN James, A Sielecki, F Salituro, DH Rich, and T Hofmann. Conformational flexibility in the active sites of aspartyl proteinases revealed by a pepstatin fragment binding to penicillopepsin. Proc Natl Acad Sci U S A, 79(20):6137–6141, Oct 1982.

[41] J Janin and C Chothia. The structure of protein-protein recognition sites. J Biol Chem, 265(27):16027–16030, Sep 1990.

[42] G Jones, P Willett, RC Glen, AR Leach, and R Taylor. Development and validation of a genetic algorithm for flexible docking. J Mol Biol, 267(3):727–48, 1997.

[43] A Kahraman, RJ Morris, RA Laskowski, and JM Thornton. Shape variation in protein binding pockets and their ligands. J Mol Biol, 368(1):283–301, Apr 2007.

[44] P Kallblad, RL Mancera, and NP Todorov. Assessment of multiple binding modes in ligand-protein docking. J Med Chem, 47(13):3334–3337, Jun 2004.

[45] CD Kirkpatrick. Optimization by simulated annealing. Science, 220:671–680, 1983.

[46] RM Knegtel, ID Kuntz, and CM Oshiro. Molecular docking to ensembles of protein structures. J Mol Biol, 266(2):424–440, Feb 1997.

[47] RMA Knegtel, DM Bayada, RA Engh, W von der Saal, VJ van Geerestein, and PDJ Grootenhuis. Comparison of two implementations of the incremental construction algorithm in flexible docking of thrombin inhibitors. Angew Chem Int Ed, 13(2):167–183, 1999.


[48] DE Koshland. Application of a theory of enzyme specificity to protein synthesis. Proc Natl Acad Sci U S A, 44(2):98–104, Feb 1958.

[49] B Kramer, M Rarey, and T Lengauer. Evaluation of the FLEXX incremental construction algorithm for protein-ligand docking. Proteins, 37(2):228–241, Nov 1999.

[50] RT Kroemer, A Vulpetti, JJ McDonald, DC Rohrer, JY Trosset, F Giordanetto, S Cotesta, C McMartin, M Kihlen, and PFW Stouten. Assessment of docking poses: interactions-based accuracy classification (IBAC) versus crystal structure deviations. J Chem Inf Comput Sci, 44(3):871–881, 2004.

[51] M Kumar and MV Hosur. Adaptability and flexibility of HIV-1 protease. Eur J Biochem, 270(6):1231–1239, 2003.

[52] S Kumar, B Ma, CJ Tsai, N Sinha, and R Nussinov. Folding and binding cascades: dynamic landscapes and population shifts. Protein Sci, 9(1):10–19, Jan 2000.

[53] ID Kuntz, JM Blaney, SJ Oatley, R Langridge, and TE Ferrin. A geometric approach to macromolecule-ligand interactions. J Mol Biol, 161(2):269–88, Oct 1982.

[54] AR Leach. Molecular Modelling. Principles and Applications, chapter Empirical Force Field Models: Molecular Mechanics, pages 165–252. Prentice Hall, 2001.

[55] BM Lee, J Xu, BK Clarkson, MA Martinez-Yamout, HJ Dyson, DA Case, JM Gottesfeld, and PE Wright. Induced fit and “lock and key” recognition of 5S RNA by zinc fingers of transcription factor IIIA. J Mol Biol, 357(1):275–291, 2006.

[56] PE Leopold, M Montal, and JN Onuchic. Protein folding funnels: A kinetic approach to the sequence-structure relationship. Proc Natl Acad Sci U S A, 89(18):8721–8725, Sep 1992.

[57] PJ Lewis, M de Jonge, F Daeyaert, L Koymans, M Vinkers, J Heeres, PAJ Janssen, E Arnold, K Das, AD Clark, SH Hughes, PL Boyer, M Bethune, R Pauwels, K Andries, M Kukla, and D Ludovici. On the detection of multiple-binding modes of ligands to proteins, from biological, structural, and modeling data. J Comput Aided Mol Des, 17(2–4):129–134, 2003.

[58] JH Lii and NL Allinger. Directional hydrogen bonding in the MM3 force field. J Comput Chem, 19(9):1001–1016, 1998.

[59] B Lovejoy, AR Welch, S Carr, C Luong, C Broka, T Hendricks, et al. Crystal structures of MMP-1 and -13 reveal the structural basis for selectivity of collagenase inhibitors. Nat Struct Biol, 6(3):217–221, 1999.


[60] H Lu, J Macosko, D Habel-Rodriguez, RW Keller, JA Brozik, and DJ Keller. Closing of the fingers domain generates motor forces in the hiv reverse transcriptase. J Biol Chem, 279(52):54529–54532, Dec 2004.

[61] BH Luo, TA Springer, and J Takagi. High affinity ligand binding by integrins does not involve head separation. J Biol Chem, 278(19):17185–17189, May 2003.

[62] B Ma, S Kumar, CJ Tsai, and R Nussinov. Folding funnels and binding mechanisms. Protein Eng, 12(9):713–720, Sep 1999.

[63] B Ma, M Shatsky, HJ Wolfson, and R Nussinov. Multiple diverse ligands binding at a single protein site: A matter of pre-existing populations. Protein Sci, 11(2):184–197, 2002.

[64] AD Mackerell. Empirical force fields for biological macromolecules: overview and issues. J Comput Chem, 25(13):1584–1604, Oct 2004.

[65] AD Mackerell, D Bashford, M Bellott, RL Dunbrack, JD Evanseck, MJ Field, S Fischer, J Gao, H Guo, S Ha, D Joseph-Mccarthy, L Kuchnir, K Kuczera, FTK Lau, C Mattos, S Michnick, T Ngo, DT Nguyen, B Prodhom, WE Reiher, B Roux, M Schlenkrich, JC Smith, R Stote, J Straub, M Watanabe, J Wiorkiewicz-Kuczera, D Yin, and M Karplus. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J Phys Chem B, 102(18):3586–3616, Apr 1998.

[66] TG Marshall, RE Lee, and FE Marshall. Common angiotensin receptor blockers may directly modulate the immune system via VDR, PPAR and CCR2b. Theor Biol Med Model, 3:1, 2006.

[67] BW Matthews. Protein structure initiative: getting into gear. Nat Struct Mol Biol, 14(6):459–460, Jun 2007.

[68] C McMartin and RS Bohacek. QXP: powerful, rapid computer algorithms for structure-based drug design. J Comput Aided Mol Des, 11(4):333–344, Jul 1997.

[69] S Miyazawa and RL Jernigan. A new substitution matrix for protein sequence searches based on contact frequencies in protein structures. Protein Eng, 6(3):267–278, Apr 1993.

[70] GM Morris, DS Goodsell, RS Halliday, R Huey, WE Hart, RK Belew, and AJ Olson. Automated docking using a lamarckian genetic algorithm and an empirical binding free energy function. J Comput Chem, 19(14):1639–1662, 1998.


[71] GM Morris, DS Goodsell, R Huey, and AJ Olson. Distributed automated docking of flexible ligands to proteins: Parallel applications of autodock 2.4. J Comput Aided Mol Des, 10(4):293–304, 1996.

[72] A Murcko and MA Murcko. Computational methods to predict binding free energy in ligand-receptor complexes. J Med Chem, 38(26):4953–4967, Dec 1995.

[73] R Najmanovich, J Kuttner, V Sobolev, and M Edelman. Side-chain flexibility in proteins upon ligand binding. Proteins, 39(3):261–268, May 2000.

[74] R Norel, SL Lin, HJ Wolfson, and R Nussinov. Shape complementarity at protein-protein interfaces. Biopolymers, 34(7):933–940, Jul 1994.

[75] J Norvell and JM Berg. The protein structure initiative, five years later. Scientist, 19(20):30–31, 2005.

[76] E Noy, T Tabakman, and A Goldblum. Constructing ensembles of flexible fragments in native proteins by iterative stochastic elimination is relevant to protein-protein interfaces. Proteins, 68:702–711, 2007.

[77] R Nussinov and HJ Wolfson. Efficient computational algorithms for docking and for generating and matching a library of functional epitopes i. rigid and flexible hinge-bending docking algorithms. Comb Chem High Throughput Screen, 2(5):249–59, 1999.

[78] R Nussinov and HJ Wolfson. Efficient computational algorithms for docking and for generating and matching a library of functional epitopes ii. computer vision-based techniques for the generation and utilization of functional epitopes. Comb Chem High Throughput Screen, 2(5):261–269, Oct 1999.

[79] VD Ozrin, MV Subbotin, and SM Nikitin. Plass: protein-ligand affinity statistical score – a knowledge-based force-field model of interaction derived from the pdb. J Comput Aided Mol Des, 18(4):261–270, Apr 2004.

[80] C Pargellis, L Tong, L Churchill, PF Cirillo, T Gilmore, AG Graham, PM Grob, ER Hickey, N Moss, S Pav, and J Regan. Inhibition of p38 map kinase by utilizing a novel allosteric binding site. Nat Struct Biol, 9(4):268–272, Apr 2002.

[81] P De La Paz, JM Burridge, SJ Oatley, and CCF Blake. Multiple modes of binding of thyroid hormones and other iodothyronines to human plasma transthyretin, pages 119–172. 1992.

[82] DA Pearlman. Free Energy Calculations in Rational Drug Design, chapter Theory, pages 9–35. Springer, 2001.


[83] E Perola and PS Charifson. Conformational analysis of drug-like molecules bound to proteins: an extensive study of ligand reorganization upon binding. J Med Chem, 47(10):2499–2510, May 2004.

[84] E Perola, WP Walters, and PS Charifson. A detailed comparison of current docking and scoring methods on systems of pharmaceutical relevance. Proteins, 56(2):235–249, Aug 2004.

[85] M Rarey, B Kramer, and T Lengauer. The particle concept: placing discrete water molecules during protein-ligand docking predictions. Proteins, 34(1):17–28, 1999.

[86] A Rayan, E Noy, D Chema, A Levitzki, and A Goldblum. Stochastic algorithm for kinase homology model construction. Curr Med Chem, 11:675–692, 2004.

[87] A Rayan, H Senderowitz, and A Goldblum. Exploring the conformational space of cyclic peptides by a stochastic search method. J Mol Graph Model, 22(5):319–333, May 2004.

[88] TJ Rydel, A Tulinsky, W Bode, and R Huber. Refined structure of the hirudin-thrombin complex. J Mol Biol, 221(2):583–601, Sep 1991.

[89] B Sandak, R Nussinov, and HJ Wolfson. An automated computer vision and robotics-based technique for 3-d flexible biomolecular docking and matching. Comput Appl Biosci, 11(1):87–99, Feb 1995.

[90] B Sandak, R Nussinov, and HJ Wolfson. A method for biomolecular structural recognition and docking allowing conformational flexibility. J Comput Biol, 5(4):631–654, 1998.

[91] DM Schulz, C Ihling, GM Clore, and A Sinz. Mapping the topology and determination of a low-resolution three-dimensional structure of the calmodulin-melittin complex by chemical cross-linking and high-resolution fticrms: direct demonstration of multiple binding modes. Biochemistry, 43(16):4703–4715, Apr 2004.

[92] J Singh, Z Deng, G Narale, and C Chuaqui. Structural interaction fingerprints: a new approach to organizing, mining, analyzing, and designing protein-small molecule complexes. Chem Biol Drug Des, 67(1):5–12, Jan 2006.

[93] FJ Solis and RJ-B Wets. Minimization by random search techniques. Math Oper Res, 6:19–30, 1981.

[94] CA Sotriffer and I Dramburg. “In situ cross-docking” to simultaneously address multiple targets. J Med Chem, 48(9):3122–3125, May 2005.


[95] SF Sousa, PA Fernandes, and MJ Ramos. Protein-ligand docking: current status and future challenges. Proteins, 65(1):15–26, Oct 2006.

[96] RD Taylor, PJ Jewsbury, and JW Essex. FDS: flexible ligand and receptor docking with a continuum solvent model and soft-core energy function. J Comput Chem, 24(13):1637–1656, Oct 2003.

[97] SJ Teague. Implications of protein flexibility for drug discovery. Nat Rev Drug Discov, 2(7):527–541, Jul 2003.

[98] GE Terp, IT Christensen, and FS Jørgensen. Structural differences of matrix metalloproteinases. homology modeling and energy minimization of enzyme-substrate complexes. J Biomol Struct Dyn, 17(6):933–946, Jun 2000.

[99] A Tovchigrechko and IA Vakser. How common is the funnel-like energy landscape in protein-protein interactions? Protein Sci, 10(8):1572–1583, Aug 2001.

[100] CJ Tsai, S Kumar, B Ma, and R Nussinov. Folding funnels, binding funnels, and protein function. Protein Sci, 8(6):1181–1190, Jun 1999.

[101] S Vajda, Z Weng, R Rosenfeld, and C DeLisi. Effect of conformational flexibility and solvation on receptor-ligand binding free energies. Biochemistry, 33(47):13977–13988, Nov 1994.

[102] IA Vakser. Low-resolution docking: prediction of complexes for underdetermined structures. Biopolymers, 39(3):455–464, Sep 1996.

[103] IA Vakser, OG Matar, and CF Lam. A systematic study of low-resolution recognition in protein–protein complexes. Proc Natl Acad Sci U S A, 96(15):8477–8482, Jul 1999.

[104] ADJ van Dijk and AMJ Bonvin. Solvated docking: introducing water into the modelling of biomolecular complexes. Bioinformatics, 22(19):2340–2347, Oct 2006.

[105] GM Verkhivker, PA Rejto, DK Gehlhaar, and ST Freer. Exploring the energy landscapes of molecular recognition by a genetic algorithm: analysis of the requirements for robust docking of hiv-1 protease and fkbp-12 complexes. Proteins, 25(3):342–353, Jul 1996.

[106] DF Wang, O Wiest, P Helquist, HY Lan-Hargest, and NL Wiech. On the function of the 14 Å long internal cavity of histone deacetylase-like protein: implications for the design of histone deacetylase inhibitors. J Med Chem, 47(13):3409–3417, Jun 2004.

[107] J Wang, PA Kollman, and ID Kuntz. Flexible ligand docking: a multistep strategy approach. Proteins, 36(1):1–19, 1999.


[108] J Wang, P Morin, W Wang, and PA Kollman. Use of mm-pbsa in reproducing the binding free energies to hiv-1 rt of tibo derivatives and predicting the binding mode to hiv-1 rt of efavirenz by docking and mm-pbsa. J Am Chem Soc, 123(22):5221–5230, Jun 2001.

[109] R Wang, Y Lu, and S Wang. Comparative evaluation of 11 scoring functions for molecular docking. J Med Chem, 46(12):2287–2303, Jun 2003.

[110] GL Warren, CW Andrews, AM Capelli, B Clarke, J LaLonde, MH Lambert, M Lindvall, N Nevins, SF Semus, S Senger, G Tedesco, ID Wall, JM Woolven, CE Peishoff, and MS Head. A critical assessment of docking programs and scoring functions. J Med Chem, 49(20):5912–5931, Oct 2006.

[111] PK Weiner and PA Kollman. Amber: Assisted model building with energy refinement. a general program for modeling molecules and their interactions. J Comput Chem, 2, 1981.

[112] SJ Weiner, PA Kollman, DA Case, UC Singh, C Ghio, G Alagona, S Profeta, and P Weiner. A new force field for molecular mechanical simulation of nucleic acids and proteins. J Am Chem Soc, 106(3):765–784, 1984.

[113] Wikipedia. Interaction energy of argon dimer.

[114] Z Xiang and B Honig. Extending the accuracy limits of prediction for side-chain conformations. J Mol Biol, 311(2):421–430, Aug 2001.

[115] C Zhang, J Chen, and C DeLisi. Protein-protein recognition: exploring the energy funnels near the binding sites. Proteins, 34(2):255–267, Feb 1999.

[116] L Žídek, MV Novotny, and MJ Stone. Increased protein backbone conformational entropy upon hydrophobic ligand binding. Nat Struct Biol, 6(12):1118–1121, Dec 1999.

Hebrew abstract
