An All-atom Distance-dependent Conditional Probability ...ram.org/compbio/publications/samudrala_1998a.pdf · An All-atom Distance-dependent Conditional Probability Discriminatory

An All-atom Distance-dependent ConditionalProbability Discriminatory Function for ProteinStructure Prediction

Ram Samudrala1,2 and John Moult1*

1Center for Advanced Researchin Biotechnology, University ofMaryland BiotechnologyInstitute, 9600, GudelskyDrive, Rockville, MD 20850USA2Molecular and Cell BiologyProgram, University ofMaryland at College ParkCollege Park, MD 20742, USA

We present a formalism to compute the probability of an amino acidsequence conformation being native-like, given a set of pairwise atom±atom distances. The formalism is used to derive three discriminatoryfunctions with different types of representations for the atom±atom con-tacts observed in a database of protein structures. These functions includetwo virtual atom representations and one all-heavy atom representation.When applied to six different decoy sets containing a range of correctand incorrect conformations of amino acid sequences, the all-atom dis-tance-dependent discriminatory function is able to identify correct fromincorrect more often than the discriminatory functions using approximaterepresentations. We illustrate the importance of using a detailed atomicdescription for obtaining the most accurate discrimination, and the neces-sity for testing discriminatory functions against a wide variety of decoys.The discriminatory function is also shown to be capable of capturing the®ne details of atom±atom preferences. These results suggest that the all-atom distance-dependent discriminatory function will be useful forprotein structure prediction and model re®nement.

# 1998 Academic Press Limited

Keywords: knowledge-based; conditional probability; discriminatoryfunction; decoy sets*Corresponding author

Introduction

Any algorithm that attempts to predict proteinstructure requires a discriminatory function thatcan distinguish between correct and incorrect con-formations. These discriminatory functions can beextremely simple: for example, counting atomic

contacts in a given conformation, or can involveelaborate calculations based on the physics of thesystem to determine the energy of a conformation(Brooks et al., 1983; Weiner et al., 1986; Jorgensen &Tirado-Rives, 1988).

A class of discriminatory functions is knowl-edge-based. These functions compile parametersfrom tendencies observed in a database of exper-imentally determined protein structures (Wodak &Rooman, 1993; Sippl, 1995). Historically, knowl-edge-based discriminatory functions have gainedpopularity in their application to the ``fold recog-nition'' problem, i.e. recognising the fold an aminoacid belongs to in the absence of detectablesequence homology (Sippl, 1990; Bowie et al., 1991;Jones et al., 1992; Bryant & Lawrence, 1993). Sincethen, knowledge-based discriminatory functionshave been used to validate experimentally deter-mined protein structures (LuÈ thy et al., 1992; Sippl,1993; MacArthur et al., 1994), for ab initio proteinstructure prediction (Sun, 1993; Simons et al., 1997),and have proven their worth in bona ®de fold rec-ognition experiments (Lemer et al., 1995; Madejet al., 1995; FloÈckner et al., 1995; Jones et al., 1995;Levitt, 1997).

Abbreviations used: CASP, critical assessment ofprotein structure prediction methods; CDF, contactdiscriminatory function; crabpi, cellular retinoic acidbinding protein I; edn, eosinophil derived neurotoxin;e5.2, immunoglobulin domain protein; GA, geneticalgorithm; hpr, histidine-containing phosphocarrierprotein; IFU, independent folding unit; IRAPDF, linearlyinterpolated residue-speci®c all-atom conditionalprobability discriminatory function; NMR, nuclearmagnetic resonance; mm23, nucleoside diphosphatekinase protein; NVPDF, non-residue-speci®c virtual-atom conditional probability discriminatory function;PDB, Protein Data Bank; PDF, probabilitydiscriminatory function; p450, heme protein; RAPDF,residue-speci®c all-atom conditional probabilitydiscriminatory function; RVPDF, residue-speci®c virtual-atom conditional probability discriminatory function;rmsd, root mean square deviation.

J. Mol. Biol. (1998) 275, 895±916

0022±2836/98/050895±22 $25.00/0/mb971479 # 1998 Academic Press Limited

Generally, knowledge-based discriminatoryfunctions have used a simple one- or two-point-per-residue representation. That is, they usuallyrepresent each residue in a protein sequence withone or two positions in three-dimensional space.Discrimination is based on each residue's prefer-ence to be buried or exposed, its preference for aparticular secondary structure conformation,and/or its preference to be in contact at a particu-lar distance and sequence separation from otherresidues (Sippl, 1990; Bowie et al., 1991; Jones et al.,1992; Bryant & Lawrence, 1993). However, to cap-ture the ®ner details of atom±atom interactions inproteins, a more detailed representation is necess-ary, and two such functions (DeBolt & Skolnick,1996; Subramaniam et al., 1996) have been devel-oped so far. For example, in a comparative model-ing scenario where two possible models can bequite similar (within 1.0 to 3.0 AÊ in terms of rootmean square deviation (rmsd) of the Ca atoms)to the experimentally determined structures(Mosimann et al., 1995), we need all the infor-mation we can possibly obtain from the twomodels to determine which one is more accurate.A one-point-per-residue discriminatory functionmay not be able to discriminate as well as an all-atom discriminatory function, which takes intoaccount the environment of all the atoms on themain-chain and the side-chain of each residue.Also, a detailed all-atom model cannot be builtusing a simple representation.

A major issue in developing any discriminatoryfunction for work with protein molecules is how totest performance. There are three principal strat-egies. Most popular for testing physics-based func-tions has been detailed comparison withexperimental data from small molecule systems(Halgren, 1995). The assumption is that goodresults on such data must imply adequate perform-ance on the large molecule systems. The secondmethod is the use of ``decoy'' sets (Park & Levitt,1992). That is, devising many incorrect structures,and testing whether a function can discriminatebetween these and the experimental conformation.Decoys have been based on lattice models (Park &Levitt, 1992), molecular dynamics trajectories(Wang et al., 1995), crystal structures of differentresolutions (Subramaniam et al., 1996) and aminoacid sequences mounted on radically differentfolds (Holm & Sander, 1992a). World Wide Websites have been established to provide decoy testsets for fold recognition functions (Fischer &Eisenberg, 1996; Fischer, 1997) and for general pro-tein structure prediction functions (Braxenthaleret al., 1997). To date, each function has generallyonly been tested on one or two classes of decoy. Adanger here is that discrimination may be achievedutilizing some speci®c aritfacts of the decoys. Forexample, non-compactness or systematic distortionof detailed features such as abnormal hydrogenbond length. The third approach is to use the func-tion to drive a search for a native like confor-mation, starting from some approximate structure.

Tests of this sort have so far only been reported forphysics-based potentials, and very rarely have theybeen even partially successful (Storch & Daggett,1995). For protein structure prediction to work, thisis the most relevant test, but it is also the most dif-®cult and time consuming. We have opted for test-ing against decoys, and have used as wide a rangeof types as possible, taking advantage of the testsets available in PROSTAR, the Protein PotentialTest Site (Braxenthaler et al., 1997).

Our goal is to develop a discriminatory functionthat will work well at identifying the best confor-mation among a set of incorrect or approximateconformations. To accomplish this, we derive pair-wise distance-dependent all-atom conditional prob-ability functions that represent atom±atompreferences in a residue speci®c manner. We evalu-ate the performance of these functions by seeinghow well they distinguish correct conformations ofan amino acid sequence from incorrect or approxi-mate (decoy) conformations. We perform thisevaluation for a wide variety of decoy types. Wecompare this discriminatory function to three moreapproximate representations to determine theeffect of decreasing detail in the representation.Two of the approximate representations treat com-binations of atoms as single ``virtual atoms''. Thethird approximate representation, a simple contact-based discriminatory function, is used to illustratehow much of the discriminatory information isobtained from compactness alone (Bahar &Jernigan, 1997). We discuss the implications of thethese results for protein structure prediction andmodel re®nement.

Methods

We will describe two formalisms. The ®rst com-putes the conditional probabilities, and the secondcomputes the free energies, of pairwise atom±atompreferences in proteins using statistical obser-vations on native structures. We make the obser-vation that these two formalisms are equivalent forall practical purposes. However, it is more straight-forward to think of pairwise preferences of atomsin proteins in terms of probabilities rather than interms of free energies: the Boltzmann formalismassumes an equilibrium distribution of atom±atompreferences, the physical nature of the referencestate is not clear, and the probability of observinga system in a given state must change with respectto the temperature (Moult, 1997).

The conditional probability formalism

We divide the possible conformations of a struc-ture or piece of structure into two subsets, the setof conformations we will regard as correct (i.e.native conformations, having low rms deviationfrom the experimental structure), {C}; and the rest,the set of incorrect structures {I}. We consider a setof properties of the structure {yk}. The propertiesmay be any aspects of protein structure that differ

896 All-atom Distance-dependent Discriminatory Function

signi®cantly between the set of incorrect confor-mations we are dealing with and the confor-mations we regard as correct. Examples might bethe strength of electrostatic interactions, close pack-ing, exposure of non-polar groups to solvent, or, asin the present application, simply a set of intera-tomic distances within a structure, {dij

ab}, where dijab

is the distance between atoms i and j, of type a andb, respectively. We wish to evaluate P(C|{dij

ab}), theprobability the structure is a member of the ``cor-rect'' set, given it contains the distances {dij

ab}. Toperform this evaluation, we express P(C|{dij

ab}) asan explicit function of probabilities that can bederived from experimental structures. These areP(dij

ab|C), the probability of observing a distance dbetween two atoms i and j of types a and b in acorrect structure, and P(dij

ab), the probability ofobserving such a distance in any structure, corrector incorrect. Also required is P(C), the probabilitythat any structure picked at random is a memberof the ``correct'' set. We use the general and exactchain rule for conditional probabilities (Mostelleret al., 1970):

P�C� � P�dijabjC� � P�dij

ab� � P�Cjdijab� �1�

and express the probabilities of observing the setof distances as products of the probabilities ofobserving each individual distance, making theimportant approximation that all distances areindependent of one another:

P�fdijabgjC� �

Yij

P�dijabjC�; P�fdij

abg� �Y

ij

P�dijab�

�2�It then follows that:

P�Cjfdijabg� � P�C� �

Yij

P�dijabjC�

P�dijab�

�3�

P(C) is a constant independent of conformation fora given amino acid sequence, and is not consideredfurther. Note however that its omission means thatscores from different sequences cannot be com-pared. It is useful to use a log form of equation (3),both to scale the quantities involved to a smallrange, and to give a form similar to that of a poten-tial of mean force. We use a scoring function Sproportional to the negative log conditional prob-ability that the structure is correct, given a set ofdistances:

S�fdijabg� � ÿ

Xij

lnP�dij

abjC�P�dij

ab�/ ÿ ln P�Cjfdij

abg� �4�

Given an amino acid sequence conformation, wecalculate all the distances between all pairs of atomtypes and compute the score S on the left-handside by summing up the probability ratios assignedto each distance between a pair of atom types.

To make use of equation (4), we need distri-butions of P(dab|C) and P(dab) for all combinations

of atom types at all observed distances. TheP(dab|C) are derived directly from experimentalstructures as follows:

Using a set of known structures from theBrookhaven Protein Data Bank (PDB) (Bernsteinet al., 1977), we can make observations of atom±atom contacts in particular distance bins. We com-pute the probability of observing atom type a andatom type b in a particular distance bin (with mid-point value d) in a native conformation C,P(dab|C),like so:

P�dabjC� � f �dab� � N�dab��dN�dab� �5�

Here N(dab) is the number of observations of atomtypes a and b in a particular distance bin d. Thedenominator is the number of a±b contactsobserved for all distance bins. We assume that thefrequency distributions obtained from the data-base, f(dab), here and elsewhere, represent the prob-abilities.

For example, if the number of lysine Nz and glu-tamate Od1 (KNz-EOd1) contacts within a distancerange of 4.0 to 5.0 AÊ was found to be equal to 10in the data set, and the total number of KNz-EOd1

contacts observed in all distance bins was 100, thefrequency of KNz±EOd1 contacts at distance bin 4.5is 10/100 � 0.1.

P(dab) may be thought of as describing a prop-erty of the reference state, speci®cally, the prob-ability of seeing a separation d between atomstypes a and b in any possible structure, correct orincorrect. In terms of Bayesian statistics (Mostelleret al., 1970), P(dab) is a prior distribution, that is,our knowledge of the interatomic distances in pro-teins for a particular pair of atom types, before wehave observed any experimental structures. Anadvantage of the Bayesian viewpoint comparedwith the requirements of the potential of meanforce is that it more clearly allows a choice of anyprior distribution convenient to our purpose. Manydifferent prior distribution choices are possible. Forexample, we may assume that all distances areequally probable (Subramaniam et al., 1996), orthat the set of distances observed in some randomcoil model is appropriate, a choice often made inpotential of mean force work (Avbelj & Moult,1995b) because of the relevance to the physicalfolding process. Such distributions are perfectlyvalid, but do not necessarily make the best use ofthe available information. For example, it has beenshown that much of the apparent signal in a poten-tial of mean force using a random coil referencestate originates in the non-speci®c compactness ofcorrect structures (Jernigan & Bahar, 1996; Bahar &Jernigan, 1997). To obtain the most informationsupporting or refuting the hypothesis that we areconsidering a native structure, we would like touse a distribution that includes all the possibleincorrect structures that might be encountered, butone that is not unnecessarily broad. In a particularapplication where a large number of representative

All-atom Distance-dependent Discriminatory Function 897

conformations have been generated, for example,when examining all possible conformations of aloop in a protein molecule (Fidelis et al., 1994), it ispossible and presumably advantageous to compilethe reference distributions from the available cor-rect and incorrect conformations. More typically,the sample of incorrect conformations available istoo sparse to allow this direct approach. In the pre-sent work, we deal only with a prior distributionthat is generally useful for structure prediction.Methods used to predict protein structure usuallyproduce reasonably compact models, even whenthe result is incorrect. Thus, a good choice of priordistribution is that found in the set of possiblecompact conformations.

We assume that averaging over different atomtypes in experimental conformations is an ade-quate representation of the random arrangementsof these atom types in any compact conformation.Then we can approximate P(dab), the probability of®nding atom types a and b in a distance bin d inany compact conformation, native or otherwise, asequal to P(d), the probability of seeing any twoatom types in a distance bin d:

P�dab� � P�d� � f �d� � �abN�dab��d�abN�dab� �6�

Here, �abN(dab) is the total number of contactsbetween all pairs of atom types in a particular dis-tance bin d, and the denominator is the total num-ber of contacts between all pairs of atom typessummed over the distance bins d.

There are three principal approximations in thisformalism:

(1) All the interatomic distances are independentof one another. This is clearly not correct: if atomA is close to atom B and to atom C, then we knowthat atoms B and C must also be fairly close. Forexample, if the Cd2 of leucine is close to the Cd1

atom of another leucine, it is likely that the Cd1 ofthe ®rst residue will also be close to this atom. Inthis sense, there is some degree of double countingof interactions. Although it is possible to write theprobabilities in a way that takes these correlationsinto account, much more data would be requiredto obtain adequate statistics, so that it is dif®cult toassess the impact of this approximation.

(2) The distribution of interatomic distances inproteins is not signi®cantly different in differentenvironments. This is not universally true. Twoexamples: the distribution of distances betweenionic groups on the surface of proteins is verydifferent from that of buried pairs (Wodak &Rooman, 1993), and the probability of a hydrogenbond being formed between main-chain groupsvaries sharply as a function of the sequence separ-ation between them (Sippl et al., 1996). In principle,given a large enough set of experimental struc-tures, it is possible to compile separate probabilitydistributions for each such environmental variable,subject to availability of suf®cient observations.

We have not attempted to do that in the presentstudy.

(3) The interatomic distance probability distri-butions derived from experimental structures willbe suf®ciently sharp, and different for differentatom types, that a useful discriminatory signal willbe obtained. We examine the extent to which thisis true later in this work.

The potential of mean force

The potential of mean force method for discrimi-nating between correct and incorrect protein struc-tures rests on three assumptions:

(1) The total free energy of a protein moleculerelative to some reference state, �Gtot, can beexpressed as a sum of the relative free energy�G(R) of a number of individual contributions,where R represents the value of a ``reaction co-ordinate''. As with the conditional probability anal-ysis, the reaction co-ordinate may be any con-venient measure of property of the molecule(Beveridge & DiCapua, 1989), such as the separ-ation of speci®c types of atoms (Bahar & Jernigan,1997) or residues (Sippl, 1990; Jones et al., 1992),the distance between hydrogen bonding groups(Sippl et al., 1996), or the value of the local mainchain electrostatic energy (Avbelj & Moult, 1995b).We again consider the particular case of a reactionco-ordinate representing the distance of d betweenatoms i and j of type a and b, respectively, so that:

�Gtot �X

ij

�G�dijab� �7�

Although enthalpy may be reasonably approxi-mated by such a pairwise sum of interactions, andtypically is in empirical force ®eld application(McCammon & Harvey, 1987), it is not generallyvalid to do this for free energy, because entropycontributions can not be regarded as additive(Mark & van Gunsteren, 1994). This issue is relatedto the assumption of independence of contributionsin the conditional probability formalism discussedearlier, but the nature of the approximation isharder to express quantitatively.

(2) The relative free energy of a particular inter-action between any pair of components can bededuced from the inverse of Boltzmann's law:

G�dab� � ÿkT lnr�dabjC�r�dab� �8�

where k is the Boltzmann constant, T is the absol-ute temperature, r(dab|C) is the density of groups aand b at a separation d in experimental proteinstructures, and r(dab) is the same density in thereference state (McQuarrie, 1976). Two assertionsunderlie this expression: each occurrence of dab inexperimental structures is independent of anything else in the environment, and the distributionover a set of observations of dab in a number ofdifferent proteins should obey Boltzmann's


distribution. Sippl et al. (1996) have arguedcogently that the latter point is true.

(3) The lowest free energy conformation rep-resents the native state (often called the thermo-dynamic hypothesis) (An®nsen, 1973).

Substituting equation (8) into equation (7), wehave:

Gtot � ÿkTX

ij

lnr�dij

abjC�r�dij

ab��9�

This expression is similar to our scoring functionfor the conditional probability formalism (equation(4)), except that we are dealing with the density ofstates with particular distance values, rather thanthe probability. However, such densities are alsoequivalent to the frequencies de®ned in equation(4), and are evaluated inexactly the same way fromexperimental structures. Thus, seeking the confor-mation with the lowest free energy based on apotential of mean force analysis is formally equiv-alent to seeking the one with the largest probabilityof presenting a native structure in terms of a Baye-sian statistics. Because of the less clear assumptionsinvolved in the potential of mean force analysis,we prefer to use the simpler Bayesian probabilityformalism in this work.

The residue-specific all-atom probabilitydiscriminatory function

The conditional probabilities for the residue-speci®c all-atom probability discriminatory func-tion (RAPDF) are compiled from frequencies ofcontacts between pairs of atom types in a databaseof protein structures. All non-hydrogen atoms areconsidered, and the description of the atoms is resi-due speci®c, i.e. the Ca of an alanine is differentfrom the Ca of a glycine. This results in a total of167 atom types. Contacts between atoms within asingle residue are excluded from the counts. Wedivide the distances observed into 1.0 AÊ bins ran-ging from 3.0 to 20.0. Contacts between pairs ofatom types in the 0.0 to 3.0 AÊ range are placed in aseparate bin, resulting in total of 18 distance bins.Table 1 lists the atom types used for this discrimi-natory function.

A table containing the negative log conditionalprobability scores, S, for all pairs of atom types forall distances is compiled from a database of knownstructures using equation (4). Given an amino acidsequence in a particular conformation, the scores ofall contacts between pairs of atom types with dis-tances that fall within the distance cutoff above

(ignoring contacts between atoms within a singleresidue) are summed up to yield the total scoresupporting the hypothesis that the conformation iscorrect. All counts were initialized to one to avoidtaking the log of zero. This procedure is used forall probability discriminatory functions describedin this paper.

The residue-specific virtual-atom probabilitydiscriminatory function

The conditional probabilities for the residue-speci®c virtual-atom probability discriminatoryfunction (RVPDF) are compiled from frequenciesof contacts between pairs of virtual atom types in adatabase of protein structures. A virtual atomapproximation similar to the one developed byHead-Gordon & Brooks (1991) is used. This rep-resentation combines a group of atoms into asingle virtual atom type by averaging over the cor-responding x, y, and z Cartesian coordinates of theindividual atoms. Aside from labeling conventions,the present representation differs from the originalone in the determination of the virtual centers forvirtual atoms vNH and vOH, taken here to be rep-resented by the positions of the N and O atoms,respectively, rather than the geometric centres. Thedistance bins are the same as in the RAPDF. Eachof the virtual atom types is pre®xed by the type ofthe residue, resulting in 105 different virtual atomtypes. Table 2 lists the virtual atoms used for thisdiscriminatory function and the combinations ofatom types they represent.

The non-residue-specific virtual-atomprobability discriminatory function

The non-residue-speci®c virtual-atom probabilitydiscriminatory function (NVPDF) differs fromRVPDF only in that the virtual atom types are notresidue-speci®c. For example, all vCa atom typesare considered the same, and all vCb atom typesare considered the same, and so on. The total num-ber of virtual atom types considered under thisapproximation is 21.

The contact discriminatory function

To examine how well non-speci®c compactnessalone can discriminate between correct and incor-rect conformations relative to the three PDFs(RAPDF, RVPDF and NVPDF) described above,we use a simple contact discriminatory functionwhich assigns a score of ÿ1.00 for every atom±atom contact within 6.0 AÊ in an amino acid

Table 1. List of atom types used in the all-atom residue speci®c probability discriminatory function(RAPDF). Each of these atom types is pre®xed by the type of the residue, resulting in 167 differentatom types

C Ca Cb Cd Cd1 Cd2 Ce Ce1 Ce2 Ce3 Cg Cg1

Cg2 CH2 Cz Cz1 Cz2 N Nd1 Nd2 Ne Ne1 Ne2 NH1

NH2 Nz O Od1 Od2 Oe1 Oe2 Og Og1 OH Sd Sg


sequence conformation, excluding contacts withina single residue.

The linearly interpolated residue-specificall-atom probability discriminatory function

The three PDFs (RAPDF, RVPDF, NVPDF) usediscrete bins to compile the probability scores. Thisleads to a situation where the score for observingany distance within a bin width is the same for agiven pair of atom types. In reality, the preferencesbetween atom types vary in a continuous manneras the distances between the contacts vary. Wethus also evaluate the total score of a conformationby linearly interpolating between the scores for the

discrete bins to uniquely de®ne the score for agiven distance. The linear interpolation is used tomodify the RAPDF.

Interpolation is based on two assumptions: (1)The value at the mid-point of the distance bins isgiven by the score for that distance bin. (2) There isa linear relationship between probability valuesobserved for neighbouring bins.

Thus, if da represents the actual distance encoun-tered, dl represents the mid-point of the closest dis-tance bin value on the left-hand side, dr representsthe mid-point of the closest distance bin on theright-hand side, and Sl represents the score for dl,and Sr the score for dr. Sa, the score for da is givenby:

Table 2. List of virtual atom types used in the residue-speci®c and non-residue-speci®c virtual-atom probability discriminatory functions (RVPDF and NVPDF,respectively). The table lists the virtual atom type, the atom types of the com-ponents, and the residues it is present in (in one-letter code). For the RVPDF,each of the virtual atoms is pre®xed by the type of residue (in one-letter code),resulting in 105 different virtual atom types.

Virtual atom Components Present in residue

vCO C � O allvNH1 N allvNH2 Nd2 NvNH2 Ne2 QvNH2 NH1 RvNH2 NH2 RvNH3 Nz KvNHE Nd1 HvNHE Ne2 HvNHE Ne RvNHE Ne1 WvCOS Cg � Od1 NvCOS Cd � Oe1 QvCSC Cg � Sd � Ce MvCCC Cb � Cg1 � Cg2 VvCCC Cg � Cd1 � Cd2 LvC3R Cb � Cg1 � Ce1 FvC3R Cd2 � Ce2 � Cz FvC3R Cb � Cg1 � Ce1 YvC3R Cd2 � Ce2 � Cz YvC3R Cd2 � Ce3 � Cz3 WvC3R Ce2 � Cz2 � CH2 WvCOO Cg � Od1 � Od2 DvCOO Cd � Oe1 � Oe2 EvOH Og SvOH Og1 TvOH OH YvCC Cb � Cg2 IvCC Cg1 � Cd1 IvCC Cb � Cg KvCC Cd � Ce KvCC Cg � Cd RvCC Cb � Cg2 TvCCP Cb � Cg PvCCR Cg � Cd1 WvCCH Cg � Cd2 HvSH Sg CvCE1 Ce1 HvC Cz RvCH3 Cb AvCH2 Ca GvCH2 Cb C,D,E,F,H,K,M,N,Q,R,S,W,YvCH2 Cg E,QvCH2 Cd PvCH Ca A,C,D,E,F,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y


Sa � Sl � �Sr ÿ Sl� � da ÿ dl

dr ÿ dl

� ��10�

Low counts analysis

We have investigated the effect of ®nite countsused to de®ne the frequencies required for a PDF(equation (4)), by introducing a simple model ofcount variation. For each set of counts obtainedfrom observations of particular pairs of atom typesin the database in equations (5) and (6), we assumea Gaussian distribution for variation in counts inrepeated experiments and modify the countsaccordingly. A set of possible alternative counts isthen obtained using the equation:

N0 �X

m

Rÿm

2

!� s�N �11�

where N is the observed count value, N0 is themodi®ed count value, �mR represents the sum ofm random numbers R in the interval [0,1] (m � 12),and s represents the standard deviation in theobserved counts, assumed to be N1/2.

We generate 100 different sets of modi®edcounts each of the terms in equation (4) for the 18distance bins using the above procedure, thus pro-viding 100 different sets of conditional probabilitycurves that might have been obtained from a pro-tein database of the same type and size used inthis work. These modi®ed curves are compared tothe observed one for particular pairs of atomtypes.

Selection of the structure library for obtainingconditional probabilities

Table 3 lists the PDB codes of the 265 structuresthat were used for compiling the conditional prob-abilities for the discriminatory functions describedabove. The PDB codes were initially obtained fromthe CATH database and are a set of non-homolo-gous (less than 30% sequence identity between anyproteins in the set) X-ray structures better than3.0 AÊ resolution (Orengo et al., 1993). Structureswith multiple side-chain conformations have beenmodi®ed such that only the side-chain confor-

Table 3. List of PDB (Bernstein et al., 1977) codes of the 265 protein chains used for compi-lation of conditional probabilities. In cases where a single chain of the protein is used for thecompilation, the chain identi®er is shown

1351 1c5a 1gox 1oma 1tabI 2hpdA 3pgk1aaf 1cauA 1gpb 1omf 1ten 2ltnA 3pgm1aak 1cdb 1gpr 1ovb 1tfi 2ltnB 3rubS1aba 1cde ahcc 1pba 1tgl 2mev4 3sc2A1aco 1cdg 1hgeA 1pfa 1thg 2mnr 3sc2B1acp 1cewI 1hgeB 1pdc 1tml 2msbA 4enl1add 1cmbA 1hleA 1pfkA 1tnfA 2nckL 4fgf1adn 1cobA 1hmy 1pgd 1tplA 2ohxA 4gcr1ads 1colA 1hoe 1pgx 1tpm 2ovo 4htcI1ak3A 1coy 1hsbA 1pha 1ttaA 2pia 4mt21ala 1cpcA 1hstA 1phh 1ttf 2plv4 4sbvA1alkA 1cpt 1huw 1pii 1ula 2pmgA 4sgbI1aozA 1csc 1hyp 1pkp 1utg 2polA 5fd11apa 1cseE 1ifc 1plc 1vil 2reb 5p211apmE 1cseI 1ipd 1poa 1vsgA 2rhe 5pti1aps 1ctf 1isuA 1poxA 1wsyA 2rn2 5rubA1arb 1d66A 1kst 1ppn 1wsyB 2sicI 5timA1arqA 1dhr 1lab 1prcC 1xis 2sn3 6insE1atnA 1dmb 1lct 1prcH 1ycc 2sns 7aatA1atr 1eca 1lfi 1prcL 1ysaC 2stv 7catA1atx 1ede 1lis 1ptf 1zaaC 2tgi 7rsa1ayh legf 1lla 1pyaA 256bA 2tmdA 8abp1bal 1etrL 1lmb3 1pyaB 2aaiB 2tmvP 8fabB1bbpA 1ezm 1ltsA 1pyp 2bbkH 2tsl 8rxnA1bbt1 1fbaA 1ltsC 1raiA 2bbkL 2tscA 9wgaA1bbt2 1fc2D 1lyaA 1raiB 2bopA 2yhx1bbt3 1fiaA 1mat 1rcb 2bpal 3b5c1bbt4 1fkb 1mfaH 1rec 2bpa2 3bcl1bds 1fnr 1mfa: 1rfbA 2cas 3blm1bgc1 1fus 1minA 1rhd 2cba 3cla1bgc2 1fxd 1minB 1ribA 2cdv 3cts1bgh 1gal 1mypA 1rip 2cmd 3dfr1bha 1gatA 1mypC 1rro 2cpl 3ebx1bia 1gd1O 1nar 1rveA 2ctc 3ecaA1bllE 1gdhA 1nipA 1sbp 2ctvA 3gapA1bmv1 1gky 1noa 1shaA 2cyp 3grs1bmv2 1glaG 1nrcA 1shg 2dnjA 3il8A1brnL 1glt 1nrd 1sim 2er7E 3mdsA1btc 1gluA 1nscA 1sryA 2gstA 3monA1bw3 1gof 1ofv 1stp 2hhmA 3monB


mations with atoms having the highest occupancyand/or lowest temperature factors are used.

Decoy set generation

The decoy sets used were obtained from the Pro-tein Potential Site (PROSTAR) (Braxenthaler et al.,1997) and can be divided into two classes. Decoysets in class I discriminate between one correct andone or more incorrect or approximate confor-mations. Decoys sets in class II are a set of approxi-mate conformations that vary in rmsd to theexperimental conformation, excluding the exper-imental conformation itself.

Table 4 lists the decoy sets in class I. The MIS-FOLD decoy set, generated by Holm & Sander(1992a), consists of 25 examples of pairs of proteinswith the same number of residues in the chain, butdifferent sequences and conformations. Sequenceswere swapped between members of a pair, andside-chain packing annealed using a Monte Carloprocess (Holm & Sander, 1992a). These provideinappropriate environments for most of the side-chains in the structures.

The ®rst experiment on the Critical Assessmentof protein Structure Prediction methods (CASP2)produced a set of 42 comparative models of sixdifferent proteins (Mosimann et al., 1995). Theseform the CASP1 decoy set. The models vary in Carmsd to the corresponding experimental confor-

mation, ranging from 0.53 to 7.40 AÊ , depending onthe dif®culty of the model building process.

The IFU decoy set is based on a set of 44 pep-tides which are proposed to be independent fold-ing units as determined by local hydrophobicburial and experimental evidence (Unger & Moult,1991). The set consists of the structure of the pep-tides as observed in the complete experimentalprotein structure, and a conformation of the frag-ment generated with a Genetic Algorithm(Pedersen & Moult, 1997) and a physics-basedpotential of mean force (Avbelj & Moult, 1995a).

The PDBERR decoy set consists of structuresdetermined using X-ray crystallography whichwere later found to contain errors, and the corre-sponding corrected experimental conformations(Braxenthaler et al., 1997). The SGPA decoy set con-sists of two conformations generated by moleculardynamics simulations starting with the Strepro-myces griseus Protease A experimental structure(PDB code 2sga) (Avbelj et al., 1990), and the 2sgaexperimental structure.

Among all the decoy sets referenced in thispaper, only the LOOP decoy set belongs to class II.This decoy set consists of sets of conformations forshort loops (four or ®ve residues) that were system-atically generated using the methods of Moult &James (1986) and Fidelis et al. (1994). The loop con-formations are evaluated in the context of the restof the experimental structure. Table 5 gives detailsabout each of the loops in the LOOP decoy set.

Table 4. Class I decoys. The name used to identify the speci®c decoy set, the number of decoys in the set, and Carmsd range of the decoys to the experimental structure, and the appropriate references are given

Decoy Number Ca

set of RMSDname decoys range (AÊ ) Reference

MISFOLD 25 8.66±22.43 (Braxenthaler et al., 1997; Holm & Sander, 1992b)CASPI 42 0.53±7.40 (Mosimann et al., 1995; Braxenthaler et al., 1997)IFU 44 0.21±10.02 (Braxenthaler et al., 1997; Pedersen & Moult, 1997)PDBERR 3 0.81±13.21 (Braxenthaler et al., 1997)SGPA 2 1.91±2.06 (Braxenthaler et al., 1997; Avbelj et al., 1990)

Table 5. Class II decoys. The LOOP decoy set is a set of loop conformations that weresystematically generated using the methods of Moult & James (1986) and Fidelis et al.(1994). For each loop, the number of different loop conformations and the all-atomrmsd range is given

PDB Residue Number of All-atom rmsdcode range Sequence conformations range (AÊ )

1 3dfr 20±23 PWHL 394 0.75±4.582 3dfr 27±30 LHYF 1390 0.81±3.473 3dfr 64±68 HQED 71439 0.89±4.194 3dfr 120±124 GSFEG 474 0.57±2.915 3dfr 136±139 FTKV 10782 1.39±2.156 2sga 35±39 TNISA 15453 1.20±3.177 2sga 97±101 GSTTG 2079 0.60±3.348 2sga 116±119 YGSS 26572 0.47±4.919 2sga 132±136 AQPGD 206 0.97±2.58

10 2fbj 265±269 HPDSG 393 0.96±3.9011 2hfl 264±268 LPGSG 339 1.11±2.81


Decoy set evaluation

For Class I decoy sets, the ratio of the score ofthe incorrect conformation to that of the correctconformation is determined. A discrimination ratioless than 1.0 (or log discrimination ratio less than0.0) indicates that the discriminatory function isable to distinguish between the correct confor-mation and the incorrect one. The lower the logdiscrimination ratio, the more reliable the discrimi-nation.

For class II decoy sets, two evaluation measuresare used. One measure is the probability of choos-ing by chance a conformation with an equal orlower rmsd than the one selected by the discrimina-tory function. That is, the number of conformationsthat have the same or lower all-atom rmsd, as thermsd of the conformation with the lowest score isdivided by the total number of conformations inthe set. The second measure is whether or not aconformation with an all-atom rmsd within 1.0 ofthe lowest rmsd conformation present in the decoyset is selected by the discriminatory function.

Results of a complete decoy set are summarizedby calculating the average discrimination ratio of allthe decoys and/or as a percentage of decoys cor-rectly discriminated. Further details on the evalu-ation protocols and the decoy set generation aregiven by Braxenthaler et al. (1997). Proteins in thestructure library that are more than 30% identical insequence to a protein in a particular decoy set werenot included in the compilation of the discrimina-tory function, i.e. the procedure was jack-knifed.

Results

The residue-specific all-atom discriminatoryfunction performs the best across a widevariety of decoys

An ideal discriminatory function is one that cor-rectly discriminates 100% of class I decoys andselects conformations with low all-atom rmsds(within 1.0 AÊ of the conformation with the lowestrmsd) in the LOOP decoy set. In addition, the aver-age discrimination ratios should be as low as poss-ible.

The RAPDF comes close to achieving the goal of100% discrimination, and performs signi®cantlybetter than the RVPDF and NVPDF. Figure 1a andb shows that the RAPDF has the best average dis-crimination ratio and the largest percentage ofdecoys correctly discriminated across a range ofdecoy sets. In the case of the MISFOLD, PDBERR,and SGA decoy sets, it correctly discriminates100% of the decoys in the set. Further, the averagediscrimination ratios show that the scores for thecorrect conformations in the MISFOLD, PDBERR,and SPGA decoy set are on average lower (better)by 60%, 50% and 25%, respectively, compared tothe scores for the incorrect conformations.

For the IFU decoy set, the percentage of confor-mations correctly discriminated by the RAPDF is

73%. The average difference between the scores forthe correct conformations and the incorrect confor-mations for the RAPDF is 10%.

Figure 2 shows the results in more detail forsome of decoys in the CASP1 set, which is a set ofcomparative models and their correspondingexperimental structures. Here, the overall percen-tage of decoys correctly discriminated by theRAPDF is 93%, for the 42 decoys in this set. Theaverage difference between the scores for the cor-rect conformations and the incorrect conformationsfor the RAPDF is 15%. The Figure shows theresults for only the models with the lowest Carmsd to the corresponding experimental structures.The RAPDF does not perform as well as the twoother discriminatory functions in terms of the dis-crimination ratio in two instances (see the barsrepresenting nm23 and hpr in Figure 2).

It is less obvious which discriminatory functionsperform best from the log probability and all-atomrmsd data for the LOOP decoy set (Figure 3). How-ever, if we examine the actual rmsd values, wenote that for 10 out of 11 loops, the RAPDF picks aconformation that is within 1.0 AÊ of the lowestrmsd conformation in the sample space (where theexperimental structure is not included). RVPDF,NVPDF, and CDF have ratios of 5/11, 9/11, and8/11, respectively.

Relationship between probability scoreand rmsd

Plotting the rmsd versus the score for a set ofconformations from the LOOP decoy set (Figure 4)shows that the RAPDF correlates reasonably wellwith the all-atom rmsd up to quite large rmsds.This suggests that this discriminatory function canbe useful in simulations that attempt to get closerto the native conformation starting from a distantconformation.

Discriminatory power decreases uponsuccessive approximations

There is an overall progressive deterioration inthe signal going from the all-atom residue-speci®cdiscriminatory function to the CDF, as thedescription gets more and more approximate(Figure 1). One major exception is the LOOPdecoy set (Figure 3) where the NVPDF performssigni®cantly better than the RVPDF. The otherexception is apparent in Figure 1b, for the MIS-FOLD decoy set.

The compactness term alone is useful fordiscriminating between correct andincorrect conformations

The contribution of the compactness term, whichis measured by the CDF, is better than some of theother PDFs for certain decoy sets (Figure 1b underMISFOLD, and Figure 3). Further, in a majority ofthe decoy sets, it is adequate to distinguish


between correct and incorrect conformations mostof the time (Figure 1b under PDBERR and SGA,and Figure 3).

Using a large distance cutoff helpsin discrimination

Figure 5 shows the percentage of decoys cor-rectly discriminated at different distance cutoffs(5.0, 10.0, 15.0, and 20.0 AÊ ). There is a signi®cantadvantage overall to using a larger distance cutoff.A distance cutoff of at least 15.0 AÊ is necessary toaccurately discriminate all the 25 decoys in theMISFOLD decoy set. In the cases of the PDBERRand SGPA decoy sets, it does not appear to make adifference which cutoff is chosen. The average dis-

crimination ratios for the four different cutoffsshow a similar trend for all decoy sets.

Comparison of the contribution ofelectrostatics and non-electrostatics terms

To gain some insight into the nature of the signalin the RAPDF, we partition the discriminatoryfunction according to contributions from electro-static and non-electrostatic contacts. Interactionsbetween any combination of nitrogen and oxygenatoms are de®ned to be electrostatic in nature. Allother contacts are considered non-electrostatic.Figure 6 compares the effectiveness (by measuringthe percentage of conformations correctly discrimi-nated) of using only the electrostatic or non-elec-

Figure 1. Comparison of the per-formances of the residue-speci®call-atom probability discriminatoryfunction (RAPDF), the residue-speci®c virtual-atom probabilitydiscriminatory function (RVPDF),the non-residue-speci®c virtual-atom probability discriminatoryfunction (NVPDF), and the contactdiscriminatory function (CDF) forclass I decoy sets. The log of theaverage discrimination ratiosbetween incorrect and correct con-formations for the ®ve decoy setsin class I are shown in (a). Thelower the log average discrimi-nation ratio, the better the discrimi-nation. The percentage of decoysthat were correctly discriminatedwithin a decoy set are shown in(b). In all cases, the most detailedfunction, RAPDF, performs best,and achieves 100% discriminationfor three out of ®ve decoy sets.


Figure 2. Comparison of the per-formances of the four discrimina-tory functions (RAPDF, RVPDF,NVPDF, and CDF) for selecteddecoys in the CASP1 set. The logdiscrimination ratios between theexperimental conformation and themodel is shown. The model withthe lowest Ca rmsd to the corre-sponding experimental confor-mation is chosen from a given setof models for this evaluation. Theidenti®ers used to label the decoysare those used by Mosimann et al.(1995). Best discrimination isachieved with the RAPDF, exceptin two cases where all the decoyshave a low Ca rmsd to the exper-imental structure.

Figure 3. Comparison of the per-formances of the residue-speci®call-atom probability discriminatoryfunction (RAPDF), the residue-speci®c virtual-atom probabilitydiscriminatory function (RVPDF),the non-residue-speci®c virtual-atom probability discriminatoryfunction (NVPDF), and the contactdiscriminatory function (CDF) forthe LOOP decoy set. The all-atomrmsd of the conformation selectedby each discriminatory function (a)and the log probabilities of observ-ing an equal or lower rmsd bychance (b) are shown. The loopnumbers in the horizontal axis cor-respond to the numbers in Table 5,column 1. Performance of thedifferent PDFs is more varied withthis decoy set, although the RAPDFpicks out a conformation with anrmsd that is within 1.0 AÊ of thelowest available in ten out of 11cases.


trostatic terms, relative to the full PDF. Althoughthe non-electrostatic terms alone are adequate forcorrect discrimination in most cases, the electro-static term does not play a signi®cant role inenhancing the signal. This is particularly noticeablein the CASP1, IFU, and LOOP decoy sets. Theaverage discrimination ratios for the decoy setsusing the electrostatics, non-electrostatics, andcombined terms (data not shown) indicate a simi-lar trend except in the case of the CASP1 and IFUdecoy sets, where the ratios are similar regardlessof the terms used.

Linear interpolation improves discrimination

Comparison between the IRAPDF and theRAPDF (Figure 7) shows that linear interpolationhelps discriminate between correct and incorrectconformations. This is most obvious in the LOOPdecoy sets where the improvement is quite dra-matic, but for each decoy set there is someimprovement. For all decoys, the percentage ofstructures correctly discriminated is identicalwhether or not the conditional probabilities areinterpolated.

The problem of sparse data for compilation ofprobabilities is negligible

We examine the uncertainty due to the ®nitecounts in the individual contributions to P (C|dab),the probability of observing a correct conformationgiven a contact between atom types a and b at adistance d:

P�Cjdab� � P�dabjC�P�dab� �

N�dab�=�dN�dab��abN�dab�=�d�abN�dab� �12�

Equation (12) is formed by expanding equation (4)in terms of equations (5) and (6). Detailed expla-nations for these terms are given in the Methodssection. To begin our analysis, let us examineTable 6 for the nature of the raw counts that weencounter in our observations for each of the fourterms in the above expression. Since both terms inthe denominator in equation (12) are sums over allatom types, they are well determined and do notlead to signi®cant uncertainty in the probabilities.In the denominator, �dN(dab) is also well deter-mined, even for the rarest atom types. Thus, all theuncertainty arises for the individual N(dab) counts,which may indeed be low: in a rare atom pair, wemight count zero or one observation, for example,and have an uncertainty of a factor of 2 in theresulting probability.

Figure 4. Performance of the residue-speci®c all-atomprobability discriminatory function (RAPDF) for aselected loop in the LOOP decoy set. The all-atom rmsdAÊ versus the negative log conditional probability of the26,572 conformations for the 2sga 116 to 119 LOOP setis shown. The negative log conditional probability of aconformation increases as the rmsd progressively getsworse, a useful property for long-range modelre®nement.

Figure 5. Performance of the resi-due-speci®c all-atom probabilitydiscriminatory function (RAPDF) atdifferent cutoffs. The percentages ofstructures correctly discriminatedfor six decoy sets at four differentcuttoffs is shown. In the case of theLOOP decoy set, ``correct'' dis-crimination is de®ned to be theselection of conformation that iswithin 1.0 AÊ of the lowest all-atomrmsd conformation for each loop.Generally, discrimination progress-ively improves up to a cutoff of20.0 AÊ .


To analyse the uncertainty in the conditionalprobabilities due to the dataset dependence of thecounts, we examine two atom±atom preferences,cysteine N±tryptophan O (CN-WO), a minimumcounts situation, and isoleucine Ca±leucine Cd2

(ICa-LCd2), an average counts situation.Figure 8 compares the effect of uncertain count

values (generated as described in the Methods sec-tion) on the conditional probabilities for the prefer-ences between these two pairs of atom types.There is signi®cantly more error in the conditionalprobabilities for the worst case (CN-WO) than forthe average case (ICa-LCd2). In the average case,the absolute error in the scores is generally lessthan 0.1 and has a maximum value of about 1.0 (inthe 0 to 3.0 AÊ distance bin). In the worst case, theabsolute error in the scores is generally less than0.5 and has a maximum value of about 1.0.

Relationship between the conditionalprobabilities and the nature of physicalinteractions in proteins

One would expect that the conditional prob-ability distributions should re¯ect the known

physical and entropic effects that are believed to beimportant in determining protein structure (Dill,1990). With so much data, it is not possible to sys-tematically assess this relationship. We have exam-ined a few of the distributions to try to understandthe features in these terms.

We ®rst examine the set of Ca±Ca and Cb±Cb

interaction probabilities (Figure 9). Although thereis a signi®cant spread among the residue types, thecurves show several consistent features. In theCa±Ca distributions, there are two minima,between 3.0 and 4.0 and between 5.0 and 6.0. TheCb±Cb curves show only the second, in the 5.0 to6.0 range. The ®rst Ca±Ca minimum is a simpleconsequence of the covalently determined distancebetween adjacent Ca atoms in the polypeptidechain, around 3.8 AÊ . The origin of the second oneis less obvious. An analysis of the interactions inthe proteins in Table 3 reveals that this featurearises primarily from Ca±Ca distances separatedby two or three residues (i, i � 2 and i, i � 3) in a-helices. Similarly, the single Cb±Cb minimum hasits origin in i, i � 1 and i, i � 2 separations in botha-helices and b-strands.

Figure 6. Comparison of electro-static, non-electrostatic, and com-bined terms in the residue-speci®call-atom probability discriminatoryfunction (RAPDF). The percentagesof structures correctly discrimi-nated for various decoy sets isshown. In general, the non-electro-static terms provide the mostsignal.

Figure 7. Comparison of theresidue-speci®c all-atom probabilitydiscriminatory function (RAPDF) tothe linearly-interpolated version(IRAPDF). For ®ve of the decoysets, the bars represent the log ofthe average discrimination ratio ofthe probabilities between thecorrect and incorrect structures. Forthe LOOP decoy set, the barsrepresent the log average of theprobabilities of ®nding at least onestructure with a lower all-atomrmsd than the one with the bestdiscrimination by chance (i.e. thesum of log probabilities dividedby the number of loops). Inall cases, interpolation improvesdiscrimination.


More dramatic than the consistent features is thelarge spread in values in both plots in the 0.0 to3.0 AÊ bin and in the 3.0 to 4.0 AÊ bin in the Cb±Cbplot. On closer inspection, these turn out to be

mostly irrelevant low count artifacts: there arealmost no such distances in experimental struc-tures. However, since all counts are initialized toone, zero probabilities are not possible, and appar-ent probabilities are dominated by the variation inthe total counts over all bins for a pair of atomtypes, �abN(dab). In practice these probabilities willnot be used to evaluate conformations. A real com-ponent of the signal in the lowest bin for the Cb±Ca plot is provided by cis-peptides: the lowestscores are for distances between Ca atoms of pro-line residues to Ca atoms of other residues, re¯ect-ing the higher frequency of such conformationsbefore prolines. In the Cb±Cb plot, the outlier withthe score of almost ÿ3.0 is for disulphide linkedcysteine-cysteine pairs.

Table 6. Details of the counts obtained when compilingthe conditional probabilities. For each term in equation(12), the minimum counts and the average counts isgiven for all atom types and all distance bins. Only the®rst term, N(dab), contributes signi®cantly to the uncer-tainty in the probabilities

Term Minimum counts Average counts

N(dab) 1 648�dN(dab) 1525 11,708�abN(dab) 1,011,903 9,143,295�d�adN(dab) 164,980,971 164,980,971

Figure 8. Comparison of the effectof counting uncertainties on theconditional probabilities for twopairs of atom types, cysteineN±tryptophan W (CN-WO), wherethe counts are among the lowest inthe database, and isoleucine Ca±leucine Cd2 (ICa-LCd2), where thecounts in the 0.0 to 3.0 AÊ bin dis-tance are similar to the averagecounts in Table 6. The broken lineconnects observed conditionalprobabilities and the points aroundthe broken line represent the vari-ation due to the uncertainty in thecounts (see the Methods section).


Ca±Ca distance plots for particular residuepairs show unique features. For example, differ-ences between valine±valine and alanine±alaninere¯ect the different frequency of occurrence ofthese residues in a and b secondary structures(Figure 10). a-Helices have Ca±Ca distancesbetween 5.0 and 6.0 AÊ for i, i � 1 and i, i � 3pairs, and this is re¯ected in the slightly deeperminimum in the alanine curve compared to thatof valine. Conversely, b strands have distances inthe 6.0 to 7.0 AÊ range for i, i � 2 residues, produ-cing the deeper minimum for the correspondingbin in the valine curve. This is an example of thetype of signal the conditional probabilities incor-porate automatically, but which is not easily cap-tured in a physics-based potential.

Similar signi®cant residue speci®c features canbe found in many of the curves. For example, thespread in values in the lowest distance bins for theN-O curves (Figure 11) re¯ects the different fre-quencies of backbone hydrogen bonding for differ-ent pairs of residue types and is also directlylinked to the variation in frequency of occurrenceof different residues in alpha and beta secondarystructure. Figure 12 shows two extremes of hydro-gen bonding preferences between residue pairs.Not surprisingly, curves for proline nitrogen havevery low probability of close contact with a main-chain oxygen, re¯ecting the inability of this nitro-gen atom to participate in hydrogen bonds. A littlemore subtly, the highest probability of hydrogenbonding is between aspartate N and lysine O,

Figure 9. Plot illustrating the con-ditional probabilities encounteredin the 18 distance bins for all Ca±Ca contacts and all Cb±Cb contacts.The spread at a given distance binillustrates the differences in prob-abilities for the atom types indifferent residues. The average ofthe scores for each bin is connectedby the broken line. Consistent,though weak, minima are seen.


re¯ecting the tendency for i, i � 4 salt bridges ina-helices, and for salt bridges between residues onadjacent b-strands. Figure 11 also shows the curvesfor a side-chain±main-chain interaction, N±aspar-tate Od1. The variation in the curve at low dis-tances re¯ects the different propensity for asparticacid side-chains to interact with the NH groups ofdifferent residue types.

Discussion

Performance of the all-atom residue-specificprobability discriminatory function

The most detailed discriminatory function wehave tested is successful against a wide range ofdecoys. Although there are some failures (dis-

cussed below), this level of performance suggeststhe function will at least be useful for protein struc-ture prediction. Other work (Samudrala & Moult,1997), which assess the predictive power of thisdiscriminatory function in blind tests, indicatesthat it can distinguish correct side-chain and main-chain conformations from incorrect ones in a real-life modeling scenario.

However, these tests do not rule out that thereare sparse false minima in the function's surface,not sampled by the decoys. Further decoy testingis planned, as well as tests searching for exper-imental structures starting from approximate ones.In this connection, it is encouraging to note thatthere is a good correlation between the score fromthe function and the all atom rmsd of a confor-

Figure 10. Plots illustrating theconditional probabilities for alanineCa±alanine Ca (ACa-ACa) andvaline Ca±valine Ca (VCa-VCa).The negative log conditional prob-abilities are plotted against the 18distance bins.


mation (Figure 4); so that a reasonable range ofconvergence can be expected in such searches.

Effect of approximating the detail in thediscriminatory function representation

The more approximate discriminatory functions(RVPDF, NVPDF, CDF) are all able to discriminatebetween the correct and incorrect structures tosome degree, but for the most accurate discrimi-nation across a range of different decoys, an all-atom residue-speci®c representation is necessary(Figure 1a and b).

Effect of the compactness term onpredictive power

As one would expect, the signal from the CDF isconsistently less strong than from the other discri-

minatory functions tested (Figure 1a), except withthe loop decoy set, where it performs competitively(Figure 3). In the case of the PDBERR, SGPA andthe LOOP decoy sets, the signal is suf®cientlystrong to provide overall discrimination in mostinstances. For the MISFOLD, CASP2, and IFUdecoy sets, it is less effective. Other workers havenoted that compactness alone provides a strongsignal when the competing conformations are ran-dom coil like (Jernigan & Bahar, 1996; Bahar &Jernigan, 1997). Here we see that the signal maystill be very signi®cant in much less obvious situ-ations. Before applying this test, we were unawarethat our high rmsd loop conformations tended tobe the least compact. These results provide a goodillustration of the need for multiple types of decoyset in testing a discriminatory function. A particu-lar set may have a speci®c property, such as being

Figure 11. Plot illustrating the con-ditional probabilities encounteredin the 18 distance bins for N-O pre-ferences and preferences betweenmain chain nitrogens and asparticacid Od1 for all residue pairs. Theaverage of the negative log con-ditional probabilities for each bin isconnected by the broken line. Thevariation in the low bin distancesre¯ects the varying propensities fordifferent residues to form theseinteractions.


less compact than the native structures, and excel-lent performance may be obtained from a functionutilizing that signal. But in a real life application,(for example, assessing comparative models) thatsignal may not be present.

Effect of using a large distance cuttoff

Using a 20.0 AÊ distance cutoff results in the mostaccurate discrimination for the RAPDF (Figure 5).The signal from each atom±atom interaction isextremely weak at such large distances (seeFigures 9, 11, 10, and 12), but each atom has a verylarge number of contacts, so that the combined sig-nal still has an impact. It is unlikely that the energyinteraction is signi®cant, except perhaps betweencharged groups. However, the overall tendency ofproteins to be organized ``hydrophobic inside,

hydrophilic outside'' may result in a signal in theprobability curves (Huang et al., 1995).

Contributions of electrostatic and non-electrostatic terms

Both electrostatic and non-electrostatic terms(Figure 6) contribute to discrimination, but neitherone alone performs as well as the combination.With these decoys, the non-electrostatic contri-bution is usually the most signi®cant, an effect thatis most pronounced for the PDBERR and theLOOP decoy sets. The electrostatic contribution isprobably dominated by the main-chain hydrogenbinding, since these probabilities tend to have pro-nounced features (Figure 10), and there are manysuch terms contributing. In the loops set, the back-bone hydrogen bonding groups tend to be more

Figure 12. Plots illustrating the con-ditional probabilities for aspartateN±lysine O and proline N±trypto-phan O. The shorter minimum inthe 0.0 to 3.0 AÊ bin for proline N±tryptophan O re¯ects the inabilityof proline N to hydrogen bond.The deep minima for aspartateN±lysine O re¯ects the tendencyof these residues to salt bridge ina-helices and b-sheets, resulting inmain-chain hydrogen bonds.


solvated than in most of the structure contributingto the probabilities.

Effect of linear interpolation and the problemof sparse data

The discriminatory functions described herewere constructed using the simplest possiblemodels. A more sophisticated model would takeinto account the effects of low counts in the com-putation of the frequencies and would also per-form some sort of ``smoothing'', or interpolation,between the discrete conditional probabilities.

The problem of low counts is less severe withthis formalism than with some others, wherecounts are further subdivided by the sequence sep-aration between the atoms involved (Sippl, 1990;Subramaniam et al., 1996). The total conditionalprobability score for a given protein conformationrepresents a sum over a very large number of indi-vidual terms (in the order of 106 contributions) sothat the effect of the uncertainties is reduced.

Linear interpolation of the conditional probabil-ities in the RAPDF does result in better discrimi-nation for these decoy sets (Figure 7). Thissuggests that the discrete points for each distancebin can be represented by a continuous functionand so used with a gradient-dependent proteinfolding simulation technique.

Effect of aritfacts in the decoy sets

A discriminatory function may be able to selectthe correct conformation by utilizing subtle differ-ences in the incorrect conformations which dis-tinguish them from the experimentally determinedstructures. For example, re®nement of structuresdetermined using X-ray crystallography is usuallydone with programs (like X-PLOR (BruÈ nger, 1992))which effectively restrain particular distances foratom±atom interactions, such as hydrogen bonds.Since the PDFs are parameterized on high resol-ution X-ray crystallography structures, it is import-ant to demonstrate that discrimination of correctconformations from incorrect ones is not based onthis kind of ®ne detail.

For the LOOP decoy set (Figures 3 and 4) this isnot the case, as the criteria for correctness dependon selecting a low rmsd conformation to the exper-imental structure, not the experimental structureitself. For the crystallographic error (PDBERR)decoy set, discrimination based on ®ne detail isunlikely, as both the correct and incorrect confor-mations were re®ned using X-PLOR (BrunÈger,1992) or PROLSQ (Hendrickson et al., 1985).

To test whether such an artifact is responsiblefor accurate discrimination in the CASP1 decoy set,consisting of homology models and their corre-sponding experimental structures, we energy mini-mized two models together with the experimentalstructures using the DISCOVER CVFF force®eld(Hagler et al., 1974; Dauber-Osguthorpe et al.,1988). The discrimination ratios of the negative log

conditional probabilities between the unminimizedmodel and the unminimized experimental struc-ture are 0.82 and 0.87 for the two cases The corre-sponding ratios of the minimized model to theminimized experimental structure are 0.78 and0.83. Thus, surprisingly, discrimination is slightlyimproved by this procedure, rather than deteriorat-ing as would be the case if the signal were arisingfrom ®ne scale differences. While this is not anexhaustive test, it suggests that the signal that sep-arates correct and incorrect conformations is notdue to ®ne details in the experimental structures.

A similar test was performed for the SGPAdecoy set, in this case, minimizing all structureusing 1000 steps of steepest descent using theCHARMM force®eld (Brooks et al., 1983). The dis-crimination ratios between the simulation struc-tures and the unminimized experimental structureare 0.72 and 0.74 for the two cases. The corre-sponding ratios of the minimized simulation struc-tures to the minimized experimental structure are0.79 and 0.80.

In the MISFOLD decoy set, the main chains ofboth the correct and incorrect conformations arefrom structures determined using X-ray crystallo-graphy. However, the percentages of structurescorrectly discriminated using non-speci®c compact-ness, measured by the CDF (Figure 1a) indicatesthat the side-chain packing in the incorrect confor-mations does not generate as many atom±atomcontacts within 6.0 AÊ as would normally beobserved in an experimental conformation for thatsequence. So for these decoys the signal may bepartially due to ®ne differences between correctand incorrect conformations.

Limits on the discriminatory power

There are de®nite limits as to what the discrimi-natory functions can achieve. The weakest per-formance is for the IFU decoy set, where thepercentage discrimination by the RAPDF is 70%.The independent folding units in this set are shortfragments (between 10 and 20 residues) that areremoved from the context of the rest of the struc-ture. In some cases, the failure may be becausethese fragments do not represent bona ®de indepen-dent folding units. In others, the information avail-able in the fragments may not be adequate touniquely identify the correct conformation fromthe incorrect conformation, as the scores are evalu-ated on relatively small numbers of atom pairs.

Discrimination is not attained for one member ofthe LOOP decoy set (residues 265 to 269 in theimmunoglobulin A FAB fragment; PDB code 2fbj).This region is involved in several (>10) intermole-cular crystallographic contacts of less than 4.0 AÊ inthe experimental structure. However, the CDF(Figure 3) and other discriminatory functions(Braxenthaler et al., 1997) are able to discriminatewell.

In the CASP1 decoy set (Figure 2), the RAPDFwith one protein, nucleoside diphosphate kinase


(nm23), fails to discriminate between the correctand approximate conformations, and with anotherprotein (hpr) performs worse than RVPDF andNVPDF. In both these cases, the approximate con-formations are very similar (within 0.53 and 1.05 AÊ

Ca rmsd, respectively) to the experimental struc-ture. This suggests that the discriminatory functionmay be unable to consistently discriminate correctfrom incorrect when the conformations are close(around 1.0 AÊ Ca rmsd) to the experimental confor-mation. Present comparative modelling techniquesdo not approach this level of accuracy except atvery high sequence identity (Martin et al., 1997), someanwhile this is not a concern.

Effect of experimental accuracy

An alternative explanation for poor discrimi-nation with nm23 is the relatively low quality ofthe experimental structure (2.8 AÊ resolution withan R-factor of 0.25 (Webb et al., 1995)). The parentstructure used for the comparative modelling, PDBcode 1ndl, has a high sequence identity (77%), andhas been solved to a 2.4 AÊ resolution with an R-factor of 0.16 (Chiadmi et al., 1993). Thus the poordiscrimination of the RAPDF may simply re¯ectthe moderate resolution and incomplete re®nementof the target experimental structure relative to theparent structure.

Relevance of conditional probabilities to thenature of physical interactions observedin proteins

The discriminatory function was compiled usingstatistical observations, averaging over differentenvironments in protein structures. As a result, itdisplays some features not observed in a directway in a physics-based energy function, as shownin Figures 9, 10, 11, and 12.

Caution should be used when interpreting theconditional probability or potential of mean forcedata in physical terms because of the averaging ofenvironments that occurs during the compilationof the probabilities. To illustrate with an extremeexample, the minimum in the N±O plot (Figure 11)in the 0.0 to 3.0 AÊ bin averages over hydrogenbonds between N and O atoms in i, i � 4 residuesin a-helices and between N±O distances in i, i � 1(neighbouring) residues. It is thus dif®cult to ascer-tain exactly where the signal is coming from giventhe two different environments.

Nevertheless, many of these features seen arephysically signi®cant. Some are obvious: forexample, the largest minima in the plot of Ca±Cacontacts (Figure 9) re¯ects the geometrical con-straints imposed by the covalent structure of thepolypeptide chain, and the low score for contactsbetween proline nitrogen and other main-chainoxygen re¯ects the absence of the hydrogen atomin proline nitrogen (Figure 11).

Some of the features are less obvious: the smal-lest negative log conditional probabilities for nitro-

gen±oxygen contacts are in the 0.0 to 3.0 AÊ bin foraspartate and lysine, re¯ecting the tendency ofthese residues to form salt bridges in a-helices andb-sheets, resulting in main-chain hydrogen bondsbetween them (Figures 11 and 12).

Availability of the PDFs

The Tables containing the parameters for thePDFs, the decoy sets, and software to handle thedata are available from the PROSTAR Web site(Braxenthaler et al., 1997).

Acknowledgements

Thanks to Jan Pedersen, Brett Milash, MichaelBraxenthaler, and Rui Luo for valuable discussions. Wealso thank Lisa Holm and Chris Sander for making avail-able the co-ordinates for the MISFOLD decoys. Thiswork was supported in part by a Life TechnologiesFellowship to Ram Samudrala and NIH grant GM41034to John Moult.

References

An®nsen, C. (1973). Principles that govern the folding ofprotein chains. Science, 181, 223±230.

Avbelj, F. & Moult, J. (1995a). Determination of the con-formation of folding initiation sites in proteins bycomputer simulations. Proteins: Struct. Funct. Genet.23, 129±141.

Avbelj, F. & Moult, J. (1995b). Role of electrostaticscreening in determining protein main chain confor-mational preferences. Biochemistry, 34, 755±764.

Avbelj, F., Moult, J., Kitson, H., James, M. & Hagler, A.(1990). Molecular dynamics study of the structureand dynamics of a protein molecule in crystallineionic environment, Streptomyces griseus Protease A.Biochemistry, 29, 8658±8676.

Bahar, I. & Jernigan, R. (1997). Inter-residue potentials inglobular proteins: dominance of highly speci®chydrophilic interactions at close separation. J. Mol.Biol. 266, 185±214.

Bernstein, F., Koetzle, T., Williams, G., Meyer, E., Brice,M., Rodgers, J., Kennard, O., Shimanouchi, T. &Tsumi, M. (1977). The protein data bank: a compu-ter-based archival ®le for macromolecularstructures. J. Mol. Biol. 112, 535±542.

Beveridge, D. & DiCapua, F. (1989). Computer Simu-lations of Biomolecular Systems. Theoretical and Exper-imental Applications (Gunsteren, W. & Weiner, P.,eds), Escom, Leiden, The Netherlands.

Bowie, J., LuÈ thy, R. & Eisenberg, D. (1991). Method toidentify protein sequences that fold into knownthree-dimensional structure. Science, 253, 164±170.

Braxenthaler, M., Samudrala, R., Pedersen, J., Luo, R.,Milash, B. & Moult, J. (1997). PROSTAR: Theprotein potential test site. <http://prostar.carb.nist.-gov>.

Brooks, B., Bruccoleri, R., Olafson, B., States, D.,Swaminathan, S. & Karplus, M. (1983). CHARMM:A program for macromolecular energy, minimiz-ation, and dynamics calculations. J. Comp. Chem. 4,187±217.


BruÈ nger, A. (1992). X-PLOR Version 3.1: A System forX-ray Crystallography and NMR, Yale UniversityPress.

Bryant, S. & Lawrence, C. (1993). An empirical energyfunction for threading protein sequence through thefolding motif. Proteins: Struct. Funct. Genet. 16, 92±112.

Chiadmi, M., Morera, S., Lascu, I., Dumas, C., Le Bras,G., Veron, M. & Janin, J. (1993). Crystal structure ofthe Awd nucleotide diphosphate kinase fromDrosophila. Structure, 1, 283±293.

Dauber-Osguthorpe, P., Roberts, V., Osguthorpe, D.,Wolff, J., Genest, M. & Hagler, A. (1988). Structureand energetics of ligand binding to proteins: E.colidihydrofolate reductase-trimethoprime, a drugreceptor system. Proteins: Struct. Funct. Genet. 4, 31±47.

DeBolt, S. & Skolnick, J. (1996). Evaluation of atomiclevel mean force potentials via inverse folding andinverse re®nement of proteins structures: atomicburial position and pairwise non-bondedinteractions. Protein Eng. 9, 637±655.

Dill, K. (1990). Dominant forces in protein folding. Bio-chemistry, 29, 7133±7155.

Fidelis, K., Stern, P., Bacon, D. & Moult, J. (1994).Comparison of systematic search and databasemethods for constructing segments of proteinstructure. Protein Eng. 7, 953±960.

Fischer, D. (1997). UCLA-DOE Fold Recognition Server.<http://www.mbi.ucla.edu:-88/>.

Fischer, D. & Eisenberg, D. (1996). Fold recognitionusing sequence-derived predictions. Protein Sci. 5,947±955.

FloÈckner, H., Braxenthaler, M., Lackner, P., Jaritz, M.,Ortner, M. & Sippl, M. (1995). Progress in foldrecognition. Proteins: Struct. Funct. Genet. 23, 376±386.

Hagler, A., Huler, E. & Lifson, S. (1974). Energy func-tions for peptides and proteins. I. Derivation of aconsistent force ®eld including the hydrogen bondfrom amide crystals. J. Amer. Chem. Soc. 96, 5319±5335.

Halgren, T. (1995). Potential energy functions. Curr.Opin. Struct. Biol. 5, 205±210.

Head-Gordon, T. & Brooks, C. (1991). Virtual rigid bodydynamics. Biopolymers, 31, 77±100.

Hendrickson, W., Smith, J. & Sheriff, S. (1985). Directphase determination based on anomalousscattering. Methods Enzymol. 115, 41±55.

Holm, L. & Sander, C. (1992a). Evaluation of proteinmodels by atomic solvation preference. J. Mol. Biol.225, 93±105.

Holm, L. & Sander, C. (1992b). Fast and simple montecarlo algorithm for side chain optimization in pro-teins: application to model building by homology.Proteins: Struct. Funct. Genet. 14, 213±223.

Huang, E., Subbiah, S. & Levitt, M. (1995). Recognisingnative folds by the arrangement of hydrophobicand polar residues. J. Mol. Biol. 252, 709±720.

Jernigan, R. & Bahar, I. (1996). Structure-derived poten-tials and protein simulations. Curr. Opin. Struct.Biol. 6, 192±209.

Jones, D., Miller, R. & Thornton, J. (1995). Successfulprotein folding recognition by optimal sequencethreading validated by rigorous blind testing. Pro-teins: Struct. Funct. Genet. 23, 387±397.

Jones, D., Taylor, W. & Thornton, J. (1992). A newapproach to protein fold recognition. Nature, 258,86±89.

Jorgensen, W. & Tirado-Rives, J. (1988). The OPLSpotential function for proteins. Energy minimis-ations for crystals of cyclic peptides and crambin.J. Amer. Chem. Soc. 110, 1657±1666.

Lemer, C. M.-R., Rooman, M. & Wodak, S. (1995).Protein structure prediction by threading methods:evaluation of current techniques. Proteins: Struct.Funct. Genet. 23, 337±355.

Levitt, M. (1998). Competitive assessment of proteinfolding recognition and threading accuracy. Pro-teins: Struct. Funct. Genet. In the press.

LuÈ thy, R., Bowie, J. & Eisenberg, D. (1992). Assessmentof protein models with three-dimensional pro®les.Nature (London), 356, 83±85.

MacArthur, M., Laskowski, R. & Thornton, J. (1994).Knowledge-based validation of protein structurecoordinates derived by X-ray crystallography andNMR spectroscopy. Curr. Opin. Struct. Biol. 4, 731±737.

Madej, T., Gibrat, J. & Bryant, S. (1995). Threading adatabase of protein cores. Proteins: Struct. Funct.Genet. 23, 356±369.

Mark, A. & van Gunsteren, W. (1994). Decomposition ofthe free energy of a system in terms of speci®cinteractions. J. Mol. Biol. 240, 167±176.

Martin, A., MacArthur, M. & Thornton, J. (1998).Assessment of comparative modelling in CASP2.Proteins: Struct. Funct. Genet. In the press.

McCammon, J. & Harvey, S. (1987). Dynamics of Proteinsand Nucleic Acids, Cambridge University Press.

McQuarrie, D. (1976). Statistical Mechanics, Harper andRow, NY.

Mosimann, S., Meleshko, R. & James, M. (1995). A criti-cal assessment of comparative molecular modellingof tertiary structures in proteins. Proteins: Struct.Funct. Genet. 23, 301±317.

Mosteller, F., Rourke, R. & Thomas, G., Jr. (1970). Prob-ability with Statistical Applications, Addison-WesleyPublishing Company, Reading, MA.

Moult, J. (1997). Comparison of database potentials andmolecular mechanics force ®elds. Curr. Opin. Struct.Biol. 7, 194±199.

Moult, J. & James, M. N. G. (1986). An algorithm fordetermining the conformation of polypeptide seg-ments in proteins by systematic search. Proteins:Struct. Funct. Genet. 2, 146±163.

Orengo, C., Michie, A., Jones, S., Swindells, M., Jones,D. & Thornton, J. (1993). Protein structure classi®ca-tion. <http://www.biochem.ucl.ac.uk/bsm/cath/>.

Park, B. & Levitt, M. (1992). Energy functions that dis-criminate X-ray and near native folds from well-constructed decoys. J. Mol. Biol. 258, 367±392.

Pedersen, J. T. & Moult, J. (1997). Folding simulationwith genetic algorithms and a detailed moleculardescription. J. Mol. Biol. 269, 240±259.

Samudrala, R. & Moult, J. (1998). Handling context-sen-sitivity in protein structures using graph theory:bona ®de prediction. Proteins: Struct. Funct. Genet.In the press.

Simons, K., Kooperberg, C., Huang, E. & Baker, D.(1997). Assembly of protein tertiary structures fromfragments with similar local sequences using simu-lated annealing and bayesian scoring functions.J. Mol. Biol. 268, 209±225.

Sippl, M. (1990). Calculation of conformational ensem-bles from potentials of mean force. An approach tothe knowledge based prediction of local structuresin globular proteins. J. Mol. Biol. 213, 859±883.


Sippl, M. (1993). Recognition of errors in three-dimen-sional structures of proteins. Proteins: Struct. Funct.Genet. 17, 355±362.

Sippl, M. (1995). Knowledge based potentials forproteins. Curr. Opin. Struct. Biol. 5, 229±235.

Sippl, M., Ortner, M., Jaritz, M., Lackner, P. & FloÈckner,H. (1996). Helmholtz free energies at atom pairinteractions in proteins. Folding and Design, 1, 289±298.

Storch, E. & Daggett, V. (1995). Molecular dynamics ofcytochrome b5: implications for protein±proteinrecognition. Biochemistry, 34, 9682±9693.

Subramaniam, S., Tcheng, D. K. & Fenton, J. (1996). Aknowledge-based method for protein structurere®nement and prediction. In Proceedings of theFourth International Conference on Intelligent Systemsin Molecular Biology (States, D., Agarwal, P.,Gaasterland, T., Hunter, L. & Smith, R., eds),pp. 218±229, AAAI Press, Boston, MA.

Sun, S. (1993). Reduced representation model of proteinstructure prediction: statistical potential and geneticalgorithms. Protein Sci., 762±785.

Unger, R. & Moult, J. (1991). An analysis of protein fold-ing pathways. Biochemistry, 30, 3816±3823.

Wang, Y., Zhang, H. & Scott, R. (1995). Discriminatingcompact non-native structures from the nativestructure of globular proteins. Proc. Natl Acad. Sci.USA, 92, 709±713.

Webb, P., Perisic, O., Mendola, C., Backer, J. &Williams, R. (1995). The crystal structure of ahuman nucleoside diphosphate kinase, NM23-H2.J. Mol. Biol. 251, 574±587.

Weiner, S., Kollman, P., Nguyen, D. & Case, D. (1986).An all atom force ®eld for simulations of proteinsand nucleic acids. J. Comp. Chem. 7, 230±252.

Wodak, S. & Rooman, M. (1993). Generating and test-ing protein folds. Curr. Opin. Struct. Biol. 3, 247±259.

Edited by F. Cohen

(Received 10 July 1997; received in revised form 21 October 1997; accepted 21 October 1997)


Documents

An All-atom Distance-dependent Conditional Probability ...ram.org/compbio/publications/samudrala_1998a.pdf · An All-atom Distance-dependent Conditional Probability Discriminatory