
Int. J. Data Mining and Bioinformatics, Vol. 4, No. 3, 2010

Mining the Protein Data Bank with CReF to predict approximate 3-D structures of polypeptides

Márcio Dorn and Osmar Norberto de Souza*

Laboratório de Bioinformática, Modelagem e Simulação de Biossistemas – LABIO,
Programa de Pós-Graduação em Ciência da Computação,
Faculdade de Informática,
Pontifícia Universidade Católica do Rio Grande do Sul,
Avenida Ipiranga, 6681, Prédio 32 – Sala 602,
CEP 90619-900, Porto Alegre, RS, Brazil
Fax: +55 51 3320-3621
E-mail: marcio.d[email protected]
E-mail: [email protected]
*Corresponding author

Abstract: In this paper we describe CReF, a Central Residue Fragment-based method to predict approximate 3-D structures of polypeptides by mining the Protein Data Bank (PDB). The approximate predicted structures are good enough to be used as starting conformations in refinement procedures employing state-of-the-art molecular mechanics methods such as molecular dynamics simulations. CReF is very fast and we illustrate its efficacy in three case studies of polypeptides whose sizes vary from 34 to 70 amino acids. As indicated by the RMSD values, our initial results show that the predicted structures adopt the expected fold, similar to the experimental ones.

Keywords: CReF; 3-D protein structure prediction; ab initio; de novo; knowledge-based methods and data mining; central residue fragment-based method; PDB; protein data bank.

Reference to this paper should be made as follows: Dorn, M. and Norberto de Souza, O. (2010) ‘Mining the Protein Data Bank with CReF to predict approximate 3-D structures of polypeptides’, Int. J. Data Mining and Bioinformatics, Vol. 4, No. 3, pp.281–299.

Biographical notes: Márcio Dorn received a Master's degree in Computer Science from the Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Brazil, in 2008. He is currently a PhD student in the area of Computer Science and Applied Mathematics at the same university. His main research interests include bioinformatics, high performance computing and numerical and verified algorithms.

Osmar Norberto de Souza received his PhD in Science, with Emphasis in Computational Molecular Biophysics, from the University of London, London, England. He is a Senior Lecturer at the Faculties of Computer Science and Biosciences in the Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Brazil. His primary research interest is the development and application of computational methods to the sequence-structure-dynamics-function relationship in biological molecules. He is mostly interested in protein structure prediction, protein dynamics, and bioinformatics applied to neglected diseases.

Copyright © 2010 Inderscience Enterprises Ltd.

1 Introduction

A polypeptide or protein molecule is a covalent chain of amino acid residues that, under physiological conditions (also called the native environment), adopts a unique three-dimensional (3-D) structure. This native structure dictates the biochemical function of the protein (Baxevanis and Ouellette, 2005; Branden and Tooze, 1998). Experiments by Anfinsen et al. (1961) demonstrated that a protein molecule denatured by disruptive environmental conditions can refold to its native structure when the physiological conditions are restored. Therefore, the amino acid sequence contains all the information necessary to determine the native structure of the protein. On this basis, the native fold or structure of a protein can in principle be predicted computationally using only the physical-chemical information of its primary structure.

Protein folding (Creighton, 1990) and Protein Structure Prediction (PSP) (Tramontano, 2006) are two of the greatest questions in structural bioinformatics. They consist in understanding and predicting how the information coded in the linear primary structure (amino acid sequence) is translated into the 3-D structure of a protein. Although most existing in silico prediction methods aim for a precise structure from the outset, the prediction of an approximate 3-D structure using computational methods can provide valuable structural information on experimentally unsolved protein structures.

Many computational methodologies and algorithms have been proposed as a solution to the PSP problem (reviewed in Tramontano, 2006; Bujnicki, 2006; Moult, 2005; Osguthorpe, 2000). These methods are divided into two groups. The first, called ab initio or de novo methods (Rohl et al., 2004; Srinivasan and Rose, 1995), aims at predicting new protein folds. The second, named template-based methods, such as comparative homology modelling (Martí-Renom et al., 2000) and fold recognition via threading (Jones et al., 1992), is capable of making fast and effective predictions of protein 3-D structures if known template structures and fold libraries are available (Kolinski, 2004).

Both methodologies have limitations. Comparative homology modelling can only predict structures of protein sequences which are similar or nearly identical to other protein sequences of known structure. Fold recognition via threading is limited to the fold library derived from the PDB structures (Berman et al., 2000). Only ab initio or de novo predictions can obtain novel structures with new folds. However, the complexity and high dimensionality (Ngo et al., 1997) of the search space, even for a small protein molecule, still make the problem intractable (Levinthal, 1968), despite the current availability of high performance computing platforms.

As interest is focused on the prediction and discovery of new protein folds, academic efforts are concentrated on developing hybrid methods that combine the accuracy of homology modelling with the capacity of ab initio methods to predict novel folds. In order to reduce the complexity and the high dimensionality of the conformational search space inherent to ab initio methods, information about structural motifs found in known protein structures can be used to construct approximate conformations. Rosetta (now Robetta) (Rohl et al., 2004; Simons et al., 1999) and LINUS (Srinivasan and Rose, 1995, 2002) are two examples of ab initio or de novo fragment-based methods. Robetta, however, has been the most successful predictor, as revealed by the last CASP experiments (Moult et al., 2005).

In this paper, we present a new, simple and very fast method to build initial, approximate, native-like polypeptide conformations for a target amino acid sequence using information about conserved residues from PDB (Berman et al., 2000) templates. These approximate conformations are expected to be good enough to be further refined by means of Molecular Mechanics (MM) methods such as Molecular Dynamics (MD) simulations (van Gunsteren and Berendsen, 1990). In such a refinement step, global interactions between all atoms in the molecule are evaluated and deviations in the torsion angles can be corrected.

Section 2 briefly introduces the concepts of peptide, polypeptide or protein, and how we represent them in this work. In Section 3, we explain our approach to the PSP problem and detail the proposed method. Three case studies are conducted in Section 4. In Section 5, we analyse the results, demonstrating the suitability of our method for generating approximate initial 3-D conformations with good quality Root Mean-Square Deviations (RMSDs) and at very low computational cost. Finally, in Section 6, we discuss our method and results and highlight future work to improve its accuracy.

2 Polypeptide representation

A peptide is a molecule composed of two or more amino acid residues chained by a bond called the ‘peptide bond’. Larger peptides are generally referred to as polypeptides or proteins (Creighton, 1990). A peptide has three main chain torsion angles, namely phi, psi and omega (the triplet in Figure 1). The main chain torsion angles of a polypeptide determine its conformation or fold.

In this work we represent a protein conformation $C$ in the form of a vector $C = \{x_i, x_{i+1}, \ldots, x_p\}$, where $x_i$ is a triplet of torsion angles, φ (phi), ψ (psi) and ω (omega), for each amino acid residue in the primary structure of the protein (Figure 1). The set of consecutive triplets represents the internal rotations of a polypeptide main chain (Branden and Tooze, 1998). In the model peptide the bonds between N and Cα, and between Cα and C, are free to rotate. These rotations are described by the φ and ψ torsion angles, respectively. This freedom is mostly responsible for the conformation adopted by the main chain. The angle ω is either close to 0° (cis) or 180° (trans), with the latter value being the preferred one (Branden and Tooze, 1998). Thus, only the φ and ψ torsion angles, herein named the torsion angle duplet, are considered in our analysis of the main chain conformation (duplet in Figure 1).


Figure 1 Schematic representation of a model peptide illustrating a triplet and a duplet of main chain torsion angles. N is nitrogen, C and Cα are carbons and $R_i$ is an arbitrary side-chain

3 The proposed method

The fundamental goal of our method is to build approximate 3-D conformations for a target amino acid sequence using structural information from PDB templates. We split the target sequence into contiguous fragments and search for their homologues in the PDB (Berman et al., 2000). Only the central amino acid residue conformation (φ, ψ pair) of each template fragment is considered. This information is then used to build the approximate predicted models for the target sequence.

The rationale for this approach is that the conformation of this central residue is influenced by all other residues before and after it in the linear sequence and by its spatial neighbours. We do not use the entire fragment, in order to allow a more variable set of phi, psi torsion angles. These generic torsion angles might be more appropriate for predictions of novel polypeptide folds. This approach also avoids the need for methods to assemble fragments (Bujnicki, 2006). Our procedure differs from methods such as Rosetta, which instead assembles fragments of a range of sizes (Rohl et al., 2004).

3.1 The basic algorithm

The algorithm proposed in this work consists of seven steps, as follows:

1 The target sequence is fragmented.

2 The PDB databank is searched for fragment templates.

3 Pairs of phi, psi torsion angles are calculated for the central amino acid residue of each template fragment and 2-tuples are defined.

4 All torsion angle 2-tuples for each fragment are clustered.

5 The clusters are classified and labelled according to regions of theRamachandran plot.

6 A consensus secondary structure for the target sequence is predicted.

7 An approximate conformation is built.


These seven steps are detailed below:

1. Fragmenting the target sequence. In this step a target sequence $K$ is fragmented into many short contiguous fragments $s_i$ with $l$ (window size) amino acids each. Figure 2 shows a schematic representation of the fragmentation step and how we obtain all contiguous fragments for a target sequence $K$. A set $S$ of contiguous fragments, representing all possible fragments of length $l$, is created and represented as $S = \{s_i, s_{i+1}, \ldots, s_p\}$, where $s_i$ and $s_p$ are the first and the last fragment, respectively. If $n$ is the number of amino acids in the target sequence $K$ and $l$ an odd value for the size of each $s_i$, then the number $P$ of possible fragments obtained in the fragmentation step is given by $P = n - (l - 1)$. A fragment $s_i$ starts at the $i$th amino acid residue and terminates at the $j$th residue, and corresponds to a set of consecutive triplets of torsion angles $\{(\omega_{i-1}, \phi_i, \psi_i), \ldots, (\omega_{j-1}, \phi_j, \psi_j)\}$.

Figure 2 Schematic representation of the fragmentation of the target sequence $K$. The central amino acid residues and their phi, psi pairs of torsion angles are highlighted

The target sequence $K$ with $n$ amino acid residues is fragmented into all $P$ contiguous fragments of size $l$. This type of fragmentation has been used before by Zhang et al. (1989). Each fragment $s_i$ has an odd value for $l$ because we only consider the phi, psi torsion angles of the fragment's central amino acid.
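The fragmentation step can be sketched in a few lines of Python. This is a minimal illustration under our own naming (the function `fragment_sequence` is hypothetical and is not taken from the CReF code); it returns every contiguous window of odd length $l$ together with its central residue, so the number of windows equals $P = n - (l - 1)$. The example sequence is the 34-residue 1ZDD target as implied by the fragments listed in Table 2.

```python
def fragment_sequence(sequence, window=5):
    """Return all contiguous fragments of odd length `window`,
    each paired with its central residue."""
    if window % 2 == 0:
        raise ValueError("window size must be odd so that a central residue exists")
    if window > len(sequence):
        raise ValueError("window size exceeds sequence length")
    half = window // 2
    fragments = []
    # There are n - (l - 1) possible start positions.
    for start in range(len(sequence) - window + 1):
        frag = sequence[start:start + window]
        fragments.append((frag, frag[half]))
    return fragments


# 1ZDD target sequence (34 residues), consistent with the fragments in Table 2.
target = "FNMQCQRRFYEALHDPNLNEEQRNAKIKSIRDDC"
fragments = fragment_sequence(target, window=5)
print(len(fragments))   # 30, i.e. 34 - (5 - 1)
print(fragments[0])     # ('FNMQC', 'M')
```

With $l = 5$, the two residues at each terminus never occupy a central position, which is why they need separate treatment when the conformation is built.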

2. Searching the PDB for templates using BLASTp. Each of the fragments $s_i$ of size $l$ obtained above is used to search the PDB for templates, using the web version of BLASTp (Altschul et al., 1997) with settings for short, near-exact matches. Only hits with the same length as the query fragment sequence are considered for further analysis. We used our in-house version of the BioPython (Chapman and Chang, 2000) library to automate this step.

The final result of this step is a list of templates with their PDB accession codes (PDB ID) for each $s_i$. Only fragments that are identical or very similar over the full five-residue length are allowed. No gaps are allowed. Figure 3 illustrates this step with the target fragment $s_i$ = FNMQC used as an example. The first four PDB hits returned for FNMQC are: 1W0E (template sequence: FDMEC), 2AXE (template sequence: FNSQC), 1A14 (template sequence: FNLEC) and 1BS9 (template sequence: FNSQC).
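The hit-selection rule (same length as the query, no gaps, identical or very similar) could be expressed as a small filter applied to the parsed BLASTp output. The sketch below is only an illustration: it does not call BLAST, the function `filter_hits` and the `min_identity` threshold are hypothetical, and plain sequence identity is used as a stand-in for whatever similarity criterion the search actually applied. The last two entries in `hits` are made-up placeholders that show what gets rejected.

```python
def filter_hits(query, hits, min_identity=0.6):
    """Keep ungapped hits that span the full query length.

    `hits` is a list of (pdb_id, aligned_sequence) pairs.
    `min_identity` is an assumed threshold, not a value from the paper.
    """
    kept = []
    for pdb_id, aligned in hits:
        if len(aligned) != len(query) or "-" in aligned:
            continue  # reject partial or gapped alignments
        identity = sum(a == b for a, b in zip(query, aligned)) / len(query)
        if identity >= min_identity:
            kept.append((pdb_id, aligned, identity))
    return kept


hits = [
    ("1W0E", "FDMEC"),
    ("2AXE", "FNSQC"),
    ("1A14", "FNLEC"),
    ("1BS9", "FNSQC"),
    ("gapped_placeholder", "FN-QC"),  # hypothetical gapped hit
    ("short_placeholder", "FNMQ"),    # hypothetical partial hit
]
kept = filter_hits("FNMQC", hits)
for pdb_id, aligned, identity in kept:
    print(pdb_id, aligned, identity)
```

Only the four full-length, ungapped templates quoted in the text survive the filter; each shares three or four of the five residues with the query.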

3. Calculating pairs of phi, psi torsion angles and defining 2-tuples. For each target $s_i$ a set of template PDB files is obtained. Pairs of phi, psi torsion angles for every central amino acid of all templates (Figure 3) associated with a fragment $s_i$ are calculated using the program Torsions (kindly provided by Dr. Andrew C.R. Martin, UCL-London).

Figure 3 Example of template fragments obtained from the PDB for a target fragment and the pairs of phi, psi torsion angles for the central amino acid residues

Each pair of torsion angles is represented as one 2-tuple $t_i = (\phi, \psi)$. Now, each target $s_i$ can be represented as a set of 2-tuples $s_i = \{t_i, t_{i+1}, \ldots, t_p\}$, where $t_i$ and $t_p$ are the 2-tuples of the central amino acid of the first and the last template, respectively. Hence, we may represent the set $S$ of fragments $s_i$ and their template 2-tuples $t_i$ as $S = \{s_i = \{t_i, t_{i+1}, \ldots, t_p\}, s_{i+1} = \{t_i, t_{i+1}, \ldots, t_p\}, \ldots, s_p = \{t_i, t_{i+1}, \ldots, t_p\}\}$.
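For readers who want to see what such a 2-tuple involves numerically, the following sketch computes a torsion angle from four atomic positions and then a (φ, ψ) pair for one residue. It is not the Torsions program used in this work; the helper names are hypothetical. It relies only on the standard definitions: φ is the torsion C(i−1)–N(i)–Cα(i)–C(i) and ψ is the torsion N(i)–Cα(i)–C(i)–N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees, range (-180, 180], defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """Return the 2-tuple (phi, psi) for one residue from backbone coordinates."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)


# Simple geometric checks with idealised points (not real protein coordinates).
cis = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
trans = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
quarter = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1))
print(cis, trans, quarter)
```

In practice the coordinates would come from the template PDB file, taking the residue aligned with the centre of the query fragment.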

4. Clustering of torsion angle 2-tuples. All 2-tuples $t_i$ belonging to one fragment $s_i$ are clustered using the probabilistic Expectation Maximisation (EM) method (Witten and Frank, 2005). A $k_i$ represents a cluster of $t_i \in s_i$. EM considers the probability distribution of each individual cluster in order to identify which set of clusters is most favourable for a given set of data. It begins by clustering the 2-tuples $t_i$ with the k-means algorithm to obtain an initial solution. k-means minimises a quadratic error function $E$ (equation (1)), in which $f$ clusters $k_i$ are present, with $i = 1, 2, \ldots, f$, and $m(k_i)$ is the mean value of all $t_j \in k_i$; that is, $j$ denotes a 2-tuple belonging specifically to $k_i$.

$$E = \sum_{i=1}^{f} \sum_{t_j \in k_i} \left| t_j - m(k_i) \right|^2. \tag{1}$$

After determining the initial solution, the probability that a 2-tuple $t_i$ belongs to each of the $f$ clusters $k_i$ is calculated (Expectation). From these probabilities, the distribution parameters are calculated and the probabilities of distribution for each cluster are ‘maximised’. See Witten and Frank (2005) for details about the EM algorithm implementation. The mean value $m(k_i)$ of the 2-tuples $t_j$ of a cluster $k_i$, with $j = 1, 2, \ldots, n$, is obtained by equation (2).

$$m(k_i) = \frac{1}{n} \sum_{j=1}^{n} t_j. \tag{2}$$

After the identification of the $f$ clusters $k_i$ for $t_i \in s_i$, the average of all 2-tuples $t_j$ of $k_i$ is calculated. The average $m(k_i, \theta)$ of all θ angles of $k_i$ is calculated with equation (3), in which θ represents either of the two torsion angles (φ or ψ) and $n$ is the number of 2-tuples associated with cluster $k_i$.

$$m(k_i, \theta) = \frac{1}{n} \sum_{j=1}^{n} t_j[\theta]. \tag{3}$$

We define $f = 4$ as the number of clusters to be identified. This number was chosen because the (φ, ψ) pair of torsion angles is restricted to four major regions of conformations in the Ramachandran plot (Hovmöller et al., 2002; Ramachandran and Sasisekharan, 1968). At the end of the clustering step we end up with four clusters $k_i$, and each one of them has an associated 2-tuple $k_i = (m_\phi, m_\psi)$, so that $s_i = \{k_i = (m_\phi, m_\psi), k_{i+1} = (m_\phi, m_\psi), \ldots, k_f = (m_\phi, m_\psi)\}$, with $i = 1, 2, \ldots, f$, where $m_\phi$ and $m_\psi$ represent the averages of φ and ψ of the $i$th cluster $k_i$. The WEKA (Witten and Frank, 2005) data mining package was used for clustering.

5. Classifying and labelling the clusters. From the $m(k_i, \phi)$ and $m(k_i, \psi)$ values we create an identification label for each cluster $k_i$. These labels map the four identified clusters to three conformational regions in the Ramachandran plot. All $k_i \in S$ are labelled.

We have built a library that represents the most favourable regions of the Ramachandran plot. This library is based on the works of Thornton and collaborators (Laskowski et al., 1993; Morris et al., 1992), which divide the Ramachandran plot into 11 preferred regions. However, for simplification, we combined these regions into a 3-states Secondary Structure (SS) model: α-helix (h), β-sheet (b) and coil (c) (Table 1).

Table 1 The 3-states Secondary Structure (SS) model used in this work and its relationship to the definition of Thornton and collaborators (Laskowski et al., 1993; Morris et al., 1992)

3-states Secondary Structure    11 regions
h  α-helix                      A, a
b  β-sheet                      B, b
c  Coil                         ~a, ~b, L, l, ~l, p, ~p

The average (φ, ψ) pair of each $k_i$ is submitted to a mapping function based on the library developed. The function assigns a label to each cluster $k_i$. Labelled clusters have the form $k_i : \mathit{rot}$, where $\mathit{rot}$ can be either h, b or c. Each $s_i \in S$ is represented as $s_i = \{k_1 : \mathit{rot}, k_2 : \mathit{rot}, k_3 : \mathit{rot}, k_4 : \mathit{rot}\}$. After labelling, the clusters $k_i$ are ordered according to the number of 2-tuples $t_j$ they contain: $s_i = \{k_i : \mathit{rot} > k_{i+1} : \mathit{rot} > \ldots > k_j : \mathit{rot}\}$.
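A mapping function of this kind, followed by the ordering by cluster size, might look like the sketch below. The rectangular boundaries in `label_region` are purely illustrative assumptions chosen to make the example runnable; they are not the boundaries of the library described above, which follows the 11-region definition of Thornton and collaborators. All function and key names are hypothetical.

```python
def label_region(phi, psi):
    """Map an average (phi, psi) pair, in degrees, to 'h', 'b' or 'c'.

    The boxes below are coarse, assumed boundaries for illustration only.
    """
    if -160.0 <= phi <= -20.0 and -120.0 <= psi <= 50.0:
        return "h"  # broad alpha-helical region
    if -180.0 <= phi <= -45.0 and (psi >= 90.0 or psi <= -150.0):
        return "b"  # broad beta-sheet region
    return "c"      # everything else is treated as coil


def label_and_sort(clusters):
    """Label each cluster and order clusters by decreasing number of 2-tuples.

    `clusters` is a list of dicts with keys 'mean' (m_phi, m_psi) and 'size'.
    """
    labelled = [dict(c, label=label_region(*c["mean"])) for c in clusters]
    return sorted(labelled, key=lambda c: c["size"], reverse=True)


clusters = [
    {"mean": (-120.0, 130.0), "size": 6},
    {"mean": (-63.0, -42.0), "size": 15},
    {"mean": (60.0, 45.0), "size": 2},
]
ordered = label_and_sort(clusters)
print([(c["label"], c["size"]) for c in ordered])
```

Keeping the label and the size together makes the later selection step a simple scan through the ordered list.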

6. Predicting the secondary structure. To assist in the building of the approximate conformation, we predict the secondary structure of the target amino acid sequence $K$. We use the consensus prediction by the DSC (King and Sternberg, 1996), PHD (Rost and Sander, 1993), and PREDATOR (Frishman and Frishman, 1996) methods at the NPS@ Consensus Secondary Structure Prediction server (Combet et al., 2000).


7. Building the approximate 3-D conformation. The central amino acid of a fragment $s_i$ corresponds to the $(i + 2)$th residue in sequence $K$. We select the cluster $k_i$ of $s_i$ with the largest number of 2-tuples for which the label matches the SS conformational state (h, c, or b) identified in the consensus SS prediction for that residue. The mean value $m(k_i, \theta)$, for θ equal to φ or ψ, for each fragment $s_i$, is saved to build an approximate conformation of the target sequence (Figure 4).

Figure 4 Schematic representation of the procedure to build the approximate 3-D conformation. A cluster $k_i$ of a fragment $s_i$ with the largest number of 2-tuples for which the label matches one of the 3-states SS model is selected. Then a φ, ψ pair of $k_i$ is used to represent the torsion angles of the $(i + 2)$th residue in sequence $K$. The first and last two residues at the N- and C-termini have their pairs of torsion angles fixed at 180°

Fragmentation with $l = 5$ results in the loss of information for the first and last two amino acid residues. The pairs of torsion angles for each of these four residues are fixed at 180°. Then the approximate 3-D conformation is built using the teLeap module of AMBER7 (Case et al., 2005). All ω torsion angles are set to 180° (Branden and Tooze, 1998). If no cluster matching one of the 3-states SS model identified in the consensus SS prediction is found for a residue $i$, then we calculate the mean value for residues $i - 1$ and $i + 1$ and apply it to residue $i$.
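The selection and fallback rules of Step 7 can be summarised in a short sketch. It assumes that each fragment's clusters are already labelled and ordered by decreasing size, and that the consensus prediction is available as a string of h/b/c labels, one per residue. The function name and data layout are hypothetical. One detail is our own assumption: when several consecutive residues have no matching cluster, the right-hand neighbour is taken to be the nearest residue to the right that already has angles.

```python
TERMINAL = (180.0, 180.0)  # phi, psi fixed for the terminal residues


def build_torsions(n_residues, fragment_clusters, predicted_ss, window=5):
    """Return one (phi, psi) pair per residue.

    fragment_clusters[i] lists the clusters of fragment i as
    (label, m_phi, m_psi), ordered by decreasing number of 2-tuples.
    """
    half = window // 2
    torsions = [None] * n_residues
    for i in range(half):
        torsions[i] = TERMINAL
        torsions[n_residues - 1 - i] = TERMINAL
    # Largest cluster whose label matches the predicted state of the central residue.
    for i, clusters in enumerate(fragment_clusters):
        residue = i + half
        for label, m_phi, m_psi in clusters:
            if label == predicted_ss[residue]:
                torsions[residue] = (m_phi, m_psi)
                break
    # Fallback: average the neighbouring residues.
    for r in range(half, n_residues - half):
        if torsions[r] is None:
            left = torsions[r - 1]
            nxt = r + 1
            while torsions[nxt] is None:  # assumption: skip unresolved neighbours
                nxt += 1
            right = torsions[nxt]
            torsions[r] = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)
    return torsions


# Toy 7-residue example: three fragments, central residues at indices 2, 3, 4.
fragment_clusters = [
    [("h", -60.0, -40.0), ("b", -120.0, 130.0)],
    [("b", -120.0, 130.0)],
    [("b", -110.0, 120.0), ("h", -70.0, -50.0)],
]
torsions = build_torsions(7, fragment_clusters, "cchhhcc")
print(torsions)
```

In the toy example the middle residue has no helical cluster, so it receives the average of its two neighbours, (−65.0, −45.0). The resulting list of (φ, ψ) pairs, together with ω = 180°, is what a model-building tool would need to generate coordinates.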

4 Experiments and initial results

We used as a test set for our three case studies the amino acid sequences of three polypeptides with PDB IDs 1ZDD (Starovasnick et al., 1997), 1ROP (Banner et al., 1987), and 1UTG (Morize et al., 1987). As our objective is to predict novel folds, we only consider templates (Step 2) which have no evolutionary relationship with the target sequence $K$. Thus, all PDB templates identical or closely related (≥50% identity) to the test sequences, over their full lengths, were removed.

4.1 Case study 1

In this case we tested CReF with a disulfide-stabilised mini-protein (PDB ID: 1ZDD) (Starovasnick et al., 1997) composed of 34 amino acids known to be arranged as two α-helices connected by a turn (Figure 5(a)), a structural motif known as an α-helical hairpin (Murzin et al., 1995). The target sequence K = {FNMQCQRRFYEALHDPNLNEEQRNAKIKSIRDDC}, called 1ZDD−P, is fragmented into 30 short contiguous pentapeptide fragments $s_i$ ($l = 5$). For each fragment $s_i$ we searched for PDB (Berman et al., 2000) templates using BLASTp (Altschul et al., 1997). As explained above, we removed all PDB entries whose sequences were similar or identical to 1ZDD, namely: 1ZDC, 1ZDD, 1L6X, 1OQO, 1OQX, 1ZDA, 1ZDB, 2SPZ, 1LP1, 1Q2N, 1FC2, 1BDC, 1BDD, 1SS1, 1DEE, 1EDK, 1EDJ, 1EDI, 1EDL. This should eliminate any bias due to sequences of known structures very closely related to 1ZDD. Using the PDB templates, phi, psi pairs were calculated as described in Section 3. Table 2 summarises these results.

Figure 5 Ribbon representation of the backbone of: (a) experimental 1ZDD and (b) predicted 1ZDD−P

Table 2 shows the central amino acid residue (column 2) of each of the 30 fragments $s_i$ (column 1), the number of templates found for each fragment $s_i$ (column 3), and their conformational preferences (columns 4–6) according to the 3-states SS model.

The central residues' conformational preferences in Table 2 already indicate the types of secondary structures the target sequence will adopt. This observation is relevant because it reiterates that a consensus secondary structure (Figure 6) for the target sequence is only calculated to guide CReF to the approximate 3-D conformation. Using these data and the methods described in Steps 4–7 of Section 3.1, we built an initial 3-D conformation for the predicted 1ZDD−P mini-protein (Figure 5(b)) with a Cα RMSD of 3.4 Å with respect to the experimental 1ZDD NMR structure (Starovasnick et al., 1997).

4.2 Case study 2

Here CReF was applied to the prediction of a 56 amino acid polypeptide also known to be arranged as an α-helical hairpin (PDB ID: 1ROP) (Banner et al., 1987) (Figure 7(a)). The target sequence K = {MTKQEKTALNMARFIRSQTLTLLEKLNELDADEQADICESLHDHADELYRSCLARF} is fragmented into 52 short contiguous target fragments with $l = 5$.


Table 2 Classification of all templates for 1ZDD−P in the 3-states SS model adopted in this work

Fragment   Central residue   Number of templates   Helix (%)   Sheet (%)   Coil (%)
FNMQC      M                 21                    71.43       28.57       00.00
NMQCQ      Q                 19                    63.16       31.58       05.26
MQCQR      C                 21                    85.71       14.29       00.00
QCQRR      Q                 21                    85.71       09.52       04.77
CQRRF      R                 23                    69.57       26.08       04.35
QRRFY      R                 16                    87.50       06.25       06.26
RRFYE      F                 23                    95.65       04.35       00.00
RFYEA      Y                 17                    58.82       41.18       00.00
FYEAL      E                 18                    94.44       05.56       00.00
YEALH      A                 13                    84.62       07.70       07.68
EALHD      L                 51                    70.59       27.45       01.96
ALHDP      H                 39                    61.53       38.47       00.00
LHDPN      D                 22                    09.09       81.82       09.09
HDPNL      P                 10                    50.00       30.00       20.00
DPNLN      N                 11                    90.91       09.09       00.00
PNLNE      L                 06                    33.33       66.67       00.00
NLNEE      N                 27                    29.63       59.26       11.11
LNEEQ      E                 46                    93.48       04.35       02.17
NEEQR      E                 10                    100.0       00.00       00.00
EEQRN      Q                 18                    77.78       22.22       00.00
EQRNA      R                 37                    72.98       18.92       08.10
QRNAK      N                 04                    75.00       25.00       00.00
RNAKI      A                 11                    00.00       100.0       00.00
NAKIK      K                 13                    76.93       23.07       00.00
AKIKS      I                 12                    33.33       66.67       00.00
KIKSI      K                 20                    85.00       15.00       00.00
IKSIR      S                 48                    58.33       33.33       08.34
KSIRD      I                 10                    90.00       10.00       00.00
SIRDD      R                 08                    75.00       12.50       12.50
IRDDC      D                 22                    54.54       40.90       04.56
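The percentage columns of such a table follow directly from the 3-states label of each template's central residue. The short sketch below shows the arithmetic; the function name is hypothetical, and the counts of 15 helical and 6 sheet templates are our inference from the first row (21 templates, 71.43% helix, 28.57% sheet), not values reported separately in the paper.

```python
from collections import Counter


def ss_preferences(labels):
    """Percentage of central-residue conformations per 3-states label."""
    counts = Counter(labels)
    total = len(labels)
    return {state: round(100.0 * counts.get(state, 0) / total, 2)
            for state in ("h", "b", "c")}


# Fragment FNMQC: 21 templates, inferred as 15 helix and 6 sheet.
labels = ["h"] * 15 + ["b"] * 6
preferences = ss_preferences(labels)
print(preferences)   # {'h': 71.43, 'b': 28.57, 'c': 0.0}
```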

Figure 6 Consensus secondary structure prediction of 1ZDD−P

PDB template search and redundancy elimination were performed as described above. Thus, 1ROP, 1B6Q, 1GMG, 1RPR, 2GHY, 1RPO, 1NKD, 1YO7, 1QX8, 1F4M, and 1F4N were removed from the list of templates. Pairs of phi, psi torsion angles were calculated as described in Section 3. Table 4 summarises these results.


Again, the consensus SS (Figure 8) was only calculated to guide CReF to the approximate 3-D conformation of 1ROP−P. The approximate 3-D conformation (Figure 7(b)) of 1ROP−P has a Cα RMSD of 7.1 Å with respect to the experimental 1ROP structure.

Figure 7 Ribbon representation of the backbone of: (a) experimental 1ROP and (b) predicted 1ROP−P

Figure 8 Consensus secondary structure prediction of 1ROP−P

4.3 Case study 3

A 70 amino acid long protein (PDB ID: 1UTG) (Morize et al., 1987) (Figure 9(a)), K = {GICPRFAHVIENLLLGTPSSYETSLKEFEPDDTMKDAGMQMKKVDSLPQTTRENIMKTEKIKPLCM}, is fragmented into 66 short contiguous pentapeptide fragments. PDB template search and redundancy elimination were performed as described above. 1UTG, 2UTG, and 1UTR were removed from the list of templates. Pairs of phi, psi torsion angles were calculated as described in Section 3.

For space reasons we do not show the table with the 3-states SS preferences for all central amino acid residues of 1UTG−P. Guided by the consensus SS prediction of 1UTG−P (Figure 10), we built an approximate conformation for the predicted 1UTG−P (Figure 9(b)), with a Cα RMSD of 11.7 Å with respect to the experimental 1UTG structure (Morize et al., 1987).


Figure 9 Ribbon representation of the backbone of: (a) experimental 1UTG and (b) predicted 1UTG−P

Figure 10 Consensus secondary structure prediction of 1UTG−P

5 Analysis of the initial results

In this section we analyse the preliminary results obtained by CReF using three polypeptides as test cases (Section 4). We ran PROCHECK (Laskowski et al., 1993) for the 1ZDD−P, 1ROP−P, and 1UTG−P approximate 3-D conformations. Figures 11–13 show Ramachandran plots for both the experimental (a) and predicted (b) approximate conformations.

Figure 11 Ramachandran plot for the: (a) experimental 1ZDD and (b) predicted 1ZDD−P conformation. Phi, psi angles are in degrees


Figure 12 Ramachandran plot for the: (a) experimental 1ROP and (b) predicted 1ROP−P conformation. Phi, psi angles are in degrees

Figure 13 Ramachandran plot for the: (a) experimental 1UTG and (b) predicted 1UTG−P conformation. Phi, psi angles are in degrees

1ZDD−P, 1ROP−P, and 1UTG−P have, respectively, 95.7%, 87.2%, and 87.3% of their residues found in the most favourable regions of the maps. Clearly, the secondary structure is well formed, and visual inspection (Figures 5, 7 and 9) shows that the folds of the predicted conformations are correct, despite the large overall Cα RMSD observed, particularly for 1UTG−P. The percentage of occupied preferred regions decreases with increasing complexity of the test polypeptides. This result is somewhat expected, given that 1UTG−P has a more complex folding pattern than the other test cases. Nonetheless, when we compare segments of the predicted approximate 3-D conformations for each of the test cases against their experimental counterparts, we obtain improved values for the RMSDs, as reported in Table 3.

The RMSDs of the segments indicate that the individual helices are well formed. The RMSD of 3.5 Å observed for the C-terminus residues of 1UTG−P (32 to 65), last line of Table 3, is practically identical to the RMSD of 3.4 Å for 1ZDD−P over its entire length (Section 4.1). This shows that the major problem in our method is its inability to properly bring the secondary structure elements together to form a complete tertiary fold. This is likely due to problems in sampling the phi, psi pairs belonging to the turn regions, which we expect to adjust in the future using polypeptide structure refinement protocols.

Table 3 Cα RMSDs of the predicted approximate conformations with respect to their experimental structures

Polypeptide   Amino acids interval   RMSD (Å)
1ZDD−P        3–14                   0.6
1ZDD−P        21–32                  0.5
1ROP−P        3–28                   0.7
1ROP−P        31–54                  1.4
1UTG−P        4–14                   0.4
1UTG−P        20–27                  0.6
1UTG−P        32–47                  1.0
1UTG−P        50–65                  0.8
1UTG−P        32–65                  3.5

Table 4 Classification of all templates for 1ROP−P in the 3-states SS model adopted in this work. It shows the central amino acid residue (column 2) of each of the 56 si fragments (column 1), the number of templates found for each si fragment in the PDB (column 3), and their conformational preferences (columns 4–6) according to the 3-states SS model

Fragment   Central residue   Number of templates   Helix (%)   Sheet (%)   Coil (%)
MTKQE      K                 58                    70.69       27.59       01.72
TKQEK      Q                 75                    62.67       36.00       01.33
KQEKT      E                 57                    59.65       33.33       07.02
QEKTA      K                 69                    81.16       15.94       02.90
EKTAL      T                 41                    58.54       41.46       00.00
KTALN      A                 59                    86.44       10.17       03.39
TALNM      L                 55                    81.82       18.18       00.00
ALNMA      N                 62                    77.42       17.74       04.84
LNMAR      M                 71                    85.92       09.86       04.23
NMARF      A                 72                    55.56       38.89       05.56
MARFI      R                 77                    90.91       09.09       00.00
ARFIR      F                 84                    60.71       36.90       02.38
RFIRS      I                 83                    62.65       37.35       00.00
FIRSQ      R                 76                    71.05       28.95       00.00
IRSQT      S                 53                    64.15       32.08       03.77
RSQTL      Q                 82                    54.88       43.90       01.22
SQTLT      T                 84                    58.33       41.67       00.00
QTLTL      L                 80                    27.50       67.50       05.00
TLTLL      T                 80                    38.75       60.00       01.25
LTLLE      L                 78                    89.74       10.26       00.00
TLLEK      L                 82                    79.27       18.29       02.44
LLEKL      E                 79                    96.20       03.80       00.00
LEKLN      K                 84                    51.19       08.33       40.48
EKLNE      L                 86                    55.81       41.86       02.33
KLNEL      N                 81                    90.12       09.88       00.00
LNELD      E                 77                    85.71       12.99       01.30
NELDA      L                 83                    65.06       28.92       06.02
ELDAD      D                 82                    50.00       43.90       06.10
LDADE      A                 79                    56.96       36.71       06.33
DADEQ      D                 73                    57.53       39.73       02.74
ADEQA      E                 76                    78.95       17.11       03.95
DEQAD      Q                 78                    79.49       20.51       00.00
EQADI      A                 87                    45.98       50.57       03.45
QADIC      D                 86                    73.26       20.93       05.81
ADICE      I                 85                    62.35       37.65       00.00
DICES      C                 90                    47.78       48.89       03.33
ICESL      E                 88                    59.09       37.50       03.41
CESLH      S                 84                    76.19       23.81       00.00
ESLHD      L                 82                    71.95       26.83       01.22
SLHDH      H                 87                    80.46       16.09       03.45
LHDHA      D                 80                    70.00       12.50       17.50
HDHAD      H                 82                    57.32       37.80       04.88
DHADE      A                 83                    51.81       34.94       13.25
HADEL      D                 84                    84.52       10.71       04.76
ADELY      E                 54                    81.48       16.67       01.85
DELYR      L                 73                    84.93       15.07       00.00
ELYRS      Y                 88                    67.05       31.82       01.14
LYRSC      R                 89                    67.42       31.46       01.12
YRSCL      S                 46                    69.57       28.26       02.17
RSCLA      C                 82                    71.95       25.61       02.44
SCLAR      L                 89                    55.06       42.70       02.25
CLARF      A                 85                    50.59       48.24       01.18
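The layout of Table 4 (overlapping five-residue fragments, the central residue of each, and the share of templates per conformational state) can be reproduced with a minimal Python sketch. The sequence below is the one spelled out by the overlapping fragments in the table. The function names, the one-letter state labels `H`, `E` and `C`, and the template counts used in the example are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter

# Sequence of 1ROP-P as spelled out by the overlapping fragments in Table 4.
SEQUENCE = "MTKQEKTALNMARFIRSQTLTLLEKLNELDADEQADICESLHDHADELYRSCLARF"


def fragments(sequence, size=5):
    """Return (fragment, central_residue) pairs for every window of odd length `size`."""
    half = size // 2
    return [(sequence[i:i + size], sequence[i + half])
            for i in range(len(sequence) - size + 1)]


def state_percentages(states):
    """Percentage of templates in each state: 'H' (helix), 'E' (sheet), 'C' (coil)."""
    counts = Counter(states)
    total = len(states)
    return {s: round(100.0 * counts[s] / total, 2) for s in "HEC"}


frags = fragments(SEQUENCE)
print(frags[0], frags[-1], len(frags))  # ('MTKQE', 'K') ('CLARF', 'A') 52

# Illustrative labels only: 41 helix, 16 sheet and 1 coil template out of 58.
labels = ["H"] * 41 + ["E"] * 16 + ["C"] * 1
print(state_percentages(labels))  # {'H': 70.69, 'E': 27.59, 'C': 1.72}
```

The illustrative counts were chosen so that the resulting percentages coincide with those listed for the first fragment, MTKQE. The sliding window yields one fragment per table row.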

As we stated in the Introduction, our objective is to obtain an approximate 3-D conformation for polypeptide sequences of unknown structure which, in turn, can be refined by state-of-the-art MM methods such as MD simulations. These approximate 3-D conformations can be used as starting conformations in ab initio methods, hence considerably reducing the conformational space to be searched. Undoubtedly, Rosetta (Rohl et al., 2004; Simons et al., 1999), LINUS (Srinivasan and Rose, 1995, 2002), and other ab initio or de novo methods (Bujnicki, 2006) are predicting polypeptide structures with higher precision, in terms of RMSD, than we are doing now. However, we emphasise that CReF is designed to predict an approximate 3-D conformation in a very fast manner. For instance, after retrieving the PDB templates, CReF takes approximately 120 s to build the approximate conformation. CPU timings for the performance of other methods are not known.

All programs were implemented in the Python language and executed in a Linux environment on a Pentium IV PC (2.4 GHz, 1 GB RAM). Structure illustrations were prepared with PyMOL (DeLano, 2002).

6 Conclusions and further work

We introduced CReF, a central-residue-fragment-based approach for predicting approximate 3-D polypeptide structures. In contrast to other PSP methods (Rohl et al., 2004; Simons et al., 1999) we do not use entire fragments, but only the phi, psi torsion angle information of the central residue in the fragment. The predicted consensus secondary structure is used to guide the construction of the approximate 3-D conformation. CReF does not utilise techniques for fragment assembly and optimisation, which makes it a very fast method. The three case studies showed correct prediction of secondary structures and the overall fold. For larger proteins such as 1UTG−P the approximate 3-D conformation is less accurate. Complete packing of the secondary structures into supersecondary (1ZDD−P and 1ROP−P) and tertiary (1UTG−P) structures could not be achieved at this stage of CReF development. This is likely to be due to the poorer definition of coiled regions such as turns and loops. However, we expect these deviations to be well adjusted or corrected by properly designed protocols of refinement by MM methods, particularly MD simulations. This would in turn reduce the total time that ab initio methods, which usually start from a fully extended conformation (Breda et al., 2007) of a polypeptide, would take to fold a sequence of unknown structure.

We see the main contributions of this work in:

• proposing an approach for the generation of an approximate 3-D conformation of polypeptides

• using the phi, psi torsion angle information of the central residue in contiguous fragments of a target sequence

• clustering the torsion angles according to a 3-states secondary structure model.
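The idea summarised in these three points can be sketched in Python. This is a minimal illustration under stated assumptions, not the CReF implementation: the Ramachandran region boundaries in `classify`, the use of a circular mean to represent a cluster, the fallback to all templates, and every function name are hypothetical.

```python
import math


def classify(phi, psi):
    """Assign a (phi, psi) pair in degrees to 'H', 'E' or 'C'.

    The boundaries are illustrative assumptions, not the paper's definition.
    """
    if -160.0 <= phi <= -20.0 and -120.0 <= psi <= 50.0:
        return "H"  # helical region
    if -180.0 <= phi <= -45.0 and (psi >= 90.0 or psi <= -150.0):
        return "E"  # extended (sheet) region
    return "C"      # everything else is treated as coil


def circular_mean(angles_deg):
    """Mean of angles in degrees that respects the -180/180 wrap-around."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))


def pick_torsions(templates, predicted_state):
    """Choose (phi, psi) for a central residue from its template torsion angles.

    Only templates whose state matches the predicted secondary structure are
    used; if none match, all templates are used (an assumed fallback).
    """
    cluster = [(phi, psi) for phi, psi in templates
               if classify(phi, psi) == predicted_state]
    if not cluster:
        cluster = list(templates)
    return (circular_mean([phi for phi, _ in cluster]),
            circular_mean([psi for _, psi in cluster]))


# Toy template angles (not from the PDB): two helical pairs and one extended pair.
templates = [(-60.0, -45.0), (-64.0, -41.0), (-120.0, 130.0)]
phi, psi = pick_torsions(templates, "H")
print(round(phi, 1), round(psi, 1))  # -62.0 -43.0
```

A circular mean is used here only because torsion angles wrap around at ±180°. Other representatives of a cluster, such as its most populated bin, would fit the same outline.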

There is ample room to further the development of CReF:

• test the effect of using different clusters to generate the initial predicted conformation (this will be particularly relevant for proteins of real interest, for these usually have a mean size of about 200–250 residues)

• test the method with other classes of proteins

• develop MD simulation protocols for the refinement of the approximate 3-D conformations obtained by CReF.

Acknowledgements

This project was supported by grant 410505/2006-4 from CNPq to Osmar Norberto de Souza. ONS is a CNPq Research Fellow. Márcio Dorn is supported by a CNPq MSc scholarship.

References

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) ‘Gapped BLAST and PSI-BLAST: a new generation of protein database search programs’, Nucleic Acids Research, Vol. 25, No. 17, pp.3389–3402.

Anfinsen, C.B., Haber, E., Sela, M. and White Jr., F.H. (1961) ‘The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain’, The Proceedings of the National Academy of Sciences USA, Vol. 47, pp.1309–1314.

Banner, D.W., Kokkinidis, M. and Tsernoglou, D. (1987) ‘Structure of the ColE1 rop protein at 1.7 Å resolution’, Journal of Molecular Biology, Vol. 196, No. 3, pp.657–675.

Baxevanis, A.D. and Ouellette, B.F.F. (2005) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 3rd ed., John Wiley and Sons, Inc., New Jersey, USA.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) ‘The protein data bank’, Nucleic Acids Research, Vol. 28, No. 1, pp.235–242.

Branden, C. and Tooze, J. (1998) Introduction to Protein Structure, 2nd ed., Garland Publishing Inc., New York, USA.

Breda, A., Santos, D.S., Basso, L.A. and de Souza, O.N. (2007) ‘Ab initio 3-D structure prediction of an artificially designed three-α-helix bundle via all-atom molecular dynamics simulations’, Genetics and Molecular Research, Vol. 6, No. 4, pp.901–910.

Bujnicki, J.M. (2006) ‘Protein structure prediction by recombination of fragments’, ChemBioChem Journal, Vol. 7, No. 1, pp.19–27.

Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M. and Onufriev, A. (2005) ‘The AMBER biomolecular simulation program’, Journal of Computational Chemistry, Vol. 26, No. 16, pp.1668–1688.

Chapman, B. and Chang, J. (2000) ‘Biopython: python tools for computational biology’, ACM SIGBIO Newsletter, Vol. 20, No. 2, pp.15–19.

Combet, C., Blanchet, C., Geourjon, C. and Deleage, G. (2000) ‘NPS@: network protein sequence analysis’, Trends in Biochemical Sciences, Vol. 25, No. 3, pp.147–150.

Creighton, T.E. (1990) ‘Protein folding’, Biochemical Journal, Vol. 270, pp.1–16.

DeLano, W.L. (2002) The PyMOL Molecular Graphics System, DeLano Scientific, San Carlos, CA, USA.

Frishman, D. and Argos, P. (1996) ‘Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence’, Protein Engineering, Vol. 9, No. 2, pp.133–142.

Hovmöller, T.Z., Zhou, T.Z. and Ohlson, T. (2002) ‘Conformation of amino acids in protein’, Acta Crystallographica Section D: Biological Crystallography, Vol. 58, No. 5, pp.768–776.

Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992) ‘A new approach to protein fold recognition’, Nature, Vol. 358, No. 6381, pp.86–89.

King, R.D. and Sternberg, M.J. (1996) ‘Identification and application of the concepts important for accurate and reliable protein secondary structure prediction’, Protein Science, Vol. 5, No. 11, pp.2298–2310.

Kolinski, A. (2004) ‘Protein modelling and structure prediction with a reduced representation’, Acta Biochimica Polonica, Vol. 51, pp.349–371.

Laskowski, R.A., MacArthur, M.W., Moss, D.S. and Thornton, J.M. (1993) ‘PROCHECK: a program to check the stereochemical quality of protein structures’, Journal of Applied Crystallography, Vol. 26, No. 2, pp.283–291.

Levinthal, C. (1968) ‘Are there pathways for protein folding?’, Journal de Chimie Physique, Vol. 65, pp.44–45.

Martí-Renom, M.A., Stuart, A., Fiser, A., Sanchez, R., Melo, F. and Sali, A. (2000) ‘Comparative protein structure modelling of genes and genomes’, Annual Review of Biophysics and Biomolecular Structure, Vol. 29, No. 16, pp.291–325.

Morris, A.L., MacArthur, M.W., Hutchinson, E.G. and Thornton, J.M. (1992) ‘Stereochemical quality of protein structure coordinates’, Proteins, Vol. 12, No. 4, pp.345–364.

Moult, J., Fidelis, K., Rost, B., Hubbard, T. and Tramontano, A. (2005) ‘Critical assessment of methods of protein structure prediction (CASP)-round 6’, Proteins, Vol. 61, No. 7, pp.3–7.

Moult, J. (2005) ‘A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction’, Current Opinion in Structural Biology, Vol. 15, pp.285–289.

Morize, I., Surcouf, E., Vaney, M.C., Epelboin, Y., Buehner, M., Fridlansky, F., Milgrom, E. and Mornon, J.P. (1987) ‘Refinement of the C222(1) crystal form of oxidized uteroglobin at 1.34 Å resolution’, Journal of Molecular Biology, Vol. 194, No. 4, pp.725–739.

Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. (1995) ‘SCOP: a structural classification of proteins database for the investigation of sequences and structures’, Journal of Molecular Biology, Vol. 247, No. 1, pp.536–540.

Ngo, J.T., Marks, J. and Karplus, M. (1997) ‘Computational complexity, protein structure prediction and the Levinthal Paradox’, in Merz Jr., K. and Grand, S.L. (Eds.): The Protein Folding Problem and Tertiary Structure Prediction, Chapter 14, Birkhäuser, Boston, MA, USA, pp.435–508.

Osguthorpe, D.J. (2000) ‘Ab initio protein folding’, Current Opinion in Structural Biology, Vol. 10, No. 2, pp.146–152.

Ramachandran, G.N. and Sasisekharan, V. (1968) ‘Conformation of polypeptides and proteins’, Advances in Protein Chemistry, Vol. 23, pp.283–438.

Rohl, C.A., Strauss, C.E., Misura, K.M.S. and Baker, D. (2004) ‘Protein structure prediction using rosetta’, Methods in Enzymology, Vol. 383, pp.66–93.

Rost, B. and Sander, C. (1993) ‘Prediction of protein secondary structure at better than 70% accuracy’, Journal of Molecular Biology, Vol. 232, No. 2, pp.584–599.

Simons, K.T., Bonneau, R., Ruczinski, I. and Baker, D. (1999) ‘Ab initio protein structure prediction of CASP III targets using ROSETTA’, Proteins, Vol. 3, pp.171–176.

Srinivasan, R. and Rose, G.D. (1995) ‘LINUS – a hierarchic procedure to predict the fold of a protein’, Proteins, Vol. 22, No. 2, pp.81–99.

Srinivasan, R. and Rose, G.D. (2002) ‘Ab initio prediction of protein structure using LINUS’, Proteins, Vol. 47, No. 4, pp.489–495.

Starovasnick, M.A., Brasisted, A.C. and Wells, J.A. (1997) ‘Structural mimicry of a native protein by a minimized binding domain’, The Proceedings of the National Academy of Sciences Online USA, Vol. 94, pp.10080–10085.

Tramontano, A. (2006) Protein Structure Prediction, 1st ed., John Wiley and Sons, Inc., Weinheim, Germany.

van Gunsteren, W.F. and Berendsen, H.J.C. (1990) ‘Computer simulation of molecular dynamics: methodology, applications, and perspectives in chemistry’, Angewandte Chemie International Edition English, Vol. 29, No. 9, pp.992–1023.

Witten, I.H. and Frank, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, Series in Data Management Systems, Oxford, UK.

Zhang, X., Waltz, D. and Mesirov, J.P. (1989) ‘Protein structure prediction by a data-level parallel algorithm’, Supercomputing ’89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, Reno, Nevada, USA, pp.215–223.