Better 1D predictions by experts with machines

Burkhard Rost*
European Molecular Biology Laboratory, Heidelberg, Germany

ABSTRACT Accuracy of predicting protein secondary structure and solvent accessibility has been improved significantly by using evolutionary information contained in multiple sequence alignments. For the second Asilomar meeting, predictions were made automatically for all targets using the publicly available prediction service PredictProtein. Additionally, a semiautomatic procedure for generating more informative alignments was used in combination with the PHD prediction methods. Results confirmed the estimates for prediction accuracy. Furthermore, the more informative alignments yielded better predictions. The fairly accurate predictions of 1D structure were successfully used by various groups for the Asilomar meeting as a first step toward predicting higher dimensions of protein structure. Proteins, Suppl. 1:192–197, 1997. © 1998 Wiley-Liss, Inc.

Key words: prediction of protein secondary structure; residue solvent accessibility; multiple alignments; neural networks

INTRODUCTION

Simplifying the Structure Prediction Problem

The second Asilomar meeting has confirmed that after 40 years of ardent research, theory still cannot, in general, predict protein three-dimensional (3D) structure from sequence.1 However, the rapidly growing sequence-structure gap (number of known protein structures vs. number of known protein sequences) has enticed theoreticians to solve simplified prediction problems.2 An extreme simplification is the prediction of protein structure in one dimension (1D), as represented by strings of, for example, secondary structure and residue solvent accessibility. Theoreticians are lucky: the 1D prediction problem is not only the task they can accomplish best, but even partially correct predictions of 1D structure are useful, for example, for predicting protein function or functional sites.

Breakthrough of Third Generation Prediction Methods

The first generation of 1D prediction methods was based on physicochemical principles, expert rules, and statistics of single residues.3 The second generation incorporated the influence of residues adjacent to the residue for which 1D structure was predicted (local information).4 These secondary structure prediction methods shared three major shortcomings: (1) prediction accuracy was limited to about 60% (percentage of residues predicted correctly in one of the three states helix, strand, other), (2) β strands were typically predicted at <40% accuracy, and (3) predicted secondary structure segments were, on average, only half as long as observed segments. Some methods were tailored to overcome one of these problems (long-range information5,6; β strand accuracy7; length8). However, only recently have automatic methods been developed that overcome most of these shortcomings. The most important trick of the third-generation prediction tools of the 1990s is the use of evolutionary information contained in multiple alignments of protein families.2,9–21 To outsiders, the superiority of the third-generation tools over their predecessors (which unfortunately are still being used by major sequence analysis packages, such as GCG22) may appear marginal. However, the usefulness of the third-generation methods was demonstrated in the second Asilomar meeting, in which these automated tools were routinely used by sequence analysis experts.

MATERIALS AND METHODS

From Sequence to 1D Structure

The major step in improving 1D predictions has been the use of evolutionary information contained in multiple sequence alignments. Generating the information fed into the neural network system PHD20 required four steps: (1) a database search for homologues (method BLAST23), (2) a refined profile-based dynamic-programming alignment of the most likely homologues (method MAXHOM24), (3) a decision for which proteins will be considered as homologues (length-dependent cutoff for pairwise sequence identity25,26), and (4) a final refinement and extraction of the resulting multiple alignment. In general, prediction accuracy is better when predictions are based on better alignments. Better alignments are defined by: (i) fewer incorrectly aligned residues; (ii) greater divergence within the family of sequences. In practice, these two conditions compete, because less similar homologues are more likely to be misaligned.

*Correspondence to: Dr. Burkhard Rost, EMBL, 69 012 Heidelberg, Germany.

E-mail: [email protected]; http://www.embl-heidelberg.de/~rost/

Received 9 May 1997; Accepted 18 August 1997

PROTEINS: Structure, Function, and Genetics, Suppl. 1:192–197 (1997)

© 1998 WILEY-LISS, INC.

Completely Versus Almost Automatic

The PHD prediction methods are automatically available via the internet service PredictProtein20 (send the word help to [email protected], or use the WWW interface27). Users have the choice between the fully automatic procedure, which takes the query sequence through the entire cycle, and expert intervention in the generation of the alignment. For the Asilomar contest, both of these modes of operation were explored. The following changes were made with respect to the usual PredictProtein service:

1. Rather than SWISS-PROT, a nonredundant database of all known protein sequences was searched.

2. The cutoff for accepted homologues was lowered from 30% to 25% pairwise sequence identity.

3. The final list of putative homologues was inspected visually; some proteins were excluded from the list.

PredictProtein users typically continue with two additional time-consuming expert interventions:

4. Visual correction of the final alignment.

5. Investigation of how the prediction of particular segments depends on the alignment.

Steps 4 and 5 were not performed for the Asilomar targets.
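The effect of lowering the identity cutoff (change 2 above) can be illustrated with a minimal sketch. It assumes that the candidate homologues are already aligned to the query with "-" as the gap character, and it uses a fixed percentage threshold, whereas the actual procedure applies a length-dependent cutoff.25,26 The function names and the toy sequences are hypothetical and are not part of PredictProtein.

```python
def pairwise_identity(query: str, hit: str) -> float:
    """Percentage of identical residues over positions aligned without gaps."""
    if len(query) != len(hit):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(a, b) for a, b in zip(query, hit) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    identical = sum(1 for a, b in aligned if a == b)
    return 100.0 * identical / len(aligned)


def select_homologues(query: str, hits: dict, cutoff: float = 25.0) -> dict:
    """Keep hits whose identity to the query is at or above the cutoff.

    Assumption: a fixed cutoff stands in for the length-dependent one.
    """
    return {name: seq for name, seq in hits.items()
            if pairwise_identity(query, seq) >= cutoff}


# Toy data (hypothetical): 15-residue query and three pre-aligned hits.
QUERY = "ACDEFGHIKLMNPQR"
HITS = {
    "identical": "ACDEFGHIKLMNPQR",       # 15/15 identical
    "twilight": "ACDE" + "W" * 11,         # 4/15 identical, about 26.7%
    "unrelated": "AC" + "W" * 13,          # 2/15 identical, about 13.3%
}

usual = select_homologues(QUERY, HITS, cutoff=30.0)
relaxed = select_homologues(QUERY, HITS, cutoff=25.0)
```

With the toy data, the relaxed cutoff admits the "twilight" sequence that the usual cutoff rejects, which is exactly the kind of family member that step 3 inspects visually.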

Prediction Targets

Secondary structure and residue solvent accessibility were predicted for all Asilomar targets. Here, results were compiled for the 15 targets for which structures were available. Secondary structure predictions comprised the prediction of secondary structure state (helix, strand, other) and a reliability index for each residue; relative solvent accessibility predictions gave the percentage of predicted solvent accessibility (additionally projected onto a two-state model: buried ≤16%, exposed >16%) and a reliability index for each residue.
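The two-state projection of relative solvent accessibility can be written down directly from the thresholds given above. The following is a minimal sketch; the one-letter labels "b" and "e" and the function names are assumptions made for illustration.

```python
def two_state(relative_accessibility: float, threshold: float = 16.0) -> str:
    """Project relative accessibility (in %) onto buried ('b') or exposed ('e')."""
    return "b" if relative_accessibility <= threshold else "e"


def project(values) -> str:
    """Project a sequence of per-residue accessibilities onto a two-state string."""
    return "".join(two_state(v) for v in values)


states = project([0, 16, 17, 80])
```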

RESULTS

What Went Well?

Prediction accuracy within expected range

(1) Secondary structure prediction: the accuracy was about 74% (three-state per-residue and per-segment scores). (2) Solvent accessibility prediction: the accuracy was about 69% (two-state score); the correlation between observed and predicted relative solvent accessibility was 0.5; predictions were best for residues observed in strands, and worst for residues with no regular secondary structure; predictions were best for the charged amino acids aspartic acid, glutamic acid, and lysine, and for methionine. Overall, secondary structure prediction accuracy using PHD on the Asilomar proteins was slightly higher than expected; solvent accessibility prediction accuracy was slightly lower than expected.20
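The per-residue scores quoted here are simply the percentage of residues predicted in the observed state. A minimal sketch follows; it assumes that observed and predicted states are given as equal-length strings, with "L" used as a hypothetical label for the state "other". The same function yields Q3 for three-state strings and Q2 for two-state accessibility strings. The per-segment score Sov3 is defined in Ref. 37 and is not reproduced here.

```python
def per_residue_accuracy(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)


# Toy example (hypothetical strings): 8 of 10 residues agree.
q3 = per_residue_accuracy("HHHHEEELLL", "HHHLEELLLL")
```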

Reliability of prediction correlated with accuracy

Prediction accuracy varied considerably between different proteins (Fig. 1A). However, the reliability of PHD predictions enabled estimating on which side of such a distribution the prediction for a given protein was expected (Fig. 1B). Furthermore, individual residues for which the prediction was more likely to be accurate could be identified (Fig. 1C). For example, half of all residues predicted in a helix were predicted with the highest reliability index; 90% of these were correctly predicted (Fig. 1C).
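The relation between reliability index and accuracy can be tabulated in the noncumulative form used for Fig. 1C: for each value of the index, the accuracy of the residues predicted at exactly that value and the fraction of all residues predicted at that value. The sketch below assumes per-residue state strings and a list of integer reliability indices; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def accuracy_by_reliability(observed: str, predicted: str, reliability) -> dict:
    """Return {RI: (accuracy in %, share of residues in %)} per reliability index."""
    if not (len(observed) == len(predicted) == len(reliability)):
        raise ValueError("all inputs must have equal length")
    counts = defaultdict(lambda: [0, 0])  # RI -> [number correct, number total]
    for obs, pred, ri in zip(observed, predicted, reliability):
        counts[ri][1] += 1
        if obs == pred:
            counts[ri][0] += 1
    n = len(observed)
    return {ri: (100.0 * correct / total, 100.0 * total / n)
            for ri, (correct, total) in sorted(counts.items())}


# Toy example (hypothetical): six residues at RI = 9, four at RI = 3.
table = accuracy_by_reliability(
    "HHHHEEELLL", "HHHLEELLLL", [9, 9, 9, 3, 9, 9, 3, 9, 3, 3])
```

Restricting the count to residues predicted in one state (for example, helix) gives the per-state curves; summing over all indices at or above a threshold gives the cumulative version.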

More informative alignments yielded better predictions

By manually improving the multiple alignments used for the PredictProtein server (Fig. 1), the prediction of secondary structure improved from 70% (three-state per-residue accuracy for fully automatic alignment selection from the PredictProtein server20) to 74% (semiautomatic alignment selection used for CASP2 submissions). Solvent accessibility predictions were improved from 67% (two-state per-residue accuracy for fully automatic alignment selection) to 69% (CASP2 submissions).

What Went Wrong?

Confusing helices and strands

The two examples shown in Figure 2 represented the worst cases for the secondary structure prediction in terms of per-residue and per-segment accuracy. However, even more fatal for using 1D predictions for further steps toward 3D prediction were cases for which the prediction confused helices and strands. On average, such bad predictions were made for about 8% of all residues. Particularly bad were the values for target t11 (19% of residues confused) and for target t32 (Fig. 2; 13% of residues confused). None of the confused segments was predicted with high reliability indices.
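The percentage of confused residues can be counted as the residues observed in a helix but predicted as strand, or observed in a strand but predicted as helix. A minimal sketch, assuming the labels H and E as in Fig. 2 and a hypothetical label "L" for other:

```python
def helix_strand_confusion(observed: str, predicted: str) -> float:
    """Percentage of all residues for which helix and strand are interchanged."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    confused = sum(1 for o, p in zip(observed, predicted)
                   if {o, p} == {"H", "E"})
    return 100.0 * confused / len(observed)


# Toy example (hypothetical): two H->E and two E->H errors in ten residues.
confusion = helix_strand_confusion("HHHHEEEELL", "EEHHEEHHLL")
```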

Helices too long, strands too short

Helices were predicted at a higher than average length (12.5 predicted vs. 10.3 observed); strands were predicted at a lower than average length (5.2 vs. 6.7). These values were not representative for the PHD averages, and might have originated from unusually high percentages of secondary structure in the 15 CASP2 targets (38% helix, 24% strand compared to about 32% helix and 21% strand in a representative subset of PDB,28 data not given).
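Average segment lengths of this kind follow from the lengths of uninterrupted runs of one state in the secondary structure string. A minimal sketch, with a hypothetical function name and toy string:

```python
from itertools import groupby


def mean_segment_length(states: str, state: str) -> float:
    """Average length of uninterrupted runs of `state`; 0.0 if there are none."""
    lengths = [len(list(run)) for label, run in groupby(states) if label == state]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Toy example (hypothetical): helices of length 4 and 6, strands of 3 and 5.
toy = "HHHHLLEEELLHHHHHHLEEEEE"
helix_length = mean_segment_length(toy, "H")
strand_length = mean_segment_length(toy, "E")
```

Applying the function to the observed and to the predicted string of each target, and averaging over targets, gives the comparison reported above.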


Fig. 1. Prediction accuracy for CASP2 targets. A: 1D prediction accuracy was described by three scores: (1) per-segment accuracy in predicting secondary structure (Sov3, defined in Ref. 37), (2) per-residue accuracy in predicting secondary structure (Q3, defined in Ref. 37), and (3) per-residue accuracy in predicting residue solvent accessibility (Q2, defined in Ref. 13). Exceptionally bad predictions did not coincide, i.e., the lowest value for each score did not occur for the same protein. In fact, for the three proteins for which secondary structure prediction was worst (t31, t32, t38), accessibility predictions were rather accurate; and the worst predicted accessibility, for t16, coincided with an extremely good secondary structure prediction. B: The reliability index, scaled from 0 (low) to 9 (high), reflects the strength of the prediction for each residue. Here, the reliability index was averaged over each protein. The protein average correlated with the overall per-residue accuracy of secondary structure prediction: the worst predicted protein (t32) had the lowest average reliability index; the best predicted ones (t08, t42) had the highest average reliability indices. C and D: The expected prediction accuracy can be raised above the 90% level at the expense of not predicting secondary structure for regions with a low reliability index. How likely was a residue predicted in an α helix with a reliability index of n predicted correctly? The two plots were derived for different test sets: (C) reflects the statistics on 15 Asilomar targets, (D) the statistics on a 50 times larger set of 705 sequence-unique proteins. For example, prediction accuracy tended to surpass the 70% accuracy level for residues predicted at levels of RI ≥ 6; about two-thirds of all residues were predicted at that level. To illustrate the fraction of residues predicted at highest reliability: for the set of 705, about 40% of the helical residues were predicted at RI = 9 (93% of these were correct); and for the Asilomar set, about 50% of the helical residues were predicted at RI = 9 (90% of which were correctly predicted). Note that panels C and D show the noncumulative values. The cumulative distributions answering the question “How high is the expected prediction accuracy for all residues predicted at higher reliability?” are given elsewhere.10,12,13,20,33


Structural class predicted at accuracy levels below average

The composition of secondary structure enables a rough classification of proteins into structural classes.29 On average, secondary structure predictions from PHDsec predict 75% of all proteins correctly in one of the four classes: all-α, all-β, α/β, other.20,30 For the CASP2 targets, the classification was correct for only 67% of the proteins. The dominant error was to predict strands for all-α proteins (and consequently to place those proteins into the class “other” rather than into the class “all-α”). However, the average content of secondary structure was predicted about as accurately as expected: differences between predicted and observed compositions were 7% for helix and 9% for strand. Thus, the difference between CASP2 and expected classification error could be attributed to the small dataset.

Overprediction of buried residues

Most buried residues (≤16% relative solvent accessibility) were predicted as buried (76%). However, this was accomplished by an overprediction of buried residues, as only 60% of the residues predicted to be buried were actually observed in that state. The dominant error was a strong overprediction of completely buried (0% accessible) residues. In general, residues were clearly overpredicted in the range 49–64% accessibility, and clearly underpredicted in the ranges 1–4% and 64–81%. (Note: the other side of the same coin was that exposed residues were underpredicted: 80% of the residues predicted to be exposed were observed in that state; however, only 64% of the residues observed in the exposed state were actually predicted as exposed.)
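The two percentages quoted for each state answer different questions: what fraction of the residues observed in a state is predicted in that state, and what fraction of the residues predicted in a state is observed in it. A minimal sketch of both, assuming two-state strings with the hypothetical labels "b" (buried) and "e" (exposed):

```python
def state_scores(observed: str, predicted: str, state: str = "b"):
    """Return (% of observed `state` predicted as such,
               % of predicted `state` observed as such)."""
    if len(observed) != len(predicted):
        raise ValueError("inputs must have equal length")
    n_observed = sum(1 for o in observed if o == state)
    n_predicted = sum(1 for p in predicted if p == state)
    n_both = sum(1 for o, p in zip(observed, predicted) if o == p == state)
    pct_of_observed = 100.0 * n_both / n_observed if n_observed else 0.0
    pct_of_predicted = 100.0 * n_both / n_predicted if n_predicted else 0.0
    return pct_of_observed, pct_of_predicted


# Toy example (hypothetical): 4 observed buried, 5 predicted buried, 3 shared.
scores = state_scores("bbbbeeeeee", "bbbebbeeee", state="b")
```

A high first value combined with a lower second value, as in the toy example, is the signature of overprediction of that state.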

Why?

Correct alignment crucial for correct prediction

Alignments used for the input to the PHD neural networks should be both informative (high level of diversity; many sequences) and correct. The semiautomatic generation of multiple alignments used for the CASP2 submissions clearly improved the information content in the alignments, and thus prediction accuracy. However, including proteins from the twilight zone31 of 25–30% pairwise sequence identity may be fatal in two ways. First, some of the included proteins may not be structurally similar to the protein for which 1D structure is predicted. Second, the lower the level of pairwise sequence identity, the higher the likelihood of misaligning some residues. This became particularly obvious when the alignments for the worst predicted proteins were changed (after the meeting). Secondary structure prediction accuracy could be increased by simply excluding some less likely family members: for target t31, Q3 (three-state per-residue accuracy) from 62% to 68%, Sov3 (three-state per-segment accuracy) from 54% to 65%; for target t32, Q3 from 53% to 56%, Sov3 from 54% to 56%; for target t38, Q3 from 57% to 63%, Sov3 from 42% to 48%. Prediction accuracy was clearly below average for proteins for which no alignments were available (such as for t38, Fig. 2). The second effect of falsely aligned residues was difficult to estimate. However, the extent of the first effect illustrated that alignment errors were fatal.

Fig. 2. Examples for prediction errors. Two examples of errors in secondary structure prediction. Secondary structure prediction was worst for these two proteins (Fig. 1A; note: for t38 the prediction had to be based on a single sequence). AA, amino acid in one-letter code; Obs, secondary structure assignment based on 3D structure by DSSP38; PHD, prediction by neural network system; RI, reliability of prediction (0 is low, 9 is high). Symbols for secondary structure assignments: H, α helix; E (extended), β strand; blank, other.

Prediction accuracy lower for unusual proteins

The PHD neural networks were trained on globular water-soluble proteins; predictions tend to be wrong for other proteins. One example from the CASP2 set was t32, a small protein (98 residues; Fig. 2) that is stabilized by three cysteine bridges. Fundamental mistakes in the secondary structure prediction occurred around the cysteine bridges (Fig. 2). However, on average, proteins with cysteine bridges were not predicted less accurately (Arthur Lesk, this issue). For the prediction of solvent accessibility another effect becomes crucial, namely the interaction between protein chains: overall accessibility was predicted worst for t16. However, many of the residues “falsely” predicted as buried were actually observed at interfaces between the three chains of the protein (data not shown). In general, residues predicted to be buried and observed to be exposed often indicate binding interfaces.32

Confusing helices and strands partly due to using local information

A fatal error for prediction-based modeling is the confusion of helices and strands. Exactly this fatal error happens frequently for PHD predictions (for 7 of the 15 CASP2 targets; statistics on a larger dataset33). Often the beginning and the end of the confused segments are correctly identified (target t02, strand 13; target t11, strand 1; target t14, strand 4, helix 8; target t16, helix 2; target t31, strand 8). For only two confused segments was the reliability of at least one residue above a value of RI = 6 (helix 8 in t14, and helix 1 in t16). Nevertheless, how can a segment be placed correctly if the type is confused? Secondary structure formation is partly determined by residue interactions that are nonlocal in sequence. Such information is captured by the PHD predictions only to some extent. A region may have a higher preference for forming a helix than a strand (and vice versa), but interactions nonlocal in sequence may make the formation of a β sheet (α helix) energetically more favorable. Indeed, the confusion between helices and strands can often be attributed to hydrogen bonds stabilized by nonlocal interresidue contacts.34

Fifteen proteins are not representative

Some of the “errors” were specific for the CASP2 targets. The major reason for that was that 15 proteins were not enough to comprise a representative subset of all proteins (difference between Fig. 1C and D).

CONCLUSIONS: WHAT DID WE LEARN?

Easy To Be Wise Afterward?

Inspecting the examples for which predictions went wrong tended to produce arguments for why that was so. However, such reasoning in some cases appeared rather premature: proteins for which secondary structure was predicted below average tended to differ from those for which solvent accessibility was predicted below average (Fig. 1A), although many of the arguments would apply to both prediction methods (such as the stabilization by cysteine bridges for target t32).

Generating More Informative Alignments Is Straightforward

The difference in prediction accuracy between the fully automatic and the semiautomatic selection of the alignment (two to four percentage points) illustrated that prediction accuracy could be improved significantly without changing the final prediction method (PHD20). The procedure used for the CASP2 submission could be automated. (The major technical problem, currently, is the lack of CPU resources available at EMBL for the PredictProtein service.) Another point was illustrated for the CASP2 targets: monitoring how predictions change in response to the alignment (including more or fewer proteins) is an excellent means of arriving at better expert-driven predictions.

CASP: Good for Testing Methods, But Not Representative

1D structure predictions are excellent test cases for prediction methods in general, since large datasets are available on which prediction accuracy can be estimated. Such tests reveal that prediction accuracy differs between different proteins (with one standard deviation of about ten percentage points). How many proteins are representative? To approach the answer, consider the following experiment: first, average prediction accuracy and its standard deviation are compiled for a set of 705 unique protein chains33; second, from the set of 705 chains, 20 are picked at random; this is repeated until average accuracy and standard deviation match those of the set of 705 proteins. How many repeats would it take? The answer: about five to ten. Thus, the following conclusions from 1D predictions in CASP2 emerge for users (and editors): do not place too much trust in methods that (1) were not tested in CASP, (2) revealed much lower values of accuracy than published, or (3) were successful in CASP, but never evaluated on larger databases.
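The sampling experiment described above can be sketched as follows. Several details are assumptions made only for illustration: the per-protein accuracies are synthetic (drawn from a normal distribution with mean 72 and standard deviation 10, not the actual PHD results), the draws are pooled across repeats, and "match" is taken to mean agreement of mean and standard deviation within a tolerance of one percentage point. Under different choices the number of repeats will differ.

```python
import random
import statistics


def repeats_until_representative(accuracies, sample_size=20, tolerance=1.0,
                                 seed=0, max_repeats=1000):
    """Count random draws of `sample_size` proteins needed until the pooled
    mean and standard deviation match those of the full set within `tolerance`.

    Returns None if no match is reached within `max_repeats` draws.
    """
    if sample_size > len(accuracies):
        raise ValueError("sample_size exceeds the number of proteins")
    rng = random.Random(seed)
    target_mean = statistics.mean(accuracies)
    target_sd = statistics.pstdev(accuracies)
    pooled = []
    for repeat in range(1, max_repeats + 1):
        pooled.extend(rng.sample(accuracies, sample_size))
        mean_ok = abs(statistics.mean(pooled) - target_mean) <= tolerance
        sd_ok = abs(statistics.pstdev(pooled) - target_sd) <= tolerance
        if mean_ok and sd_ok:
            return repeat
    return None


# Synthetic stand-in for 705 per-protein accuracies (assumption, not real data).
data_rng = random.Random(1)
synthetic = [data_rng.gauss(72.0, 10.0) for _ in range(705)]
n_repeats = repeats_until_representative(synthetic)
```

Running the function with different seeds gives a distribution of repeat counts rather than a single number, which is the more informative quantity when judging how representative a set of 15 to 20 proteins can be.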

1D Predictions Now Accurate Enough as First Step in Structure Prediction

Many of the third-generation predictions of 1D structure are accurate enough to become a first step in predicting higher dimensions of protein structure (Arthur Lesk, this issue). A prominent application of PHD predictions was threading of the CASP2 targets (e.g., Murzin, or Fischer, Eisenberg et al., this issue). Even an automatic PHD-threading procedure35,36 yielded relatively good results for recognizing the correct fold.

ACKNOWLEDGMENTS

I thank Sean O’Donoghue (EMBL, Heidelberg) for helpful discussions and for critically reading the manuscript, all those who contributed essentially to the CASP2 meeting by making their experimental structure determinations available before publication, the assessors Arthur Lesk (LMB, Cambridge) and Michael Levitt (Stanford University), and all those who were involved in organizing that meeting, to name a few: John Moult (CARB, Washington), Tim Hubbard (Sanger Centre, England), Stephen Bryant (NIH, Washington), Jan Pedersen (CARB, Washington), and Krzysztof Fidelis (LNL, Livermore).

REFERENCES

1. Rost, B., O’Donoghue, S.I. Sisyphus and prediction of protein structure. CABIOS 13:345–356, 1997.

2. Rost, B., Sander, C. Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct. 25:113–136, 1996.

3. Kabsch, W., Sander, C. How good are predictions of protein secondary structure? FEBS Lett. 155:179–182, 1983.

4. Fasman, G.D. Prediction of Protein Structure and the Principles of Protein Conformation. New York: Plenum, 1989.

5. Maxfield, F.R., Scheraga, H.A. Improvements in the prediction of protein topography by reduction of statistical errors. Biochemistry 18:697–704, 1979.

6. Zvelebil, M.J., Barton, G.J., Taylor, W.R., Sternberg, M.J.E. Prediction of protein secondary structure and active sites using alignment of homologous sequences. J. Mol. Biol. 195:957–961, 1987.

7. Gascuel, O., Golmard, J.L. A simple method for predicting the secondary structure of globular proteins: Implications and accuracy. CABIOS 4:357–365, 1988.

8. Kabsch, W., Sander, C. Segment83. Unpublished, 1983.

9. Gerloff, D.L., Jenny, T.F., Knecht, L.J., Gonnet, G.H., Benner, S.A. The nitrogenase MoFe protein. FEBS Lett. 318:118–124, 1993.

10. Rost, B., Sander, C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232:584–599, 1993.

11. Benner, S.A., Badcoe, I., Cohen, M.A., Gerloff, D.L. Bona fide prediction of aspects of protein conformation. J. Mol. Biol. 235:926–958, 1994.

12. Rost, B., Sander, C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 19:55–72, 1994.

13. Rost, B., Sander, C. Conservation and prediction of solvent accessibility in protein families. Proteins 20:216–226, 1994.

14. Wako, H., Blundell, T.L. Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. I. Solvent accessibility classes. J. Mol. Biol. 238:682–692, 1994.

15. Wako, H., Blundell, T.L. Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. II. Secondary structures. J. Mol. Biol. 238:693–708, 1994.

16. Barton, G.J. Protein secondary structure prediction. Curr. Opin. Struct. Biol. 5:372–376, 1995.

17. Gerloff, D.L., Chelvanayagam, G., Benner, S.A. A predicted consensus structure for the protein-kinase c2 homology (c2h) domain, the repeating unit of synaptotagmin. Proteins 22:299–310, 1995.

18. Salamov, A.A., Solovyev, V.V. Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignment. J. Mol. Biol. 247:11–15, 1995.

19. Di Francesco, V., Garnier, J., Munson, P.J. Improving protein secondary structure prediction with aligned homologous sequences. Protein Sci. 5:106–113, 1996.

20. Rost, B. PHD: Predicting one-dimensional protein structure by profile based neural networks. Methods Enzymol. 266:525–539, 1996.

21. Thompson, M.J., Goldstein, R.A. Predicting solvent accessibility: Higher accuracy using Bayesian statistics and optimized residue substitution classes. Proteins 25:38–47, 1996.

22. Devereux, J., Haeberli, P., Smithies, O. GCG package. Nucleic Acids Res. 12:387–395, 1984.

23. Altschul, S.F., Gish, W. Local alignment statistics. Methods Enzymol. 266:460–480, 1996.

24. Schneider, R. Sequenz und Sequenz-Struktur Vergleiche und deren Anwendung für die Struktur- und Funktionsvorhersage von Proteinen. Doctoral thesis, University of Heidelberg, 1994.

25. Chothia, C., Lesk, A.M. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823–826, 1986.

26. Sander, C., Schneider, R. Database of homology-derived structures and the structural meaning of sequence alignment. Proteins 9:56–68, 1991.

27. Rost, B. PredictProtein: Internet prediction service. WWW document (http://www.embl-heidelberg.de/predictprotein): EMBL, 1997.

28. Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., et al. The Protein Data Bank: A computer-based archival file for macromolecular structures. J. Mol. Biol. 112:535–542, 1977.

29. Levitt, M., Chothia, C. Structural patterns in globular proteins. Nature 261:552–558, 1976.

30. Rost, B. Observed secondary structure content for 721 proteins. WWW document (http://www.embl-heidelberg.de/~rost/Res/96A-SecStrContent.html): EMBL Heidelberg, Germany, 1996.

31. Doolittle, R.F. Of URFs and ORFs: A Primer on How To Analyze Derived Amino Acid Sequences. Mill Valley, CA: University Science Books, 1986.

32. Hubbard, T., Tramontano, A., Barton, G., et al. Update on protein structure prediction: Results of the 1995 IRBM workshop. Folding Design 1:R55–R63, 1996.

33. Rost, B. Expected prediction accuracy of PHD. WWW document (http://www.embl-heidelberg.de/~rost/Res/96D-ExpAccuracyPHD.html): EMBL Heidelberg, Germany, 1996.

34. Rychlewski, L., Godzik, A. Secondary structure predictions: In quest of forces that shape the local protein structure. The Scripps Research Institute, 10666 N. Torrey Pines Road, La Jolla, CA 92037, USA, 1996.

35. Rost, B. TOPITS: Threading one-dimensional predictions into three-dimensional structures. In: Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T. and Wodak, S. (eds.). Third International Conference on Intelligent Systems for Molecular Biology. Cambridge, England. Menlo Park, CA: AAAI Press, 1995:314–321.

36. Rost, B., Schneider, R., Sander, C. Protein fold recognition by prediction-based threading. J. Mol. Biol. 270:471–480, 1997.

37. Rost, B., Sander, C., Schneider, R. Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 235:13–26, 1994.

38. Kabsch, W., Sander, C. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637, 1983.

