Prediction of Protein Secondary Structure Content Using Amino Acid Composition and Evolutionary Information

Soyoung Lee, Byung-chul Lee, and Dongsup Kim*

Department of Biosystems, Korea Advanced Institute of Science and Technology, Daejeon, South Korea

ABSTRACT Knowing a protein's structure and inferring its function from that structure is one of the main issues of computational structural biology, and the first step is often the study of protein secondary structure. There have been many attempts to predict protein secondary structure content. Previous attempts assumed that the content of protein secondary structure can be predicted successfully from the amino acid composition of a protein. Recent methods achieved remarkable prediction accuracy by using expanded composition information. The overall average error of the most successful method is 3.4%. Here, we demonstrate that even with the simple amino acid composition alone, the prediction accuracy can be improved significantly if evolutionary information is included. The idea is motivated by the observation that evolutionarily related proteins share a similar structure. After calculating the homolog-averaged amino acid composition of a protein, which can be easily obtained from the multiple sequence alignment by running PSI-BLAST, those 20 numbers are learned by a multiple linear regression, an artificial neural network, and a support vector regression. The overall average error of the method based on support vector regression is 3.3%. It is remarkable that we obtain comparable accuracy without utilizing expanded composition information such as the pair-coupled amino acid composition. This work again demonstrates that the amino acid composition is a fundamental characteristic of a protein. It is anticipated that our novel idea can be applied to many areas of protein bioinformatics where amino acid composition information is utilized, such as subcellular localization prediction, enzyme subclass prediction, domain boundary prediction, signal sequence prediction, and prediction of unfolded segments in a protein sequence, to name a few. Proteins 2006;62:1107–1114. © 2005 Wiley-Liss, Inc.

Key words: computational structural biology; homolog-averaged amino acid composition; multiple sequence alignment; protein secondary structure content prediction

INTRODUCTION

Protein secondary structure content is the proportion of each secondary structure type in a protein. Formally, it is defined as the ratio of the number of residues in a given secondary structure to the total number of residues in the protein. According to the conventional classification by DSSP,1 there are eight secondary structure types, namely, α-helix, β-strand, β-bridge, three-turn helix, π-helix, hydrogen-bonded turn, bend, and random coil. Secondary structure is fundamental information about a protein, and knowing its secondary structure content is often the first step towards more detailed knowledge of its structure and function. However, the experimental methods for determining secondary structure content have not been sufficiently accurate.2,3 For that reason, there have been many attempts to predict the secondary structure content. Among the early attempts, notable prediction methods were the multiple linear regression approach,4–7 the artificial neural network approach,8 and the analytic vector decomposition method.9 These methods assumed that the amino acid composition, combined with compositional couplings and various other features such as sequence length and structural class, is sufficient to predict protein secondary structure content successfully.

Grant sponsor: Ministry of Science and Technology of Korea; Grant number: M1052900000205-N290000210.

*Correspondence to: Dongsup Kim, Department of Biosystems, Korea Advanced Institute of Science and Technology, Daejeon 305-701, South Korea. E-mail: [email protected]

Received 11 July 2005; Revised 12 September 2005; Accepted 26 September 2005

Published online 12 December 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.20821

PROTEINS: Structure, Function, and Bioinformatics 62:1107–1114 (2006)

© 2005 WILEY-LISS, INC.

Recently, Liu and Chou10 extended this idea by enlarging the amino acid composition information. They assumed that considering the coupling effect of residues along the sequence would improve prediction accuracy. They introduced a new feature, the coupled amino acid composition, in which two adjacent residues are considered simultaneously, and computed 20 × 20 = 400 pair amino acid occurrence probabilities ranging from P(AA) to P(YY), where A, …, Y are the single-letter codes of the 20 amino acids. They demonstrated that when the coupling effects were taken into account, the average absolute errors for predicting the contents of α-helices and β-sheets were reduced from 0.103 and 0.090 to 0.056 and 0.046, respectively, in the self-consistency test. Furthermore, Cai and colleagues11 achieved a remarkable improvement in prediction accuracy by developing an artificial neural network approach based on the pair-coupled amino acid composition. These works have shown that the amino acid composition is a fundamental property of a protein and contains enough information to develop a successful method for predicting essential characteristics of the protein, such as its secondary structure content. Moreover, these works have also exposed an obvious limitation of the information content of the amino acid composition itself: to develop a more accurate predictor, it is necessary to consider other features, such as coupling effects.
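To make the difference between the 20-value composition and the 400-value coupled composition concrete, the following is a minimal sketch, not the implementation used by Liu and Chou. It assumes the coupled feature is the plain normalized count of adjacent residue pairs; the function names `single_composition` and `coupled_composition` are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes, A to Y


def single_composition(seq):
    """Return the 20 single-residue frequencies of a sequence."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    return {a: (residues.count(a) / n if n else 0.0) for a in AMINO_ACIDS}


def coupled_composition(seq):
    """Return the 20 x 20 = 400 adjacent-pair frequencies of a sequence.

    Assumption: a pair frequency is the count of the pair divided by the
    number of adjacent pairs; the published definition may differ.
    """
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    pairs = list(zip(residues, residues[1:]))
    for a, b in pairs:
        counts[a + b] += 1
    total = len(pairs)
    return {p: (c / total if total else 0.0) for p, c in counts.items()}
```

The coupled feature grows the input from 20 to 400 values per protein, which is the "quantity of information" route that the present work avoids.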

In this work, we show that if evolutionary information is incorporated into the prediction algorithm, it is possible to develop a very accurate predictor using only the amino acid composition. The idea is motivated by the observation that evolutionarily related proteins share a similar structure.12 It is well known that the key to one of the most successful secondary structure prediction methods, PSI-PRED,13 is its use of evolutionary information obtained from multiple sequence alignments. Following earlier work by Rost and Sander,14 D. Jones used the profiles of a well-selected set of proteins as input to his neural network training. Before Rost and Sander's seminal work, the prediction accuracy (Q3) was below 70% due to the absence of evolutionary information in prediction algorithms. It is reasonable to expect that a similar performance gain can be achieved in predicting secondary structure content if evolutionary information is incorporated into the prediction methods. The present work demonstrates that this is indeed the case: the performance of our secondary structure content predictor, which uses only the amino acid composition combined with evolutionary information, is comparable to that of the most accurate predictors, which use not only the composition information but also the coupling effects. We introduce a new feature, the homolog-averaged amino acid composition of a protein, which is the amino acid composition of all the homologs of the protein. Instead of using the amino acid composition of the query protein alone, we first search for proteins that are homologous to it and then estimate the homolog-averaged amino acid composition by counting the occurrences of each amino acid in all of these sequences. In this way, evolutionary information is effectively incorporated into the prediction scheme. It turns out that the homolog-averaged amino acid composition greatly improves the content prediction accuracy.

METHODS

Dataset

We used two datasets. The first (Dataset-I) is the dataset used in Chou's method,15 so that the prediction accuracy of our predictor can be compared directly with theirs. Its training set consists of 244 proteins with no more than 35% homology to one another, and its test set consists of 202 proteins with no more than 35% homology to one another or to the proteins in the training set. The second (Dataset-II) is based on the protein domains defined by SCOP version 1.67.16 From the domain subsets with less than 40% sequence identity to each other prepared by the ASTRAL Compendium,17 we used the domains belonging to the following classes: all alpha proteins, all beta proteins, alpha and beta proteins (a/b), alpha and beta proteins (a + b), and multidomain proteins (alpha and beta). A total of 5796 proteins were randomly divided into a training set of 4350 proteins and a test set of 1446 proteins.
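The random division of Dataset-II can be sketched as follows. This is only an illustration of a 4350/1446 split of 5796 items; the placeholder domain names, the fixed seed, and the helper `split_dataset` are assumptions, since the text does not describe how the random division was carried out.

```python
import random


def split_dataset(items, n_train, seed=0):
    """Shuffle a copy of items and return (training set, test set)."""
    if n_train > len(items):
        raise ValueError("n_train exceeds the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 5796 ASTRAL domains.
domains = [f"domain_{i}" for i in range(5796)]
train, test = split_dataset(domains, 4350)
```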

Algorithm

Our approach is based on two observations: evolutionarily related proteins share a similar structure,12 and the amino acid composition of a protein contains enough information to predict its secondary structure content successfully. The prediction accuracy can therefore be improved by considering the amino acid composition not only of the protein itself but also of its evolutionarily related proteins together. We calculate the composition of the 20 amino acids over the proteins that are evolutionarily related to a query protein, which can be done by simple numerical manipulation of the frequency matrix obtained by running PSI-BLAST.18 Furthermore, we assume that the relative composition of an amino acid is more reliable than its absolute composition. Instead of using the amino acid compositions themselves as input parameters of the learning tools, we use the ratio of the compositions of the proteins evolutionarily related to a query protein to the background probabilities, which are obtained by counting the 20 amino acids in all the representative proteins in nature. We construct three predictors by applying three learning tools (a multiple linear regression, an artificial neural network, and a support vector regression) to the manipulated input data.

Input feature

(1) Calculating the background probability of each amino acid X, $P_0(X) = \frac{1}{3352}\sum_{k=1}^{3352} \frac{N_k(X)}{L_k}$, where k indexes the 3352 proteins in FSSP,19 one of the most widely used nonredundant protein structure databases, $L_k$ is the length of the kth protein, and $N_k(X)$ is the number of occurrences of amino acid X in the kth protein.

(2) Getting the frequency matrix, $S_i(j, X)$, of the ith protein among the 244 training proteins and 202 validation proteins by running PSI-BLAST, where $S_i(j, X)$ represents the composition of amino acid X at the jth position of the multiple sequence alignment of all the proteins that are evolutionarily related to the ith protein.

(3) Calculating the homolog-averaged amino acid composition of the ith protein, $P_i(X)$, X = A, C, D, …, Y. The occurrence of amino acid X in all the proteins that are evolutionarily related to the ith protein is given by $P_i(X) = \frac{1}{L_i}\sum_{j=1}^{L_i} S_i(j, X)$. In other words, this equation sums the occurrences of a specific amino acid over every position of the multiple sequence alignment and divides by $L_i$.

(4) Calculating $P_i(X)/P_0(X)$ from the values obtained in steps (1) and (3) as the input data of all learning tools.
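The four steps above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: it assumes the PSI-BLAST frequency matrix has already been parsed into a list with one dictionary per alignment position, mapping each one-letter code to its frequency. Parsing PSI-BLAST output is not shown, and all function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def background_probability(sequences):
    """Step (1): P0(X), the mean over proteins of N_k(X) / L_k."""
    totals = {a: 0.0 for a in AMINO_ACIDS}
    for seq in sequences:
        for a in AMINO_ACIDS:
            totals[a] += seq.count(a) / len(seq)
    return {a: totals[a] / len(sequences) for a in AMINO_ACIDS}


def homolog_averaged_composition(freq_matrix):
    """Step (3): P_i(X), the average of S_i(j, X) over positions j."""
    length = len(freq_matrix)
    return {a: sum(row[a] for row in freq_matrix) / length
            for a in AMINO_ACIDS}


def input_features(freq_matrix, background):
    """Step (4): the 20 ratios P_i(X) / P0(X), in the order A to Y.

    Assumes every background probability is nonzero, which holds for
    a large reference set but not for toy inputs.
    """
    p_i = homolog_averaged_composition(freq_matrix)
    return [p_i[a] / background[a] for a in AMINO_ACIDS]
```

A ratio above 1 means the amino acid is enriched in the homolog family relative to the background, and a ratio below 1 means it is depleted.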

Output feature

Output parameters are the eight values that represent the observed contents of the eight secondary structure types of a protein, derived from the DSSP file of each protein in the training and test datasets.

Training with learning tools

Multiple Linear Regression (MLR). First, we applied an MLR to the computed results of the 244 training proteins. We ran an MLR with a backward stepwise selection algorithm eight times, once for each of the eight protein secondary structure types. Backward stepwise selection repeatedly drops the least effective parameter of the previous model that is not statistically significant and refits the model with the remaining parameters, until all remaining parameters are statistically significant.

Artificial Neural Network (ANN). Next, we applied an ANN to the computed results of the training dataset. The architecture of the neural network is shown in Figure 1. It consists of 20 input nodes that take the values computed by our algorithm, 80 hidden nodes, and 8 output nodes that represent the contents of the eight secondary structure types of a protein. The back-propagation network20 uses a sigmoidal function to provide a continuous activation function. We varied the number of hidden nodes and the learning rate to optimize the model; the most accurate model was obtained with 80 hidden nodes and a learning rate of 0.1.
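As an illustration of the 20-80-8 architecture, the sketch below runs a single forward pass through such a network. It is not the trained model: the weights are random placeholders, the use of a sigmoid on the output layer is an assumption, and back-propagation training with the learning rate of 0.1 is omitted.

```python
import math
import random


def sigmoid(x):
    """Continuous sigmoidal activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """Create n_out weight rows of n_in weights plus one trailing bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(layer, inputs):
    """Apply one fully connected sigmoid layer to a list of inputs."""
    return [sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
            for row in layer]


rng = random.Random(0)                # fixed seed, placeholder weights only
hidden_layer = init_layer(20, 80, rng)
output_layer = init_layer(80, 8, rng)

features = [1.0] * 20                 # hypothetical P_i(X) / P0(X) ratios
contents = forward(output_layer, forward(hidden_layer, features))
```

One network of this shape produces all eight content values at once, whereas a regression approach needs one model per secondary structure type.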

Support Vector Regression (SVR). The support vector machine (SVM), introduced by Vapnik21 in 1998, is a supervised learning algorithm that is useful for recognizing subtle patterns in complex datasets. The algorithm performs discriminative classification, learning by example to predict the classification of previously unseen data. We used the free software mySVM22 for training and testing. This software package supports both pattern recognition and regression estimation with support vector machines. mySVM offers various kernel options: a dot kernel (also called a linear kernel), a polynomial kernel, a radial basis function kernel, etc. We chose the dot kernel and optimized the performance by tuning parameters such as the regularization parameter. We also tried a radial basis function kernel and obtained similar results. We ran mySVM eight times separately to obtain eight predictors, one for each of the eight protein secondary structure types.

Error test. We measured the prediction error of each method on the 202 test proteins to compare their prediction accuracy. There are two criteria for measuring the prediction error: the average absolute error and the overall average error. The first is defined as $e_s = \frac{1}{202}\sum_{k=1}^{202} \left| p_k^{s} - d_k^{s} \right|$, where $e_s$ is the average absolute error for secondary structure type s, and k indexes the test proteins. The predicted content and the observed content of structure type s in the kth protein are denoted $p_k^{s}$ and $d_k^{s}$, respectively. The second criterion is the overall average error, $\bar{e}$, which is the mean of the eight average absolute errors: $\bar{e} = \frac{1}{8}\sum_{s} e_s$.
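The two error criteria can be written down directly. The sketch below is illustrative; the function names are not from the paper, and the worked example reuses the eight SVR errors reported in Table I to show that their mean reproduces the reported overall average error of 0.033.

```python
def average_absolute_error(predicted, observed):
    """Mean of |predicted - observed| over the test proteins."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - d) for p, d in zip(predicted, observed)) / len(predicted)


def overall_average_error(per_structure_errors):
    """Mean of the average absolute errors of the secondary structure types."""
    return sum(per_structure_errors) / len(per_structure_errors)


# SVR column of Table I, in the order listed there.
svr_errors = [0.078, 0.072, 0.019, 0.000, 0.024, 0.008, 0.027, 0.039]
overall = overall_average_error(svr_errors)  # about 0.0334, i.e. 0.033
```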

RESULTS AND DISCUSSION

Comparison of Prediction Errors of Models by MLR, ANN, and SVR

The average absolute errors and the overall average errors of the three methods tested on Dataset-I are displayed in Table I. The SVR method is the most accurate: for every one of the eight secondary structure types its average absolute error is the lowest or tied for the lowest, and its overall average error is the lowest. In contrast, the MLR method is the least accurate, with the highest average absolute error for most types and the highest overall average error. As shown in Table I, the SVR predictor is also more accurate than the ANN model. The advantage of an ANN in this work is its computational convenience. With an SVR or an MLR, training must be run eight times to build the models for the eight secondary structure contents, whereas an ANN yields a single model that outputs all eight contents at once. However, the neural network algorithm needs a larger dataset than the SVR to find an optimized model. In this work, we use 244 proteins for training, which may not be enough to find an optimized model with the neural network algorithm.

Fig. 1. Artificial neural network architecture used in this work. It consists of three layers: input layer, hidden layer, and output layer.

TABLE I. The Prediction Errors of Models by MLR, ANN, and SVR

| Protein secondary structure | MLR | ANN | SVR |
| --- | --- | --- | --- |
| Alpha-helix | 0.089 | 0.085 | 0.078 |
| Beta-strand | 0.086 | 0.081 | 0.072 |
| Three-turn helix | 0.021 | 0.022 | 0.019 |
| Pi-helix | 0.001 | 0.000 | 0.000 |
| H bonded turn | 0.027 | 0.027 | 0.024 |
| Beta-bridge | 0.010 | 0.009 | 0.008 |
| Bend | 0.032 | 0.027 | 0.027 |
| Random coil | 0.042 | 0.040 | 0.039 |
| Overall average error | 0.038 | 0.036 | 0.033 |

According to the work by Liu and Chou,10 including the coupling effects reduced the average absolute errors for predicting the contents of α-helices and β-sheets from 0.103 and 0.090 to 0.056 and 0.046, respectively, in the self-consistency test. With our method, the prediction errors for α-helices and β-sheets are 0.078 and 0.072, respectively. Even though our method and the Liu and Chou method without coupling effects use the same number of input features (20), the prediction errors of our method (0.078 and 0.072) are much lower than those of the Liu and Chou method (0.103 and 0.090), a reduction of 0.025 and 0.018 for α-helices and β-sheets, respectively. The difference is that in our method evolutionary information is effectively incorporated into the prediction scheme. It should be noted that a more meaningful measure of prediction accuracy is obtained by testing a prediction method on an independent test set, not by a self-consistency test, which generally overestimates prediction accuracy. Indeed, when the Liu and Chou method was tested on an independent test set, the prediction errors for α-helices and β-sheets were 0.077 and 0.076, respectively.11 Thus, the improvement of our method over the Liu and Chou method is expected to be greater than the reductions of 0.025 and 0.018 for α-helices and β-sheets, respectively.

Although its prediction accuracy is low, the MLR method with backward stepwise selection shows that different input parameters influence different secondary structure contents. At each round of backward stepwise selection, the least effective parameter that is not statistically significant is dropped, until no statistically insignificant parameters remain. As a consequence, all parameters contained in the final model are statistically significant and therefore influence the corresponding secondary structure content. They are displayed in Table II. The detailed equations are as follows:

(1) For α-helix, Content = 0.7714 � 5.4687A � 1.9655D � 3.9577G � 2.5539H � 3.4173L � 5.9148M � 3.3446N � 4.1480P � 4.6496T � 3.2696V

(2) For β-strand, Content = 2.311 � 5.678A � 3.563C � 3.949D � 1.372E � 4.254H � 2.128I � 1.055K � 4.636L � 6.564M � 2.726Q � 3.261R � 3.787S � 3.699T � 3.206W � 2.388Y

(3) For three-turn helix, Content = 0.8705 � 0.7018A � 0.6721C � 1.0795E � 0.8350G � 1.0183H � 1.1343I � 0.6792K � 0.9052L � 1.2127M � 1.4378N � 0.6087P � 0.8839Q � 0.8734R � 0.7722S � 1.1091T � 0.7910V � 0.6125W � 1.3194Y

(4) For π-helix, Content = 0.001571 � 0.021692G � 0.031560H � 0.023223K � 0.033331S

(5) For hydrogen-bonded turn, Content = −0.3637 � 0.6904A � 0.5298C � 1.7514D � 0.4409G � 0.9423H � 1.1649I � 0.3922K � 0.8994P � 0.5889R � 0.5498S � 0.9162W � 1.3463Y

(6) For β-bridge, Content = 0.06616 � 0.35610D � 0.19365E � 0.27202F � 0.19617L � 0.30329M � 0.33198N � 0.37477T � 0.69778W

(7) For bend, Content = −0.5940 � 1.1046C � 0.6480E � 1.4156F � 1.6740G � 2.1696H � 0.4190L � 2.6169N � 0.8916P � 1.8530Q � 1.1393R � 1.1455V � 0.8667W

(8) For random coil, Content = −2.350 � 1.130A � 2.884C � 3.330D � 2.098E � 1.808F � 2.965G � 4.217H � 2.409I � 1.883K � 2.003L � 1.775M � 4.537N � 4.041P � 1.316Q � 3.417R � 2.979S � 2.507T � 2.804V � 1.381W � 1.575Y

The letters from A to Y are the conventional one-letter codes of the 20 amino acids; in each equation, a letter stands for the input value of that amino acid.

Comparison of Prediction Errors of the Liu and Chou Method, the Cai Method, and Our New Method

Next, we compared our best model, the SVR model, with the Liu method and the Cai method in Table III. Liu and Chou10 demonstrated that providing correlation information of the amino acid composition significantly increased the performance of their protein secondary structure content predictor. They used 400 input parameters to express the first-order coupled amino acid composition. Cai and Chou,11 on the other hand, applied an artificial neural network to learn the 210 pair-coupled amino acid composition parameters that were used in Chou's method.15 This method has been the most accurate, with the lowest overall average error of 3.4%. The results show that the prediction accuracy of our method is better than that of Liu's method and similar to that of Cai's method. Compared with Liu's method, all of our average absolute errors except that for α-helix, as well as the overall average error, are lower. Compared with Cai's method, our average absolute errors are lower except for α-helix, β-strand, and random coil, and our overall average error is slightly lower. It is remarkable that the prediction accuracy of our method, which is developed with only 20 parameters, is similar to that of Cai's method. Previous workers tried to expand the quantity of information, whereas we try to improve the quality of information by including evolutionary information while keeping the same quantity. The shortcoming of our method is that its prediction accuracies for α-helix and β-strand are not sufficient: the average absolute errors for α-helix and β-strand are higher than those of Cai's method.

TABLE II. Parameters That Are Used to Make a MLR Model for Each Secondary Structure

| Secondary structure | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| H | O |   | O |   |   | O | O |   |   | O | O | O | O |   |   |   | O | O |   |   |
| E | O | O | O | O |   |   | O | O | O | O | O |   |   | O | O | O | O |   | O | O |
| T | O | O |   | O |   | O | O | O | O | O | O | O | O | O | O | O | O | O | O | O |
| π |   |   |   |   |   | O | O |   | O |   |   |   |   |   |   | O |   |   |   |   |
| I | O | O | O |   |   | O | O | O | O |   |   |   | O |   | O | O |   |   | O | O |
| B |   |   | O | O | O |   |   |   |   | O | O | O |   |   |   |   | O |   | O |   |
| S |   | O |   | O | O | O | O |   |   | O |   | O | O | O | O |   |   | O | O |   |
| C | O | O | O | O | O | O | O | O | O | O | O | O | O | O | O | O | O | O | O | O |

The secondary structure symbols H, E, T, π, I, B, S, and C denote α-helix, β-strand, three-turn helix, π-helix, hydrogen-bonded turn, β-bridge, bend, and random coil, respectively. The amino acid symbol letters, A to Y, are the conventional one-letter codes of the 20 amino acids. The letter O marks an amino acid whose composition is one of the parameters used to make the MLR model with the backward stepwise selection algorithm for that secondary structure.

Analysis of Outliers

There are eight protein secondary structure types in DSSP. Among them, α-helix, β-strand, and random coil are the most abundant. Thus, predicting α-helix, β-strand, and random coil accurately may be more meaningful than predicting the other secondary structures accurately. Although our method has the lowest overall error, it does not predict α-helix, β-strand, and random coil as well as Cai's method does. Figure 2 displays the predicted and observed contents of α-helices and β-strands for the 202 test proteins using our new method. In Figure 2(a, b), there is an obvious linear correlation between the predicted content and the observed content. According to Figure 2, our method predicts the secondary structure content reasonably well for most proteins. However, there are several outliers for which the prediction error of the α-helix content is over 20%: 1aaf (22.2%), 1cnt1 (28.2%), 1obpA (21.0%), 2cpl (20.2%), 1erp (39.5%), 1jhgA (20.3%), 1occH (25.9%), 1tnfA (32.4%), 1ytfC (35.8%), 2chsA (21.9%), 2gdm (23.3%), and 6insE (25.5%). The protein names follow the standard Protein Data Bank (PDB) identifier nomenclature. In particular, the prediction error is over 30% for 1erp, 1tnfA, and 1ytfC.
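The outlier selection described above amounts to a threshold filter on the per-protein prediction error. The sketch below applies it to the α-helix errors listed in the text; the dictionary and the helper `above_threshold` are illustrative names, not part of the original analysis.

```python
# alpha-helix content prediction errors (%) of the outliers listed above
ALPHA_HELIX_ERRORS = {
    "1aaf": 22.2, "1cnt1": 28.2, "1obpA": 21.0, "2cpl": 20.2,
    "1erp": 39.5, "1jhgA": 20.3, "1occH": 25.9, "1tnfA": 32.4,
    "1ytfC": 35.8, "2chsA": 21.9, "2gdm": 23.3, "6insE": 25.5,
}


def above_threshold(errors, threshold):
    """Return the sorted protein names whose error exceeds threshold."""
    return sorted(name for name, err in errors.items() if err > threshold)


outliers = above_threshold(ALPHA_HELIX_ERRORS, 20.0)  # all 12 proteins
severe = above_threshold(ALPHA_HELIX_ERRORS, 30.0)
# severe == ['1erp', '1tnfA', '1ytfC']
```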

The first reason why our method does not predict the α-helix content of these outlier proteins accurately is that they carry insufficient amino acid composition information. The mean number of residues over all proteins, including the outliers, is 274, whereas that of the outliers is 107. Because a short protein does not provide sufficient amino acid composition information, its content may not be predicted accurately. For example, 1erp, for which the prediction error is 39.5%, has 38 residues and contains no isoleucine, arginine, threonine, or tyrosine. 1ytfC, for which the prediction error is 35.8%, has 46 residues and contains no histidine, isoleucine, methionine, or proline. In addition, 6insE (prediction error = 25.5%) has 50 residues and contains no aspartate, methionine, or tryptophan. In other words, the amino acid composition information is not sufficient for these three proteins. Because our method relies on the amino acid composition, it may not accurately predict short proteins whose composition carries insufficient information.

On the other hand, the prediction error of the β-strand content is more than 20% for the following proteins: 3fruA (21.7%), 1lcl (22.0%), 1mhcA (22.9%), 1molA (31.2%), 1std (25.2%), 1xnb (21.6%), 2polA (26.1%), 2tgi (22.3%), 1eit (41.7%), 1tnfA (25.5%), and 1ytfC (54.7%). The prediction errors of 1molA, 1eit, and 1ytfC are even over 30%. These outlier proteins are also short: their mean number of residues is 167, whereas the mean over all test proteins is 274. Thus, the main reason why our method does not predict the β-strand content accurately for these proteins is again insufficient amino acid composition information. The prediction errors of 1eit and 1ytfC are over 40%, and both proteins are very short. The prediction error of the β-strand content for 1eit is 41.7%; because this protein has 36 residues and contains no alanine, leucine, methionine, or threonine, there is no composition information for those amino acids when predicting its β-strand content. For 1ytfC, whose error is 54.7%, there is no composition information for histidine, isoleucine, methionine, or proline. In short, the outlier proteins for α-helix and β-strand content prediction are short, so the composition of each amino acid, the main input parameter, does not carry enough information for accurate prediction.

The other reason is the abundance of specific amino acids. Figure 3 displays the cysteine composition of all proteins in the test set. Most proteins contain little cysteine, with a composition ranging from 1% to 2%. However, the composition of cysteine is significantly high

TABLE III. The Prediction Errors of Liu’s Method, Cai’s Method, and Our New Method

                                 Error for each learning tool
Protein secondary structure      Liu’s     Cai’s     Ours
Alpha-helix                      0.077     0.071     0.078
Beta-strand                      0.076     0.069     0.072
Three-turn helix                 0.045     0.022     0.019
Pi-helix                         0.002     0.002     0.000
H bonded turn                    0.059     0.028     0.024
Beta-bridge                      0.019     0.011     0.008
Bend                             0.069     0.037     0.027
Random coil                      0.088     0.037     0.039
Overall average error            0.061     0.034     0.033


in many of the outliers. Among the eight proteins that have a significantly high number of cysteines, five are outliers.

Test on Extended Dataset and Application to Protein Class Prediction

Because the number of proteins in Dataset-I is rather small (244 for training, 202 for testing), it is not entirely clear whether the prediction accuracy of our new method has been meaningfully assessed. Therefore, we apply our new method to a much larger dataset (Dataset-II) that has 4350 proteins for training and 1446 proteins for testing. Overall, as shown in Table IV, the prediction accuracy on Dataset-II is similar to that on Dataset-I, and the overall prediction error is slightly lower for Dataset-II (0.032) than for Dataset-I (0.033). These results indicate that the error assessment on Dataset-I is largely reliable and not dataset specific.
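As an arithmetic check on these figures, the sketch below recomputes the overall error from the per-type errors reported for the present method in Tables III and IV. It assumes that the overall figure is the unweighted mean of the eight per-type values, which reproduces 0.033 and 0.032 but is not stated explicitly here. It also assumes the per-type error is the mean absolute difference between predicted and observed content over proteins. The function and variable names are illustrative.

```python
# Per-type errors of the present method, copied from Table III (Dataset-I)
# and Table IV (Dataset-II).
DATASET_I_OURS = {
    "alpha_helix": 0.078, "beta_strand": 0.072, "three_turn_helix": 0.019,
    "pi_helix": 0.000, "h_bonded_turn": 0.024, "beta_bridge": 0.008,
    "bend": 0.027, "random_coil": 0.039,
}
DATASET_II_OURS = {
    "alpha_helix": 0.071, "beta_strand": 0.062, "three_turn_helix": 0.021,
    "pi_helix": 0.000, "h_bonded_turn": 0.027, "beta_bridge": 0.008,
    "bend": 0.026, "random_coil": 0.042,
}


def average_absolute_error(predicted, observed):
    """Mean |predicted - observed| over proteins (assumed error definition)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two equally long, non-empty lists")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def overall_average(per_type_errors):
    """Unweighted mean over the eight secondary structure types (assumption)."""
    return sum(per_type_errors.values()) / len(per_type_errors)


overall_i = round(overall_average(DATASET_I_OURS), 3)
overall_ii = round(overall_average(DATASET_II_OURS), 3)
```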

A different, but closely related, issue is how accurately we can predict protein structural classes from the sequence alone. There has been controversy over the accuracy of structural class prediction.23–25 It turned out that this controversy is largely due to confusion between the self-consistency test and the cross-validation test, and over the definition of the structural class. First, the definition of the structural class needs to be clearly specified. According to the definition used by Eisenhaber and colleagues23 and Nakashima and coworkers,26 proteins with α > 15% and β < 10% are assigned to the all-α class; those with α < 15% and β > 10% are assigned to the all-β class; and those with α > 15% and β > 10% are assigned to a mixed class. A small number of proteins that do not follow the above rules are classified as irregular proteins. It has been argued that this type of classification suffers from some subjective arbitrariness, because a difference of 0.01% or even less would place two proteins into two

Fig. 2. Comparison of the predicted contents and the observed contents of (a) α-helices and (b) β-strands for 202 testing proteins using our new method with SVR. Each dot represents the predicted and observed contents of one of 202 proteins.

Fig. 3. The composition of cysteine is significantly high in many of the outliers. Each point represents a protein of the test set. The x-axis is the index of the test protein, from 1 to 202, and the y-axis is the composition of a specific amino acid. A closed circle represents a protein that is an outlier in the prediction of α-helix content, and a closed rectangle represents a protein that is an outlier in the prediction of β-strand content.

TABLE IV. The Prediction Errors of the Present Method on the Large Dataset (Dataset-II)

Protein secondary structure      Average absolute error
Alpha-helix                      0.071
Beta-strand                      0.062
Three-turn helix                 0.021
Pi-helix                         0.000
H bonded turn                    0.027
Beta-bridge                      0.008
Bend                             0.026
Random coil                      0.042
Overall average error            0.032


completely different classes if the classification used by Eisenhaber and colleagues is employed.24,25 The other classification scheme is based on SCOP.16 Chou and coworkers argued24 that the SCOP classification is more meaningful because SCOP is based on the evolutionary relationships of proteins and the principles that govern their three-dimensional structures. To resolve any discrepancy between the two classification schemes, we examine how similarly the two schemes classify the 5796 protein domains in Dataset-II; the result is shown in Figure 4. Of the 1179 domains belonging to the all-α class, 1075 belong to the all alpha proteins class of SCOP. Likewise, of the 1205 domains belonging to the all-β class, 1116 belong to the all beta proteins class of SCOP. The mixed class is predominantly occupied by the alpha and beta domains. Although it is not perfect, the agreement between the two classification schemes is quite good, roughly 90%. This analysis implies that the amount of error caused by using different classification schemes is not significant. In this work, we use the classification scheme of Eisenhaber and colleagues and Nakashima and coworkers. The combined content of α-helix and three-turn helix is considered the α-helix content, while that of β-strand and β-bridge is the β-strand content. The other four classes, π-helix, hydrogen-bonded turn, bend, and random coil, are considered coil. The class prediction result when the method is tested on the test set of Dataset-II is shown in Table V. The overall class prediction accuracy is 81%. Although the test sets are different, the prediction accuracy of our method is much higher than the accuracy of 60% reported by Eisenhaber and colleagues, and comparable to that of the prediction method by Chou and coworkers.24 This result is not surprising if we consider that the secondary structure content prediction error of Eisenhaber and colleagues is 13%, much higher than that of the present method.
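The class assignment described above can be written as a short rule. This sketch reduces the eight predicted contents to three states and applies the 15% and 10% thresholds. The dictionary keys and function names are illustrative, and the use of strict inequalities at the thresholds is an assumption. The last lines recompute the class agreement and accuracy figures from the counts quoted in the text and in Table V.

```python
def three_state(content):
    """Collapse eight secondary structure fractions into (alpha, beta, coil)."""
    alpha = content["alpha_helix"] + content["three_turn_helix"]
    beta = content["beta_strand"] + content["beta_bridge"]
    coil = (content["pi_helix"] + content["h_bonded_turn"]
            + content["bend"] + content["random_coil"])
    return alpha, beta, coil


def structural_class(alpha, beta):
    """Assign a class from alpha and beta fractions (strictness is assumed)."""
    if alpha > 0.15 and beta < 0.10:
        return "all-alpha"
    if alpha < 0.15 and beta > 0.10:
        return "all-beta"
    if alpha > 0.15 and beta > 0.10:
        return "mixed"
    return "irregular"


# Counts quoted in the text: agreement with SCOP for all-alpha and all-beta.
agreement_alpha = 1075 / 1179
agreement_beta = 1116 / 1205

# (number of predictions, number correct) per class, copied from Table V.
TABLE_V = {"all-alpha": (295, 200), "all-beta": (307, 196),
           "mixed": (840, 776), "irregular": (2, 0)}
total = sum(n for n, _ in TABLE_V.values())
correct = sum(c for _, c in TABLE_V.values())
overall_accuracy = correct / total
```

The counts give 1172/1444, about 0.81, which matches the 81% quoted in the text; Table V prints 0.82 for the same total.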

CONCLUSION

In this work, we show that if evolutionary information is incorporated into the prediction algorithm, a very accurate secondary structure content prediction can be made even if only the amino acid composition is utilized. The idea is motivated by the observation that evolutionarily related proteins share similar structures.12 The performance of our secondary structure content predictor, which uses only the amino acid composition combined with evolutionary information, is at least comparable to that of the most accurate predictors, which use not only the composition information but also the coupling effects and, as a result, require more than 200 input parameters. In contrast, our method uses only 20 input parameters.
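The 20 input parameters can be pictured with a minimal sketch of a homolog-averaged composition. It assumes that every homolog in the alignment is weighted equally and that gap characters and non-standard letters are skipped. The weighting actually applied to the PSI-BLAST alignment may differ. The function names and the toy alignment are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the 20 composition values of one sequence, ignoring gaps."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("no standard residues in sequence")
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def homolog_averaged_composition(aligned_sequences):
    """Average the per-sequence compositions with equal weights (assumption)."""
    if not aligned_sequences:
        raise ValueError("need at least one sequence")
    per_sequence = [composition(seq) for seq in aligned_sequences]
    n = len(per_sequence)
    # zip(*...) walks the 20 residue types column by column.
    return [sum(column) / n for column in zip(*per_sequence)]


# Hypothetical alignment rows: query first, then two homologs.
toy_alignment = ["ACD-EA", "ACDQEA", "GCD-EA"]
features = homolog_averaged_composition(toy_alignment)
```

The resulting list of 20 numbers is what a regression model such as multiple linear regression, a neural network, or support vector regression would take as input.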

The primary experimental method to determine secondary structure content is circular dichroism (CD) spectroscopy. According to a benchmark study on the accuracy of estimating protein secondary structure from CD spectra,3 the estimation errors for α-helices and β-sheets are 0.05 and 0.06, respectively, while the error for random coils is roughly 0.1. For α-helices and β-sheets, the accuracy of predicting secondary structure content from sequence alone is slightly worse than that of the experimental method. Notably, however, for random coils the accuracy of prediction methods exceeds that of experimental methods. Our work demonstrates that judicious use of evolutionary information significantly improves the prediction accuracy. The quality of evolutionary information, which improves as the number of homologous protein sequences grows, is critical to the performance of the present method. Because the number of known protein sequences is ever growing, it is expected that extending the present method to include coupling effects and other factors will allow prediction methods to outperform the experimental methods in determining protein secondary structure content.

TABLE V. The Protein Structural Class Prediction Result When the Method is Tested on the Test Set of Dataset-II

Class        No. Predictions    Correct    Incorrect    Accuracy
all-α        295                200        95           0.68
all-β        307                196        111          0.64
mixed        840                776        64           0.92
irregular    2                  0          2            0.00
Total        1444               1172       272          0.82

Fig. 4. The correlation between the protein structural class classification scheme used by Eisenhaber and colleagues23 and Nakashima and coworkers26 and the SCOP classification. All-a, All-b, Mixed, and Irregular denote the all-α class, all-β class, mixed class, and irregular proteins, respectively, according to the classification scheme used by Eisenhaber and colleagues and Nakashima and coworkers. SCOP-a, SCOP-b, SCOP-c, SCOP-d, and SCOP-e denote all alpha proteins, all beta proteins, alpha and beta proteins (α/β), alpha and beta proteins (α+β), and multidomain proteins (alpha and beta) of the SCOP classification, respectively.

We can think of many ways to improve the prediction accuracy of our method. One obvious way is to incorporate the sequence order effect. It is expected that the prediction accuracy of the present method can be improved if the amino acid composition is augmented by the coupled amino acid composition10 or the pseudo amino acid composition,27 because these can incorporate a considerable amount of the sequence order effect. Besides secondary structure content prediction, amino acid composition information has been used in many areas of protein bioinformatics, such as subcellular localization prediction,28 enzyme subclass prediction,29 protein domain boundary prediction,30 signal sequence prediction,31 and the prediction of unfolded segments in a protein sequence.32 Therefore, it is anticipated that the idea presented in this article can be applied to those areas and can improve the performance of many protein bioinformatics tools.
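One simple way to see how sequence order can enter a composition-style feature is the frequency of adjacent residue pairs, sketched below. This is an illustrative assumption only: the coupled and pseudo amino acid compositions cited above are defined in their own references and may differ from this plain dipeptide frequency. The function name is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids, e.g. "AA", "AC", ...
PAIRS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return {pair: fraction} over adjacent standard-residue pairs."""
    seq = sequence.upper()
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for first, second in zip(seq, seq[1:]):
        pair = first + second
        if pair in counts:  # skip pairs containing non-standard letters
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence has no adjacent standard-residue pairs")
    return {pair: count / total for pair, count in counts.items()}


# "ACA" and "CAA" have the same amino acid composition but different pairs.
pairs_aca = dipeptide_composition("ACA")
pairs_caa = dipeptide_composition("CAA")
```

Two sequences with identical amino acid composition can thus yield different pair frequencies, which is the kind of order information the plain 20-component composition cannot capture.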

ACKNOWLEDGMENTS

We thank the members of Protein Bio-Informatics Laboratory (PBIL), Sanjo Han, Seung Taek Ryu, and Chan-seok Jeong, for helpful discussion.

REFERENCES

1. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983;22(12):2577–2637.

2. Pancoska P, Bitto E, Janota V, Urbanova M, Gupta VP, Keiderling TA. Comparison of and limits of accuracy for statistical analyses of vibrational and electronic circular dichroism spectra in terms of correlations to and predictions of protein secondary structure. Protein Sci 1995;4(7):1384–1401.

3. Sreerama N, Woody RW. Estimation of protein secondary structure from circular dichroism spectra: comparison of CONTIN, SELCON, and CDSSTR methods with an expanded reference set. Anal Biochem 2000;287(2):252–260.

4. Krigbaum WR, Knutton SP. Prediction of the amount of secondary structure in a globular protein from its amino acid composition. Proc Natl Acad Sci U S A 1973;70(10):2809–2813.

5. Zhang CT, Zhang Z, He Z. Prediction of the secondary structure contents of globular proteins based on three structural classes. J Protein Chem 1998;17(3):261–272.

6. Lin Z, Pan XM. Accurate prediction of protein secondary structural content. J Protein Chem 2001;20(3):217–220.

7. Pilizota T, Lucic B, Trinajstic N. Use of variable selection in modeling the secondary structural content of proteins from their composition of amino acid residues. J Chem Inf Comput Sci 2004;44(1):113–121.

8. Muskal SM, Kim SH. Predicting protein secondary structure content. A tandem neural network approach. J Mol Biol 1992;225(3):713–727.

9. Eisenhaber F, Imperiale F, Argos P, Frommel C. Prediction of secondary structural content of proteins from their amino acid composition alone. I. New analytic vector decomposition methods. Proteins 1996;25(2):157–168.

10. Liu W, Chou KC. Prediction of protein secondary structure content. Protein Eng 1999;12(12):1041–1050.

11. Cai YD, Liu XJ, Chou KC. Prediction of protein secondary structure content by artificial neural network. J Comput Chem 2003;24(6):727–731.

12. Rost B. Twilight zone of protein sequence alignments. Protein Eng 1999;12(2):85–94.

13. Jones DT. Protein secondary structure prediction based on position specific scoring matrices. J Mol Biol 1999;292:195–202.

14. Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 1993;232:584–599.

15. Chou KC. Using pair-coupled amino acid composition to predict protein secondary structure content. J Protein Chem 1999;18(4):473–480.

16. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247(4):536–540.

17. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. The ASTRAL Compendium in 2004. Nucleic Acids Res 2004;32(Database issue):D189–D192.

18. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–3402.

19. Holm L, Sander C. The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res 1996;24(1):206–209.

20. Haykin S. Neural networks: a comprehensive foundation. Upper Saddle River, NJ: Prentice Hall; 1999.

21. Vapnik V. Statistical learning theory. New York: Wiley; 1998.

22. Ruping S. mySVM-Manual, University of Dortmund, Lehrstuhl Informatik 8. Available from http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/. 2000.

23. Eisenhaber F, Frommel C, Argos P. Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins 1996;25(2):169–179.

24. Chou KC, Liu WM, Maggiora GM, Zhang CT. Prediction and classification of domain structural classes. Proteins 1998;31(1):97–103.

25. Cai YD. Is it a paradox or misinterpretation? Proteins 2001;43(3):336–338.

26. Nakashima H, Nishikawa K, Ooi T. The folding type of a protein is relevant to the amino acid composition. J Biochem (Tokyo) 1986;99(1):153–162.

27. Chou KC. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 2001;43(3):246–255.

28. Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003;19(13):1656–1663.

29. Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 2005;21(1):10–19.

30. Dumontier M, Yao R, Feldman HJ, Hogue CW. Armadillo: domain boundary prediction by amino acid composition. J Mol Biol 2005;350(5):1061–1073.

31. Chou KC. Prediction of protein signal sequences and their cleavage sites. Proteins 2001;42(1):136–139.

32. Coeytaux K, Poupon A. Prediction of unfolded segments in a protein sequence based on amino acid composition. Bioinformatics 2005;21(9):1891–1900.
