Is There an Optimal Substitution Matrix for Contact Prediction with Correlated Mutations?


Pietro Di Lena, Piero Fariselli, Luciano Margara, Marco Vassura, and Rita Casadio

Abstract: Correlated mutations in proteins are believed to occur in order to preserve the protein functional folding through evolution. Their values can be deduced from sequence and/or structural alignments and are indicative of residue contacts in the protein three-dimensional structure. A correlation among pairs of residues is routinely evaluated with the Pearson correlation coefficient and the MCLACHLAN similarity matrix. In the literature, there is no justification for the adoption of MCLACHLAN instead of other substitution matrices. In this paper, we approach the problem of computing the optimal similarity matrix for contact prediction with correlated mutations, i.e., the similarity matrix that maximizes the accuracy of contact prediction with correlated mutations. We describe an optimization procedure, based on the gradient descent method, for computing the optimal similarity matrix and perform an extensive number of experimental tests. Our tests show that there is a large number of optimal matrices that perform similarly to MCLACHLAN. We also find that the upper limit to the accuracy achievable in protein contact prediction is independent of the optimized similarity matrix. This suggests that the poor scoring of the correlated mutations approach may be due to the choice of the linear correlation function in evaluating correlated mutations.

Index Terms: Protein contact prediction, correlated mutations, similarity matrix.


1 INTRODUCTION

A large-scale statistical analysis indicates that, through evolution, side chain mutations tend to preserve the protein structure more than its sequence [15]. As a consequence, residue substitutions, when they occur, must be compensated by other mutations in spatially close neighbors (two residues in the same spatial neighborhood are termed in contact). This basic idea has been exploited first by searching for pairs of residues that might have coevolved and then by inferring that these pairs are indeed close in the three-dimensional space of the protein [14], [20]. The most widely adopted measures for scoring coevolving residue pairs are based on Pearson correlation, mutual information, and joint entropy. Among these measures, the Pearson correlation and its variants are the most popular and have been widely investigated in the literature [10], [14], [18], [19], [20]. For a residue pair in the protein sequence, the Pearson coefficient measures the degree of correlation between pairwise substitutions. Computing each pairwise correlation requires a set of similar sequences compiled in a multiple alignment and a similarity matrix to weigh residue substitutions at each position. Scores for residue substitutions are routinely provided by the MCLACHLAN similarity matrix [16], in which substitution values are assigned on the basis of physicochemical similarity between side chains. If two residues $a, b$ have a high level of physical similarity, then the substitution $a \to b$ ($b \to a$) is rewarded with a positive score; otherwise, a negative score is assigned.

To the best of our knowledge, there is no justification for the use of the MCLACHLAN matrix instead of other substitution matrices, such as BLOSUM62 [12] or PAM250 [8]. Recently, MCLACHLAN was used in conjunction with a second matrix derived from wet contacts (side chains in contact with water molecules cocrystallized within the protein), and the results suggest that water may play a role in contact predictions [21].

The main motivation of this work is to explore the space of all possible similarity matrices in order to find those that maximize the accuracy of contact prediction with the correlated mutation approach, as calculated with the Pearson coefficient. Under this condition, we show that the problem of computing the optimal similarity matrix for correlated mutations can be defined as the minimization of a particular error function. We performed several experimental tests by optimizing the similarity matrices starting from different initial solutions. For each optimized matrix, the accuracy of contact prediction is similar to that obtained with MCLACHLAN. We find that the space of similarity matrices for correlated mutations contains a huge number of optimal matrices that perform equally well. This result provides a justification for adopting MCLACHLAN and, at the same time, defines an upper limit to the accuracy achievable in predicting residue contacts.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 4, JULY/AUGUST 2011

P. Di Lena, L. Margara, and M. Vassura are with the Department of Computer Science, University of Bologna, Via Mura Anteo Zamboni 7, 40127 Bologna, Italy. E-mail: {dilena, margara, vassura}@cs.unibo.it.

P. Fariselli and R. Casadio are with the Biocomputing Group, Department of Biology, University of Bologna, Via S. Giacomo 9/2, 40127 Bologna, Italy. E-mail: {piero, casadio}@biocomp.unibo.it.

Manuscript received 19 Mar. 2010; revised 11 May 2010; accepted 17 May 2010; published online 9 Sept. 2010. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2010-03-0075. Digital Object Identifier no. 10.1109/TCBB.2010.91.

1545-5963/11/$26.00 © 2011 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

2 PRELIMINARIES

2.1 Contact Maps and Evaluation Criteria for Contact Prediction

The map of contacts (contact map) of a protein $P$ is a two-dimensional approximation of the three-dimensional structure of $P$. There are several definitions of residue contacts in the literature. We adopt the physical representation, which defines two residues to be in contact when the minimum distance between every possible pair of heavy atoms (all atoms except hydrogens, which are absent from the vast majority of PDB structures) in the two residues is less than or equal to 5 Angstrom (Å) [5]. According to [13], [17], 4.5-5 Å is the largest distance that does not allow insertion of water molecules between two residues. For a given protein $P$, its contact map $M^P$ is a symmetric square matrix, which can be defined as

$$M^P_{ij} = \begin{cases} 1, & \text{if residues } i \text{ and } j \text{ are in contact}, \\ 0, & \text{otherwise}. \end{cases}$$
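The contact definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `contact_map`, the input layout (one list of heavy-atom coordinates per residue), and the toy coordinates are hypothetical, and the diagonal is left at 0 because the text does not define self-contacts.

```python
import math

def contact_map(residues, threshold=5.0):
    """residues: one list of (x, y, z) heavy-atom coordinates (in Angstrom)
    per residue. Returns a symmetric 0/1 matrix."""
    n = len(residues)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Minimum distance over every pair of heavy atoms of the two residues.
            d_min = min(math.dist(a, b) for a in residues[i] for b in residues[j])
            if d_min <= threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Hypothetical toy "protein": three residues placed along the x axis.
toy = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(4.0, 0.0, 0.0)],
    [(20.0, 0.0, 0.0)],
]
cmap = contact_map(toy)
```

In the toy input only the first two residues come within 5 Å of each other (closest atoms are 2.5 Å apart), so only that pair is marked as a contact.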

The presence of a contact between two residues is a structural constraint for the correct fold. If we were able to predict just a few suitable contacts of a target protein, then we could use such predictions as an intermediate step for the reconstruction of the tertiary structure [24] or as a tool for the validation of novel structure predictions computed by other methods [7].

The research in contact prediction is assessed every two years in CASP experiments [9], following the EVAcon indications [11]. The most important evaluation measure defined in EVAcon is the accuracy of prediction, which is defined as

$$\frac{\text{number of correctly predicted contacts}}{\text{predicted contacts}}. \tag{1}$$

Usually, contact predictors do not predict a binary matrix but assign a contact score (i.e., a probability of contact) to each pair of residues. Thus, the accuracy is calculated by applying (1) to the $L$, $L/2$, $L/5$, and $L/10$ highest-scoring pairs (where $L$ is the length of the protein). To obtain a better measure of accuracy, three different classes of residue pairs are considered, depending on their sequence separation (the number of residues in the segment between them): long range (sequence separation $\geq 24$), medium range ($\geq 12$), and short range ($\geq 6$). Residue contacts whose sequence separation is below 6 do not provide useful information about the protein folding and are not evaluated. The difficulty of contact prediction increases with sequence separation. Long-range contacts are the most informative about the protein structure since they impose the strongest constraints on the protein fold. Thus, the performance of a contact predictor is usually ranked in terms of its prediction accuracy for long-range contacts.
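As a rough illustration of how (1) is applied to a ranked list, the sketch below scores the top $k$ pairs at a given minimum sequence separation. The function name `top_k_accuracy`, the dictionary-based score layout, and the toy values are assumptions made for this example; sequence separation is taken, as in the text, to be the number of residues strictly between $i$ and $j$.

```python
def top_k_accuracy(scores, cmap, k, min_sep=24):
    """scores: dict {(i, j): contact score} with i < j; cmap: binary contact map.
    Returns the fraction of true contacts among the k highest-scoring pairs
    whose sequence separation is at least min_sep."""
    eligible = [(s, i, j) for (i, j), s in scores.items() if j - i - 1 >= min_sep]
    eligible.sort(key=lambda t: t[0], reverse=True)
    top = eligible[:k]
    if not top:
        return 0.0
    return sum(cmap[i][j] for _, i, j in top) / len(top)

# Hypothetical 5-residue example; min_sep is lowered to 1 so the toy has pairs.
cmap = [[0] * 5 for _ in range(5)]
for i, j in [(0, 2), (1, 4)]:
    cmap[i][j] = cmap[j][i] = 1
scores = {(0, 1): 1.0, (0, 2): 0.9, (0, 3): 0.8, (1, 4): 0.7,
          (0, 4): 0.5, (2, 4): 0.2, (1, 3): 0.1}
acc_top2 = top_k_accuracy(scores, cmap, k=2, min_sep=1)
```

For a real protein of length $L$ one would call the function with `k = L`, `L // 2`, `L // 5`, and `L // 10` and with `min_sep` set to 24, 12, or 6.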

2.2 Correlated Mutations and Contact Prediction

Correlated mutations have been intensively exploited for residue contact predictions [10], [18], [20]. All the best-known implementations of this approach compute the correlated mutation scores by means of the Pearson correlation coefficient (or its variants). In the following, we describe in detail the standard implementation of this method [10].

Consider a protein $P$ and denote by $A^P$ the matrix representing the multiple sequence alignment for a family of proteins similar to $P$. The first column of $A^P$ contains the residue chain of $P$, the remaining columns of $A^P$ correspond to the aligned protein chains, and every row $i$ of $A^P$ corresponds to the observed substitutions for the $i$th residue of $P$ (according to the alignment). In detail, for every position $i$, we have the sequence of amino acid substitutions

$$A^P_{i0} \to A^P_{i1},\; A^P_{i0} \to A^P_{i2},\; \ldots,\; A^P_{i1} \to A^P_{i2},\; A^P_{i1} \to A^P_{i3},\; \ldots$$

which correspond to the vector of scores

$$v_i = \left( S_{A^P_{i0} A^P_{i1}},\; S_{A^P_{i0} A^P_{i2}},\; \ldots,\; S_{A^P_{i1} A^P_{i2}},\; S_{A^P_{i1} A^P_{i3}},\; \ldots \right),$$

where $S$ is some amino acid similarity matrix (i.e., $S_{A^P_{i0} A^P_{i1}}$ is the score provided by the similarity matrix $S$ for the substitution $A^P_{i0} \to A^P_{i1}$). MCLACHLAN is the similarity matrix routinely adopted to compute the substitution scores for correlation coefficients, but the approach itself does not depend on any particular similarity matrix.
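The construction of $v_i$ can be sketched as follows. The names `score_vector` and `S`, the three-letter toy alignment row, and the score values are hypothetical and are not taken from any published matrix; the pair ordering produced by `itertools.combinations` matches the ordering of substitutions listed above.

```python
from itertools import combinations

def score_vector(row, S):
    """row: residues observed at one alignment position (row[0] is the query
    protein P). S: dict mapping an ordered residue pair to its score.
    Returns one score per pair of aligned sequences."""
    return [S[a, b] for a, b in combinations(row, 2)]

# Hypothetical symmetric scores on a two-letter alphabet.
S = {("A", "A"): 2, ("A", "V"): 1, ("V", "A"): 1, ("V", "V"): 3}
v = score_vector("AAV", S)  # substitutions A->A, A->V, A->V
```

A row with $n$ aligned sequences yields a vector of length $n(n-1)/2$, which is why large alignments give long score vectors.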

The amount of correlation between sites $i$ and $j$ is computed by means of the Pearson correlation between vectors $v_i$ and $v_j$. In detail, for every pair of positions $i, j$, we compute the substitution score vectors $v_i, v_j$ as described above, taking care to maintain a correspondence between substitution pairs and entries of $v_i, v_j$, i.e., if for some $k$, $v_{ik}$ is the score of the substitution $A^P_{ip} \to A^P_{iq}$, then $v_{jk}$ must be the score of the substitution $A^P_{jp} \to A^P_{jq}$. This is necessary because similarity matrices do not provide scores for substitutions that involve gaps, so some entries of $v_i, v_j$ may have to be removed. The Pearson correlation coefficient for positions $i, j$ of $P$ is defined by

$$C^P_{ij} = \frac{1}{N} \sum_{k=1}^{N} \frac{(v_{ik} - \bar{v}_i)(v_{jk} - \bar{v}_j)}{\sigma_i \sigma_j}, \tag{2}$$

where $N$ is the length of the vectors $v_i, v_j$ and

$$\bar{v}_i = \frac{1}{N} \sum_{k=1}^{N} v_{ik} \quad (\text{average of } v_i),$$

$$\sigma_i = \sqrt{\frac{\sum_{k=1}^{N} (v_{ik} - \bar{v}_i)^2}{N}} \quad (\text{standard deviation of } v_i).$$

By definition, the Pearson correlation coefficient $C^P_{ij}$ is invariant to any linear transformation of the two vectors $v_i, v_j$ and it ranges over the interval $[-1, 1]$, due to the normalization factor $\sigma_i \sigma_j$. The coefficient $C^P_{ij}$ can be used to quantify the degree of linear correlation between the (observed) mutations of the $i$th residue and the mutations of the $j$th residue of $P$. If $C^P_{ij} \approx 1$, then the mutations at sites $i, j$ are positively correlated; if $C^P_{ij} \approx -1$, they are negatively correlated. When $C^P_{ij} \approx 0$, the mutations at sites $i, j$ are uncorrelated. In the standard implementation of this approach, perfectly conserved positions (i.e., positions that correspond to perfectly conserved residues in the multiple alignment) and positions containing more than 10 percent gaps in the multiple alignment are not considered, since they are uninformative for this analysis. In fact, for perfectly conserved positions the Pearson correlation (2) is undefined because the standard deviation is equal to 0.
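A minimal sketch of (2), including the paired removal of gapped entries and the undefined case for conserved positions, might look like this. The helper names, the gap symbol `"-"`, and the use of an identity-style score table are assumptions for illustration; returning `None` for a zero standard deviation is one possible convention, not necessarily the one used by the authors.

```python
import math
from itertools import combinations

def paired_score_vectors(row_i, row_j, S, gap="-"):
    """Build v_i and v_j so that entry k refers to the same sequence pair (p, q)
    in both vectors; pairs that touch a gap at either position are dropped."""
    v_i, v_j = [], []
    for p, q in combinations(range(len(row_i)), 2):
        if gap in (row_i[p], row_i[q], row_j[p], row_j[q]):
            continue
        v_i.append(S[row_i[p], row_i[q]])
        v_j.append(S[row_j[p], row_j[q]])
    return v_i, v_j

def pearson(v_i, v_j):
    """Pearson coefficient as in (2); None when it is undefined."""
    n = len(v_i)
    if n == 0:
        return None
    mean_i, mean_j = sum(v_i) / n, sum(v_j) / n
    sd_i = math.sqrt(sum((x - mean_i) ** 2 for x in v_i) / n)
    sd_j = math.sqrt(sum((y - mean_j) ** 2 for y in v_j) / n)
    if sd_i == 0 or sd_j == 0:
        return None  # e.g., a perfectly conserved position
    cov = sum((x - mean_i) * (y - mean_j) for x, y in zip(v_i, v_j)) / n
    return cov / (sd_i * sd_j)

# Identity-style scores on a hypothetical three-letter alphabet.
alphabet = "AVL"
S = {(a, b): 1 if a == b else 0 for a in alphabet for b in alphabet}
v_i, v_j = paired_score_vectors("AAVV", "LLAA", S)
c = pearson(v_i, v_j)
```

In the toy rows the two positions change residue in the same sequences, so the two score vectors coincide and the coefficient is close to 1.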

Referring to the contact prediction problem, it is generally believed that the amount of correlation between pairs of coevolving sites in a protein sequence provides an indication of contacts between residues in the protein: the higher the coefficient $C^P_{ij}$, the higher the probability that residues $i, j$ are in contact.


3 MATERIALS AND METHODS

3.1 Minimization with Gradient Descent

The values of the correlation coefficients provided by (2) depend on the similarity matrix $S$ used to compute the substitution scores. The aim of this paper is to search the space of substitution matrices for the one that maximizes the accuracy of contact prediction with correlated mutations (when the correlation measure is provided by the Pearson correlation coefficients). This can be done effectively by recasting the search for the optimal similarity matrix for correlated mutations as a minimization problem.

Consider some similarity matrix $S$ and some training set $T$ of proteins. We define two different error functions, $E_1$ and $E_2$, with respect to $T$ and $S$:

$$E_1 = -\sum_{P \in T} \sum_{i \geq 1} \sum_{j > i+6} C^P_{ij}\, \delta_{ij}, \tag{3}$$

and

$$E_2 = -\sum_{P \in T} \sum_{i \geq 1} \sum_{j > i+6} C^P_{ij}\, \Delta_{ij}, \tag{4}$$

where

$$\delta_{ij} = \begin{cases} 1, & \text{if } M^P_{ij} = 1, \\ -1, & \text{if } M^P_{ij} = 0, \end{cases} \qquad \Delta_{ij} = \begin{cases} 1, & \text{if } M^P_{ij} = 1, \\ 0, & \text{if } M^P_{ij} = 0. \end{cases}$$

$M^P$ is the contact map of protein $P$ (see Section 2.1) and $C^P_{ij}$ is the Pearson correlation coefficient for residues $i, j$ of $P$, computed with similarity matrix $S$ (see Section 2.2). We defined the error functions only for residue pairs with sequence separation $\geq 6$, since this is the minimum sequence separation for which the contact prediction is meaningful (see Section 2.1). For the error function $E_1$ in (3), if residues $i, j$ are not in contact (i.e., $M^P_{ij} = 0$, so that $\delta_{ij} = -1$), then $E_1$ increases by $C^P_{ij}$, while if $i, j$ are in contact (i.e., $M^P_{ij} = 1$), then $E_1$ increases by $-C^P_{ij}$. For the error function $E_2$ in (4), the cost is determined only by residue pairs $i, j$ that are in contact: if $i$ and $j$ are not in contact, then $\Delta_{ij} = 0$ and the correlation $C^P_{ij}$ between sites $i$ and $j$ makes no contribution to (4). In both cases, the similarity matrix that minimizes $E_1$ or $E_2$ maximizes the correlation between residue contacts and correlated mutation coefficients (for the set $T$). The minimization procedure we use is the same for both $E_1$ and $E_2$.
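The contribution of a single protein to the two error functions can be sketched as below. The function name `error_terms`, the nested-list layout of the correlation matrix, and the convention of skipping undefined coefficients (`None`) are assumptions for this example; the toy call lowers the separation threshold so that a 4-residue example has eligible pairs.

```python
def error_terms(C, M, min_sep=6):
    """C[i][j]: Pearson coefficient for residues i, j (None if undefined).
    M: binary contact map. Returns this protein's contribution (E1, E2),
    summing over pairs with j > i + min_sep."""
    e1 = e2 = 0.0
    n = len(M)
    for i in range(n):
        for j in range(i + min_sep + 1, n):
            c = C[i][j]
            if c is None:
                continue
            if M[i][j] == 1:
                e1 -= c  # contacts lower both errors when c > 0
                e2 -= c
            else:
                e1 += c  # noncontacts only enter E1
    return e1, e2

# Hypothetical 4-residue example with min_sep lowered to 1.
C = [[0.0] * 4 for _ in range(4)]
C[0][2], C[0][3], C[1][3] = 0.8, -0.2, 0.5
M = [[0] * 4 for _ in range(4)]
M[0][2] = M[2][0] = 1
e1, e2 = error_terms(C, M, min_sep=1)
```

Summing these contributions over all proteins in the training set gives the values in (3) and (4).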

Starting from an initial similarity matrix $S = S^{(0)}$, we can compute a new similarity matrix which minimizes $E$ (with either $E = E_1$ or $E = E_2$) by using the standard gradient descent method [22]. The similarity matrix $S^{(k)}$ at step $k > 0$ is computed by

$$S^{(k)}_{lm} = S^{(k-1)}_{lm} - \eta\, \frac{\partial E}{\partial S^{(k-1)}_{lm}},$$

where $\eta$ is the learning rate of the minimization procedure. The partial derivative of the correlation coefficient $C^P_{ij}$ with respect to $S_{lm}$ (i.e., with respect to the substitution $l \to m$) is given in the Appendix. In our tests, the learning rate $\eta$ is initially set to 0.01 and is halved each time the error function increases between two successive steps. We ran the minimization procedure for 150 epochs and then returned the similarity matrix that obtained the lowest error.
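The update rule and the learning-rate schedule can be sketched generically. The code below is not the authors' implementation: it treats the matrix entries as a flat parameter list, takes the error and gradient as caller-supplied functions (the analytic gradient from the Appendix is not reproduced here), and assumes that a step which increases the error is still kept while the rate is halved. A simple quadratic stands in for $E$ in the toy run.

```python
def gradient_descent(error, gradient, s0, rate=0.01, epochs=150):
    """Plain gradient descent over a flat list of parameters.
    Halves the learning rate whenever the error increases between two
    successive steps and returns the lowest-error parameters seen."""
    s = list(s0)
    prev_e = error(s)
    best_s, best_e = list(s), prev_e
    for _ in range(epochs):
        g = gradient(s)
        s = [x - rate * gx for x, gx in zip(s, g)]
        e = error(s)
        if e > prev_e:
            rate /= 2
        if e < best_e:
            best_s, best_e = list(s), e
        prev_e = e
    return best_s, best_e

# Stand-in error function (an assumption for the demo): distance from all-ones.
def toy_error(s):
    return sum((x - 1.0) ** 2 for x in s)

def toy_gradient(s):
    return [2.0 * (x - 1.0) for x in s]

best_s, best_e = gradient_descent(toy_error, toy_gradient, [0.0, 0.0])
```

With the small rate of 0.01 the toy run converges slowly but monotonically, so the rate is never halved; on a rougher error surface the halving branch would be taken more often.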

In order to obtain meaningful results for the minimization on $E_1$, we need to balance the number of positive and negative examples. In fact, since contacts in a contact map are generally sparse compared with noncontacts, the minimization procedure tends to favor noncontacts, leading to a similarity matrix which minimizes the correlation coefficients for almost all pairs of contacts. We overcome the problem in the following standard way. At every iteration step and for every protein $P \in T$, we balance the number of negative and positive examples from $P$ by taking all the available $n$ residue pairs in contact (with sequence separation $\geq 6$) and $n$ randomly chosen (distinct) residue pairs not in contact. In this way, the number of positive and negative examples is exactly equal at every iteration step, and randomness ensures that most of the noncontact pairs are taken into account during the minimization process.
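The balancing step can be sketched as follows. The function name `balanced_pairs` and the fallback to `min(len(pos), len(neg))` when a protein has fewer noncontacts than contacts are assumptions added for the example; `random.sample` draws distinct pairs without replacement, as the text requires.

```python
import random

def balanced_pairs(M, min_sep=6, rng=random):
    """Return (contacts, sampled_noncontacts) for one protein, restricted to
    pairs with sequence separation >= min_sep, with equal counts."""
    n = len(M)
    pos, neg = [], []
    for i in range(n):
        for j in range(i + min_sep + 1, n):
            (pos if M[i][j] == 1 else neg).append((i, j))
    k = min(len(pos), len(neg))
    return pos[:k], rng.sample(neg, k)

# Hypothetical 10-residue contact map with two long-enough contacts.
M = [[0] * 10 for _ in range(10)]
for i, j in [(0, 8), (1, 9)]:
    M[i][j] = M[j][i] = 1
pos, neg = balanced_pairs(M, rng=random.Random(0))
```

Calling the function again at every iteration with a fresh random state gives a different set of noncontacts each time, which is how most noncontact pairs end up being visited over the course of the minimization.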

3.2 Data Sets

We selected from PDB [6] all protein chains whose structures have resolution $< 2.5$ Å. We excluded from this set all proteins whose structures contain missing residue coordinates and removed sequence redundancy with BLAST [1] in order to obtain a set of protein chains with sequence identity lower than 25 percent. For each protein $P$ in this set, we performed a BLAST search against the UniRef90 data set [23] to compile a multiple sequence alignment with respect to each query protein $P$. From this set, we removed all protein chains for which BLAST returned fewer than 100 aligned sequences (the correlation coefficients are not reliable when the set of aligned sequences is small [10], [18]), ending up with a set of 899 protein chains. This set has been randomly partitioned into four disjoint sets, set1, set2, set3, set4, of the same cardinality (225 proteins each, except set4, which contains 224 proteins), well balanced with respect to the SCOP structural classes [2] and protein chain lengths. We performed tests in cross validation: when $\text{set}_i$ is considered as the test set, the training set consists of $\bigcup_{j \neq i} \text{set}_j$. Thus, for each experiment, we performed four distinct tests.
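The four-way cross-validation layout can be sketched as below. This is a simplified assumption: the split here is purely random, whereas the sets described above are additionally balanced by SCOP class and chain length, which this sketch does not attempt. The chain identifiers are hypothetical placeholders.

```python
import random

def four_fold_splits(chains, seed=0):
    """Randomly partition the chains into four disjoint folds and return a
    list of (training set, test set) pairs, one per fold."""
    shuffled = list(chains)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[k::4] for k in range(4)]
    splits = []
    for i, test in enumerate(folds):
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        splits.append((train, test))
    return splits

chains = [f"chain{n}" for n in range(899)]  # placeholder identifiers
splits = four_fold_splits(chains)
sizes = [len(test) for _, test in splits]
```

With 899 chains the slicing yields three folds of 225 and one of 224, matching the cardinalities reported in the text.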

3.3 Similarity Matrices

In order to obtain meaningful tests, it is useful to consider several different initial similarity matrices. This is necessary because the gradient descent method can be trapped in local minima, so the final solution can be strongly influenced by the initial solution chosen.

We considered six distinct initial similarity matrices for our tests: MCLACHLAN [16], BLOSUM62 [12], PAM250 [8], the IDENTITY matrix (all entries are set to 0 except on the main diagonal, where they are set to 1), and two symmetric randomly generated matrices, RANDOM1 (see Table 8) and RANDOM2 (see Table 9), whose entries are rational numbers in the interval $[-1, 1]$. Note that, since the Pearson correlation is invariant to linear transformations, the correlation between two coevolving sites is invariant with respect to linear transformations of the similarity matrix. Thus, we can use the correlation coefficient of (2) to evaluate the correlation (similarity) between the substitution matrices described above (the correlation has been computed by considering only the upper triangle of the matrices, since they are symmetric). The correlation coefficients are shown in Fig. 1. We can notice that MCLACHLAN, BLOSUM62, and PAM250 are highly correlated, largely independently of the properties from which they were derived. The IDENTITY matrix is more correlated with the BLOSUM62 matrix than with the other matrices. As expected, there is no (positive or negative) correlation between the two random matrices (RANDOM1 and RANDOM2) and those derived from different properties.
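Comparing two substitution matrices through the Pearson coefficient of their upper triangles can be sketched as follows. The function names and the 3-by-3 toy matrices are hypothetical; the diagonal is included in the upper triangle here, which is an assumption (without it the IDENTITY matrix would be constant and its correlation undefined).

```python
import math

def upper_triangle(S):
    """Entries S[l][m] with l <= m, diagonal included."""
    n = len(S)
    return [S[l][m] for l in range(n) for m in range(l, n)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return cov / (sx * sy)

def matrix_correlation(S1, S2):
    return pearson(upper_triangle(S1), upper_triangle(S2))

# Toy symmetric matrix and a linearly transformed copy (3 * A + 2).
A = [[2, -1, 0], [-1, 3, 1], [0, 1, 4]]
B = [[3 * a + 2 for a in row] for row in A]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
corr_ab = matrix_correlation(A, B)
corr_ai = matrix_correlation(A, identity)
```

The pair `A`, `B` illustrates the invariance noted in the text: rescaling and shifting a matrix leaves its correlation with the original at (numerically) 1.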

4 RESULTS

The general scheme of our tests is the following. We start with an initial similarity matrix $S$ and compute the optimized similarity matrix $S'$ by using the minimization procedures described in Section 3.1. Then, we consider the accuracies of prediction obtained by using the optimized matrix $S'$. The accuracies shown are the average of the accuracies obtained in cross validation on the four data sets described in Section 3.2. The results obtained by minimization on the error function $E_1$ from (3) are shown in Section 4.1. The results obtained by minimization on the error function $E_2$ from (4) are shown in Section 4.2. As a further test, in Section 4.3, we evaluate how our optimization procedure behaves when it is applied to multiple sequence alignments with different degrees of sequence identity.

For the sake of clarity, in Table 1, we show the accuracies obtained when predicting contacts with the similarity matrices described in the previous section. For each matrix, the average accuracies for sequence separation $\geq 24$ have been computed separately for each test set (set1, set2, set3, set4), and the average accuracy ± the standard deviation over the four sets is shown. The accuracies obtained by using MCLACHLAN, BLOSUM62, and IDENTITY are comparable, while PAM250 performs slightly worse. This is in agreement with our previous observation (Fig. 1) that such matrices are highly correlated. The fact that the IDENTITY matrix performs very close to MCLACHLAN and BLOSUM62 is notable. It indicates that most of the information that correlated mutations can provide about residue contacts is in fact related to the amount of conservation of residue pairs in the multiple alignments. As expected, the two random matrices perform poorly compared to the other similarity matrices. Nevertheless, the accuracies obtained with the RANDOM matrices are still better than those we would obtain with a random predictor (approximately 3-4%) [11]. Actually, a random similarity matrix always assigns a specific score to each pair of residues, so that only the relationships between pairs are random. We remark that MCLACHLAN in Table 1 represents the standard method [10], [18] and can be taken as a benchmark of the current state of the art of contact prediction with correlated mutations, evaluated following EVAcon [11].

4.1 Minimization on Error Function E1

In Table 2, we show the accuracies obtained by using the similarity matrices computed with the minimization procedure described in Section 3.1 for the error function $E_1$ in (3). The initial similarity matrix for the minimization procedure is listed in the first column of Table 2. As specified in Section 3.2, when the test set is $\text{set}_i$ (for $i \in \{1, 2, 3, 4\}$), the training set is defined by $\bigcup_{j \neq i} \text{set}_j$ (for $j \in \{1, 2, 3, 4\}$). Thus, for each test, we obtained four distinct optimized similarity matrices. We will refer to such matrices in the following way: the optimized matrix obtained by using MCLACHLAN as the initial solution and by training on $\bigcup_{i \in \{2,3,4\}} \text{set}_i$ is denoted as MCLACHLAN.1, the one obtained by training on $\bigcup_{i \in \{1,3,4\}} \text{set}_i$ is denoted as MCLACHLAN.2, and so on. In this way, the contact predictions for set1 have been obtained by using MCLACHLAN.1, the predictions for set2 by using MCLACHLAN.2, and so on. The procedure is the same for the other matrices.

Fig. 1. Pearson correlation coefficients between similarity matrices. Darker colors correspond to lower correlation.

TABLE 1. Accuracies (Percent) of Contact Prediction (with Correlated Mutations) for Sequence Separation ≥ 24 (see Section 2.1). The predictions have been computed by using the substitution matrices in the first column. The accuracies shown are the average accuracies ± the standard deviation over the accuracies obtained on set1, set2, set3, set4 (see Section 3.2).

TABLE 2. Accuracies (Percent) of Contact Prediction (with Correlated Mutations) for Sequence Separation ≥ 24. The predictions have been computed by using the substitution matrices obtained with the minimization procedure described in Section 3.1 on error function $E_1$. The initial solution is the matrix listed in the first column. The accuracies shown are the average accuracies ± the standard deviation over the accuracies obtained on set1, set2, set3, set4 (see Section 3.2).

It is evident that, in all tested cases, our minimization procedure produces similarity matrices that provide almost the same performance as MCLACHLAN and BLOSUM62 (compare Tables 1 and 2), even when the initial solution is randomly chosen (RANDOM1 and RANDOM2).

The correlation coefficients between the optimized similarity matrices for set1 only are shown in Fig. 2. The correlations between all computed matrices are shown in Fig. 6. We notice that when we start with some initial matrix, the four matrices obtained by applying the minimization procedure on the four different training sets are always highly correlated (i.e., the white four-by-four squares along the main diagonal in Fig. 6). In this case, the minimum correlation coefficient observed is about 0.95. It is also evident that there is some correlation between every pair of optimized matrices (the minimum correlation coefficient observed is about 0.67). The two most distant groups of matrices are those obtained by minimization on PAM250 and those obtained by minimization on RANDOM1, which roughly correspond to the worst and best performing optimized similarity matrices, respectively (see Table 2).

4.2 Minimization on Error Function E2

In Table 3, we show the accuracies obtained by using thesimilarity matrices computed with the minimization proce-dure described in Section 3.1 on error function E2 in (4). Thecorrelation coefficients between the optimized similaritymatrices for set1 only are shown in Fig. 3. The correlation

coefficients between all the optimized similarity matrices,shown in Fig. 7, indicates that all the optimized matrices arevery highly correlated: almost all the matrices have correla-tion�0:99 and the minimum correlation is�0:97. Differently,from what we obtained in Section 4.1 (see Fig. 2), the surfaceof the error function E2 seems to be quite smooth, thus theminimization procedure eventually leads to a uniqueminimum independently of the starting point. This is alsoconfirmed by the fact that the learning rate � is decreased justtwo times in 150 epochs starting from IDENTITY, RAN-DOM1, and RANDOM2 and never decreased starting fromMCLACHLAN, BLOSUM62, and PAM250. In terms ofcontact prediction accuracy (see Table 3), the results arecomparable and slightly better on average than thoseobtained in the previous section. If we compare the resultsin Table 3 with the standard method (i.e., MCLACHLAN) inTable 1, we notice that with our optimized matrices there issome improvement in contact prediction accuracy (less than1 percent). In Table 4, we show the optimized similarity

DI LENA ET AL.: IS THERE AN OPTIMAL SUBSTITUTION MATRIX FOR CONTACT PREDICTION WITH CORRELATED MUTATIONS? 1021

Fig. 2. Pearson correlation coefficients between the optimizedsimilarity matrices for set1, computed with the minimization procedureof Section 3.1 on error function E1. Darker colors correspond to lowerlevel of correlation. The correlations between all computed matricesare shown in Fig. 6.

Fig. 3. Pearson correlation coefficients between the optimizedsimilarity matrices for set1, computed with the minimization procedureof Section 3.1 on error function E2. Darker colors correspond to lowerlevel of correlation. The correlations between all computed matricesare shown in Fig. 7.

TABLE 3
Accuracies (Percent) of Contact Prediction (with Correlated Mutations) for Sequence Separation ≥ 24

The predictions have been computed by using the substitution matrices obtained with the minimization procedure described in Section 3.1 on error function E2. The initial solution is the matrix listed in the first column. The accuracies shown are the average accuracies ± the standard deviation over the accuracies obtained on set1, set2, set3, set4 (see Section 3.2).

matrix obtained with the minimization procedure on the error function E2. The similarity matrix in Table 4 has been obtained by taking the average (after standardization) of all the 24 optimized matrices obtained with the minimization procedure and then rounding to integer numbers. The performance of our optimized matrix in terms of prediction accuracy is equal to that shown in Table 3. It is worth noticing that, in the optimized similarity matrix of Table 4, the scores on the main diagonal are far higher than the other substitution scores. This confirms our previous observation that residue pair conservation in multiple alignments has a key role in contact prediction with correlated mutations.
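The averaging step described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used to build Table 4: the function names are hypothetical, standardization is assumed to mean rescaling all entries of a matrix to zero mean and unit (population) variance, and the `scale` multiplier applied before rounding is an assumption, since the text does not state how standardized scores are mapped to the integer range of Table 4.

```python
from statistics import fmean, pstdev


def standardize(matrix):
    """Rescale all entries of a matrix to zero mean and unit variance."""
    flat = [x for row in matrix for x in row]
    mu, sigma = fmean(flat), pstdev(flat)
    if sigma == 0:
        raise ValueError("a constant matrix cannot be standardized")
    return [[(x - mu) / sigma for x in row] for row in matrix]


def mean_similarity_matrix(matrices, scale=1.0):
    """Average standardized matrices entry by entry, then round to integers.

    `scale` is a hypothetical multiplier applied before rounding (assumption).
    """
    std = [standardize(m) for m in matrices]
    n_rows, n_cols = len(std[0]), len(std[0][0])
    return [
        [round(scale * fmean(m[r][c] for m in std)) for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Toy 2x2 matrices (made-up values): both standardize to [[1, -1], [-1, 1]].
a = [[4, 0], [0, 4]]
b = [[10, 2], [2, 10]]
mean_matrix = mean_similarity_matrix([a, b], scale=10)  # [[10, -10], [-10, 10]]
```

Standardizing before averaging prevents matrices with a larger numeric range from dominating the mean, which is presumably why the step is applied here.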

Furthermore, it is very interesting, and not completely expected, that by optimizing a substitution matrix (even starting from random entries) for the contact prediction task we end up with a substitution matrix that contains biochemical information (see dendrogram in Fig. 4). For instance, charged residues tend to be favorably interchanged between pairs of the same charge (R-K and D-E), and hydrophobic residues tend to be positively substituted (I, L, M, V), as do aromatic-hydrophobic residues (F, Y, W). This unexpected result also confirms that the allowed substitutions must be “chemically compliant” in order to preserve the correlation among contacts in the protein structures.

4.3 Minimization on Error Function E2 by Using Different Thresholds of Sequence Identity

It has been noticed that the correlated mutation approach is biased by the quality of the multiple sequence alignments and that it is possible to improve the predictive accuracy by filtering out background noise [3], [4]. Although our protein data set for the alignment is UniRef90, which, by construction, does not contain identical sequences, it is not guaranteed that our minimization approach is unaffected by the different degrees of sequence similarity generated by the BLAST runs for the various training sequences. To test this, we built three artificial data sets starting from the original one, with a controlled level of sequence similarity. In practice, for each sequence s with its corresponding multiple sequence alignment MSA(s) obtained with the procedure described in Section 3.2, we generated three different subalignments: MSA_h(s) (high-sequence identity), MSA_m(s) (medium-sequence identity), and MSA_l(s) (low-sequence identity). In detail, MSA_h(s) contains only the retrieved sequences that have a sequence identity ≥ 60% with respect to the query s. MSA_m(s) contains only the retrieved sequences that have a sequence identity < 60% and ≥ 30% with respect to s. MSA_l(s) contains the aligned sequences that have a sequence identity < 30% with respect to s. We selected these thresholds to have an almost balanced number of sequences in the MSA_h and MSA_l sets.
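The filtering into three subalignments can be sketched as below. This is a minimal illustration and not the authors' pipeline: the function names are hypothetical, and sequence identity is assumed to be the fraction of identical residues over the alignment columns where neither the query nor the retrieved sequence has a gap (other normalizations, for example by query length, are equally plausible).

```python
def sequence_identity(query, hit, gap="-"):
    """Identity over columns where neither aligned sequence has a gap (assumption)."""
    if len(query) != len(hit):
        raise ValueError("aligned sequences must have equal length")
    columns = [(q, h) for q, h in zip(query, hit) if q != gap and h != gap]
    if not columns:
        return 0.0
    return sum(q == h for q, h in columns) / len(columns)


def split_msa_by_identity(query, hits, high=0.60, low=0.30):
    """Partition aligned hits into (MSA_h, MSA_m, MSA_l) by identity to the query."""
    msa_h, msa_m, msa_l = [], [], []
    for hit in hits:
        identity = sequence_identity(query, hit)
        if identity >= high:        # identity >= 60%
            msa_h.append(hit)
        elif identity >= low:       # 30% <= identity < 60%
            msa_m.append(hit)
        else:                       # identity < 30%
            msa_l.append(hit)
    return msa_h, msa_m, msa_l


# Made-up aligned sequences, for illustration only.
query = "ACDEFGHIKL"
hits = ["ACDEFGHIKL", "ACDEFAAAAA", "ACAAAAAAAA", "ACD---HIKL"]
msa_h, msa_m, msa_l = split_msa_by_identity(query, hits)
```

In this toy case the four hits have identities 1.0, 0.5, 0.2, and 1.0, so two land in MSA_h and one each in MSA_m and MSA_l.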

1022 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 4, JULY/AUGUST 2011

TABLE 4
Mean Similarity Matrix (Rounded to Integer Numbers), Obtained from the Minimization Procedure on E2 in (4)

Fig. 4. Cluster dendrogram obtained from the substitution matrix in Table 4 with the hclust procedure in the R environment. The matrix in Table 4 has been previously converted into a distance matrix by replacing each substitution score with (max value) - (subst. score), where max value is equal to 38. The letters in the dendrogram represent the 20 amino acids. The bar height on the left indicates the degree of similarity (the lower the height, the higher the similarity).
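The conversion described in the caption of Fig. 4 can be illustrated with a small sketch. It is written in plain Python rather than R, so it is not the hclust call used for the figure: the function names are hypothetical, the toy scores are made up, and complete linkage is assumed because it is the default method of hclust in R.

```python
def similarity_to_distance(sim, labels):
    """Convert similarity scores to distances as (max value) - (score)."""
    max_value = max(max(row) for row in sim)
    n = len(labels)
    return {
        (labels[a], labels[b]): max_value - sim[a][b]
        for a in range(n)
        for b in range(n)
        if a != b
    }


def complete_linkage(dist, labels):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, height)."""
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        candidates = []
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                height = max(dist[(p, q)] for p in clusters[x] for q in clusters[y])
                candidates.append((height, x, y))
        height, x, y = min(candidates)
        merges.append((clusters[x], clusters[y], height))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges


# Made-up symmetric scores for four residues (not the values of Table 4).
labels = ["I", "L", "D", "E"]
sim = [
    [10, 8, 1, 0],
    [8, 10, 0, 1],
    [1, 0, 10, 7],
    [0, 1, 7, 10],
]
merges = complete_linkage(similarity_to_distance(sim, labels), labels)
```

With these toy scores, I and L are merged first and D and E second, which mirrors the kind of grouping reported for the dendrogram; the merge heights play the role of the bar heights in the figure.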

Starting from the previously selected six matrices, we optimized the E2 function by using the three different types of multiple sequence alignments. In Fig. 5, we report the final correlations among the different minimized matrices and with respect to our optimized matrix, MEANSM, of Table 4. We found that even in these extreme conditions the minimized matrices are highly correlated, indicating that the optimization of the matrix as a function of the protein contacts is independent of the degree of sequence similarity in the multiple alignments. This finding does not mean that the prediction accuracy is not affected. In Tables 5, 6, and 7, we report the contact prediction performance when the different matrices (including the original MCLACHLAN and our optimized matrix, MEANSM, of Table 4) and the different artificial alignment subsets are tested. From Tables 5, 6, and 7, it is clear that, in agreement with what was previously pointed out [3], [4], the predictive accuracy is strongly affected when unbalanced multiple sequence alignments are used to predict the residue contacts. This further indicates that the alignment is the pivotal gauge modulating the prediction accuracy, while the choice of the scoring matrix is less influential.
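As an illustration of how two optimized matrices can be compared, the sketch below computes the Pearson correlation coefficient between the entries of two similarity matrices. The function name is hypothetical, and correlating all entries (rather than, say, only the upper triangle of a symmetric matrix) is an assumption; the text does not specify which entries were used for Figs. 5, 6, and 7.

```python
from math import sqrt


def matrix_correlation(a, b):
    """Pearson correlation between two equally shaped matrices, entry by entry."""
    x = [value for row in a for value in row]
    y = [value for row in b for value in row]
    if len(x) != len(y):
        raise ValueError("matrices must have the same shape")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    spread_x = sqrt(sum((p - mean_x) ** 2 for p in x))
    spread_y = sqrt(sum((q - mean_y) ** 2 for q in y))
    return covariance / (spread_x * spread_y)


# Made-up matrices: `scaled` is an affine transformation of `base`,
# so the two are perfectly correlated.
base = [[4, -1, 0], [-1, 5, 2], [0, 2, 6]]
scaled = [[2 * v + 3 for v in row] for row in base]
correlation = matrix_correlation(base, scaled)
```

Because the Pearson coefficient is invariant under positive affine rescaling, two matrices can be highly correlated even when their absolute scores differ, which is why correlation rather than entry-wise distance is a natural way to compare substitution matrices.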


Fig. 5. Pearson correlation coefficients between the optimized similarity matrices computed with the minimization procedure of Section 3.1 on error function E2 and with respect to the three filtered MSA sets (MSA_h, MSA_m, and MSA_l) described in Section 4.3. Darker colors correspond to lower level of correlation. The correlation values range between 0.88 and 0.99.

TABLE 5
Accuracies (Percent) of Contact Prediction (with Correlated Mutations) for Sequence Separation ≥ 24 on the Set MSA_h (Sequence Identity ≥ 60%)

The predictions have been computed by using the substitution matrices obtained with the minimization procedure described in Section 3.1 on error function E2.

5 CONCLUSIONS

Correlated mutations are indicative of residue contacts in proteins. The Pearson correlation coefficient or some of its variants are routinely used to compute correlated mutations from protein sequence alignments. Standard implementations exploit the MCLACHLAN similarity matrix to provide a measure of residue substitutions. So far, the choice of the MCLACHLAN matrix has not been justified in the literature, and there has been no evidence of its optimality with respect to the contact prediction problem. In this paper, we show that other substitution matrices achieve similar performance in contact prediction, including BLOSUM62 and the IDENTITY matrix. The fact that the IDENTITY matrix achieves performance comparable to MCLACHLAN indicates that residue pair conservation in multiple alignments plays an important role in the determination of residue contacts by observing correlated mutations.

Furthermore, we cast the problem of computing the optimal similarity matrix that maximizes the accuracy of


TABLE 6
Accuracies (Percent) of Contact Prediction (with Correlated Mutations) for Sequence Separation ≥ 24 on the Set MSA_m (Sequence Identity < 60% and ≥ 30%)

The predictions have been computed by using the substitution matrices obtained with the minimization procedure described in Section 3.1 on error function E2.

TABLE 7
Accuracies (Percent) of Contact Prediction (with Correlated Mutations) for Sequence Separation ≥ 24 on the Set MSA_l (Sequence Identity < 30%)

The predictions have been computed by using the substitution matrices obtained with the minimization procedure described in Section 3.1 on error function E2.

TABLE 8
RANDOM1 Similarity Matrix

TABLE 9
RANDOM2 Similarity Matrix

contact prediction with correlated mutations. We defined the problem of computing such an optimal similarity matrix as a minimization problem for the error functions E1, (3), and E2, (4). In the first case (Section 4.1), the error function E1 takes into account both positive and negative examples (i.e., pairs of residues in contact and not in contact, respectively). In the second case (Section 4.2), the error function E2 takes into account only positive examples. The surfaces of the two error functions are quite different. In particular, our experimental results show that E1 has a huge number of equivalent minima, which correspond to well-correlated similarity matrices that have more or less the same performance in terms of accuracy of contact prediction (slightly better than MCLACHLAN). Surprisingly, the error function E2 has a unique minimum, which corresponds to a similarity matrix that, also in this case, performs slightly better than MCLACHLAN (less than 1 percent of improvement). In this paper, we also show (Section 4.3) that this result is independent of the degree of similarity in the multiple sequence alignment used to minimize the function E2. On the contrary, in agreement with what was previously observed [3], [4], [19], the quality of the multiple sequence alignment strongly affects the predictive performance of the correlated mutations.


Fig. 6. Pearson correlation coefficients between the optimized similarity matrices computed with the minimization procedure of Section 3.1 on error function E1. Darker colors correspond to lower level of correlation. The correlation values range between 0.67 and 0.99.

In conclusion, our results show that the space of similarity matrices for correlated mutations has a huge number of optimal matrices in terms of accuracy of contact prediction. With our optimization procedure, an upper limiting value of about 16 percent accuracy is found when predicting long-range residue contacts (at sequence separation ≥ 24). The value is independent of the optimized similarity matrix used to compute correlation coefficients and represents the state-of-the-art performance at sequence separation ≥ 24. This finding suggests that a possible improvement in the accuracy of contact prediction may be obtained by exploiting higher order correlation functions between coevolving pairs of residues.

APPENDIX

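Recall that the Pearson correlation coefficient between two vectors $v^i, v^j$ is defined by

$$C_{ij} = \frac{1}{N} \sum_{k=1}^{N} \frac{\left(v_k^i - \bar{v}^i\right)\left(v_k^j - \bar{v}^j\right)}{\sigma_i \sigma_j},$$

where

$$\bar{v}^i = \frac{1}{N} \sum_{k=1}^{N} v_k^i$$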


Fig. 7. Pearson correlation coefficients between the optimized similarity matrices computed with the minimization procedure of Section 3.1 on error function E2. Darker colors correspond to lower level of correlation. The correlation values range between 0.97 and 0.99.

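is the mean of $v^i$ and

$$\sigma_i = \sqrt{\frac{\sum_{k=1}^{N} \left(v_k^i - \bar{v}^i\right)^2}{N}} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left(v_k^i\right)^2 - \left(\bar{v}^i\right)^2}$$

is the standard deviation of $v^i$. For simplicity, denote

$$C_{ij} = \frac{c_{ij}}{\sigma_i \sigma_j},$$

where

$$c_{ij} = \frac{1}{N} \sum_{k=1}^{N} \left(v_k^i - \bar{v}^i\right)\left(v_k^j - \bar{v}^j\right) = \frac{1}{N} \sum_{k=1}^{N} v_k^i v_k^j - \bar{v}^i \bar{v}^j.$$

The partial derivative of $C_{ij}$ with respect to the substitution $s = S_{lm}$ is

$$\frac{\partial C_{ij}}{\partial s} = \frac{1}{\sigma_i \sigma_j} \frac{\partial c_{ij}}{\partial s} + \frac{c_{ij}}{\sigma_j} \frac{\partial \sigma_i^{-1}}{\partial s} + \frac{c_{ij}}{\sigma_i} \frac{\partial \sigma_j^{-1}}{\partial s}.$$

We have

$$\begin{aligned}
\frac{\partial c_{ij}}{\partial s} &= \frac{1}{N} \sum_{k=1}^{N} \left[\delta\left(s, v_k^i\right) v_k^j + \delta\left(s, v_k^j\right) v_k^i\right] - \frac{1}{N} \sum_{k=1}^{N} \left[\delta\left(s, v_k^i\right) \bar{v}^j + \delta\left(s, v_k^j\right) \bar{v}^i\right] \\
&= \frac{1}{N} \sum_{k=1}^{N} \left[\delta\left(s, v_k^i\right)\left(v_k^j - \bar{v}^j\right) + \delta\left(s, v_k^j\right)\left(v_k^i - \bar{v}^i\right)\right],
\end{aligned}$$

$$\begin{aligned}
\frac{\partial \sigma_i^{-1}}{\partial s} &= -\frac{1}{2\sigma_i^3} \left[\frac{2}{N} \sum_{k=1}^{N} v_k^i\, \delta\left(s, v_k^i\right) - \frac{2}{N}\, \bar{v}^i \sum_{k=1}^{N} \delta\left(s, v_k^i\right)\right] \\
&= -\frac{1}{N\sigma_i^3} \sum_{k=1}^{N} \delta\left(s, v_k^i\right)\left(v_k^i - \bar{v}^i\right),
\end{aligned}$$

$$\begin{aligned}
\frac{\partial \sigma_j^{-1}}{\partial s} &= -\frac{1}{2\sigma_j^3} \left[\frac{2}{N} \sum_{k=1}^{N} v_k^j\, \delta\left(s, v_k^j\right) - \frac{2}{N}\, \bar{v}^j \sum_{k=1}^{N} \delta\left(s, v_k^j\right)\right] \\
&= -\frac{1}{N\sigma_j^3} \sum_{k=1}^{N} \delta\left(s, v_k^j\right)\left(v_k^j - \bar{v}^j\right),
\end{aligned}$$

where

$$\delta\left(s, v_k^i\right) = \begin{cases} 1, & \text{if } s_k^i \text{ corresponds to subst. } l \to m, \\ 0, & \text{otherwise.} \end{cases}$$

In order to simplify the notation, denote

$$A_1 = \frac{1}{N\sigma_i\sigma_j}, \qquad A_2 = -\frac{c_{ij}}{N(\sigma_i)^3\sigma_j}, \qquad A_3 = -\frac{c_{ij}}{N\sigma_i(\sigma_j)^3},$$

$$\Delta_k^i = \left(v_k^i - \bar{v}^i\right), \qquad \Delta_k^j = \left(v_k^j - \bar{v}^j\right).$$

Putting it all together, we obtain

$$\begin{aligned}
\frac{\partial C_{ij}}{\partial s} &= A_1 \sum_{k=1}^{N} \left[\delta\left(s, v_k^i\right) \Delta_k^j + \delta\left(s, v_k^j\right) \Delta_k^i\right] + A_2 \sum_{k=1}^{N} \delta\left(s, v_k^i\right) \Delta_k^i + A_3 \sum_{k=1}^{N} \delta\left(s, v_k^j\right) \Delta_k^j \\
&= \sum_{k=1}^{N} \delta\left(s, v_k^i\right)\left(A_1 \Delta_k^j + A_2 \Delta_k^i\right) + \sum_{k=1}^{N} \delta\left(s, v_k^j\right)\left(A_1 \Delta_k^i + A_3 \Delta_k^j\right) \\
&= \frac{1}{N\sigma_i\sigma_j} \left[\sum_{k=1}^{N} \delta\left(s, v_k^i\right)\left(\Delta_k^j - \frac{c_{ij}}{(\sigma_i)^2} \Delta_k^i\right) + \sum_{k=1}^{N} \delta\left(s, v_k^j\right)\left(\Delta_k^i - \frac{c_{ij}}{(\sigma_j)^2} \Delta_k^j\right)\right].
\end{aligned}$$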
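The closed form above can be checked numerically. The sketch below is only an illustration under stated assumptions: the substitution matrix is stored as a dictionary keyed by unordered residue pairs, each vector entry is the score of one observed residue pair, the indicator is taken to be 1 when that pair is the substitution l → m, and all names and toy values are hypothetical. It compares the analytic partial derivative of the correlation coefficient with a central finite difference.

```python
from math import sqrt


def key(a, b):
    """Unordered residue pair used to index a symmetric substitution matrix."""
    return tuple(sorted((a, b)))


def scores(pairs, S):
    """Vector of substitution scores, one per observed residue pair."""
    return [S[key(a, b)] for a, b in pairs]


def pearson(vi, vj):
    n = len(vi)
    mi, mj = sum(vi) / n, sum(vj) / n
    cij = sum((x - mi) * (y - mj) for x, y in zip(vi, vj)) / n
    si = sqrt(sum((x - mi) ** 2 for x in vi) / n)
    sj = sqrt(sum((y - mj) ** 2 for y in vj) / n)
    return cij / (si * sj)


def pearson_gradient(pairs_i, pairs_j, S, l, m):
    """Analytic derivative of the correlation with respect to the score of (l, m)."""
    vi, vj = scores(pairs_i, S), scores(pairs_j, S)
    n = len(vi)
    mi, mj = sum(vi) / n, sum(vj) / n
    di = [x - mi for x in vi]            # deviations of v^i from its mean
    dj = [y - mj for y in vj]            # deviations of v^j from its mean
    cij = sum(a * b for a, b in zip(di, dj)) / n
    var_i = sum(a * a for a in di) / n   # sigma_i squared
    var_j = sum(b * b for b in dj) / n   # sigma_j squared
    target = key(l, m)
    total = 0.0
    for k in range(n):
        if key(*pairs_i[k]) == target:   # indicator is 1 for entry k of v^i
            total += dj[k] - cij / var_i * di[k]
        if key(*pairs_j[k]) == target:   # indicator is 1 for entry k of v^j
            total += di[k] - cij / var_j * dj[k]
    return total / (n * sqrt(var_i) * sqrt(var_j))


def numerical_gradient(pairs_i, pairs_j, S, l, m, eps=1e-6):
    """Central finite difference of the correlation with respect to the same score."""
    up, down = dict(S), dict(S)
    up[key(l, m)] += eps
    down[key(l, m)] -= eps
    c_up = pearson(scores(pairs_i, up), scores(pairs_j, up))
    c_down = pearson(scores(pairs_i, down), scores(pairs_j, down))
    return (c_up - c_down) / (2 * eps)


# Made-up scores and residue pairs for two alignment positions.
S = {key("A", "A"): 4.0, key("A", "V"): 0.0, key("V", "V"): 4.0,
     key("A", "D"): -2.0, key("D", "D"): 6.0, key("D", "V"): -3.0}
pairs_i = [("A", "A"), ("A", "V"), ("V", "A"), ("D", "V"), ("A", "D")]
pairs_j = [("D", "D"), ("A", "D"), ("V", "V"), ("A", "V"), ("D", "A")]
analytic = pearson_gradient(pairs_i, pairs_j, S, "A", "V")
numeric = numerical_gradient(pairs_i, pairs_j, S, "A", "V")
```

The two values should agree up to the finite-difference error; a check of this kind is a convenient safeguard before plugging the derivative into a gradient descent procedure.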

REFERENCES

[1] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, Sept. 1997.

[2] A. Andreeva, D. Howorth, S.E. Brenner, T.J. Hubbard, C. Chothia, and A.G. Murzin, “SCOP Database in 2004: Refinements Integrate Structure and Sequence Family Data,” Nucleic Acids Research, vol. 32, pp. 226-229, Jan. 2004.

[3] H. Ashkenazy, R. Unger, and Y. Kliger, “Optimal Data Collection for Correlated Mutation Analysis,” Proteins, vol. 74, no. 3, pp. 545-555, Feb. 2009.

[4] H. Ashkenazy and Y. Kliger, “Reducing Phylogenetic Bias in Correlated Mutation Analysis,” Protein Eng., Design and Selection, vol. 23, no. 5, pp. 321-326, May 2010.

[5] L. Bartoli, P. Fariselli, and R. Casadio, “The Effect of Backbone on the Small-World Properties of Protein Contact Maps,” Physical Biology, vol. 4, no. 4, pp. L1-5, 2008.

[6] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne, “The Protein Data Bank,” Nucleic Acids Research, vol. 28, no. 1, pp. 235-242, Jan. 2000.

[7] R. Das and D. Baker, “Macromolecular Modeling with Rosetta,” Ann. Rev. of Biochemistry, vol. 77, pp. 363-382, 2008.

[8] M.O. Dayhoff, R.M. Schwartz, and B.C. Orcutt, “A Model of Evolutionary Change in Proteins,” Atlas of Protein Sequence and Structure, vol. 5, no. 3, pp. 345-352, 1978.

[9] I. Ezkurdia, O. Grana, J.M. Izarzugaza, and M.L. Tress, “Assessment of Domain Boundary Predictions and the Prediction of Intramolecular Contacts in CASP8,” Proteins, vol. 77, no. 9, pp. 196-209, 2009.

[10] U. Gobel, C. Sander, R. Schneider, and A. Valencia, “Correlated Mutations and Residue Contacts in Proteins,” Proteins, vol. 18, no. 4, pp. 309-317, Apr. 1994.

[11] O. Grana, V.A. Eyrich, F. Pazos, B. Rost, and A. Valencia, “EVAcon: A Protein Contact Prediction Evaluation Service,” Nucleic Acids Research, vol. 33, pp. 347-351, July 2005.

[12] S. Henikoff and J.G. Henikoff, “Amino Acid Substitution Matrices from Protein Blocks,” Proc. Nat’l Academy of Sciences USA, vol. 89, no. 22, pp. 10915-10919, Nov. 1992.

[13] D.A. Hinds and M. Levitt, “A Lattice Model for Protein Structure Prediction at Low Resolution,” Proc. Nat’l Academy of Sciences USA, vol. 89, no. 5, pp. 2536-2540, Apr. 1992.

[14] D.S. Horner, W. Pirovano, and G. Pesole, “Correlated Substitution Analysis and the Prediction of Amino Acid Structural Contacts,” Briefings in Bioinformatics, vol. 9, no. 1, pp. 46-56, Jan. 2008.

[15] A. Lesk, Introduction to Bioinformatics. Oxford Univ. Press, 2006.


[16] A.D. McLachlan, “Tests for Comparing Related Amino-acid Sequences. Cytochrome c and Cytochrome c 551,” J. Molecular Biology, vol. 61, no. 2, pp. 409-424, Oct. 1971.

[17] L. Mirny and E. Domany, “Protein Fold Recognition and Dynamics in the Space of Contact Maps,” Proteins, vol. 26, no. 4, pp. 391-410, 1996.

[18] O. Olmea and A. Valencia, “Improving Contact Predictions by the Combination of Correlated Mutations and Other Sources of Sequence Information,” Folding and Design, vol. 2, no. 3, pp. 25-32, 1997.

[19] F. Pazos, M. Helmer-Citterich, G. Ausiello, and A. Valencia, “Correlated Mutations Contain Information about Protein-Protein Interaction,” J. Molecular Biology, vol. 25, no. 4, pp. 511-523, Aug. 1997.

[20] D.D. Pollock and W.R. Taylor, “Effectiveness of Correlation Analysis in Identifying Protein Residues Undergoing Correlated Evolution,” Protein Eng., vol. 10, no. 6, pp. 647-657, June 1997.

[21] S.A. Samsonov, J. Teyra, G. Anders, and M.T. Pisabarro, “Analysis of the Impact of Solvent on Contacts Prediction in Proteins,” BMC Structural Biology, vol. 9, article no. 22, Apr. 2009.

[22] J.A. Snyman, Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms. Springer-Verlag, 2005.

[23] B.E. Suzek, H. Huang, P. McGarvey, R. Mazumder, and C.H. Wu, “UniRef: Comprehensive and Non-Redundant UniProt Reference Clusters,” Bioinformatics, vol. 23, no. 10, pp. 1282-1288, May 2007.

[24] M. Vassura, L. Margara, P. Di Lena, F. Medri, P. Fariselli, and R. Casadio, “Reconstruction of 3D Structures from Protein Contact Maps,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 5, no. 3, pp. 357-367, July/Sept. 2008.

Pietro Di Lena received the Laurea and PhD degrees in computer science from the University of Bologna, Italy, in 2003 and 2007, respectively. He is a research assistant of computer science at the University of Bologna. His research interests include combinatorial optimization, computational complexity, cellular automata, and bioinformatics.

Piero Fariselli received the PhD degree in biophysics and the Laurea degree in physics. He is a permanent researcher in the Biocomputing Group, University of Bologna. His main research interests include computational biology and machine learning. He is the author of more than 100 publications.

Luciano Margara received the Laurea and PhD degrees in computer science from the University of Pisa in 1991 and 1995, respectively. Since 1995, he has been with the University of Bologna, Italy, where he was a research associate from 1995 to 2000, was an associate professor from 2000 to 2005, and is currently a full professor of computer science. He has been a visiting scientist at the International Computer Science Institute, Berkeley, and a visiting professor in the Department of Computer Science, Cornell University. His research interests include discrete time dynamical systems, optical networks, computational complexity, and, recently, bioinformatics.

Marco Vassura received the Laurea and PhD degrees in computer science from the University of Bologna, Italy, in 2001 and 2005, respectively. He is a research assistant of computer science at the University of Bologna. His research interests include combinatorial optimization, optical networks, and, recently, bioinformatics.

Rita Casadio is a full professor of biochemistry/bioinformatics at the University of Bologna, Italy, and is the group leader of the Biocomputing Group (www.biocomp.unibo.it). Her research interests include bioinformatics, computational biology, machine learning, and molecular theoretical biophysics. She is the author of more than 300 publications. She is president of the Bologna International Master degree in Bioinformatics.

