Viral Protein Structure Predictions - Consensus Strategy

New strategy for the viral protein structure predictions: “Consensus model” approach to take advantage of sequence diversity

Introduction

Viral surface proteins are good targets for vaccine development. They are the major targets of neutralizing antibodies against viruses, but eliciting broadly reactive neutralizing antibodies (brNAbs) has proved difficult. Structural information on these surface proteins is extremely important for the rational design of immunogens, because it identifies conserved surface residues. Drug targets such as viral proteases and reverse transcriptases have been well studied by crystallography and NMR in order to design effective drugs, but surface proteins have not been studied as thoroughly. Some well-studied surface proteins, such as HIV gp120 and influenza hemagglutinin, show significant sequence diversity and have been shown to be challenging targets for the elicitation of brNAbs. Peptide-based immunogens and epitope-scaffold approaches that use known brNAb epitopes have so far not succeeded in eliciting brNAbs against HIV-1 and influenza [1,2], and a minimal-sized rigid protein containing a conserved sequence region in its native structural context would be necessary [3]. To design such an immunogen rationally, structural information on a region of the protein that can fold independently is crucial. Because too few structures have been solved, we must rely on computational structure prediction for these proteins.

Structure prediction of viral proteins is not an easy task. This class of proteins does not share high sequence similarity with known cellular proteins [4,5,6], and solved structures of viral proteins are sparse compared with other protein families. They are therefore poor targets for comparative modeling. In addition, sequence diversity within the same protein is extremely high [7,8] because of the high mutation rates that support viral immune evasion, which makes the protein a moving target for structure prediction (the choice of sequence may affect the outcome).
Some proteins show sequence identities below the “twilight zone” [9] among copies of the same protein from different strains of the same species. In these cases it is not even clear on what principle a target sequence should be chosen for prediction, since even a consensus sequence is just another variant. These proteins are therefore also difficult targets for de novo prediction. In addition to the extensive diversity within a single protein, the sequence space of viral proteins is vast yet overlaps little with “cellular” (prokaryotic/archaeal/eukaryotic) sequence spaces [5,6,10]. Predicting their structures with knowledge-based algorithms [11,12], including both comparative modeling and fragment-assembly-based free modeling, therefore appears difficult. Because completely physics-based modeling is still not feasible or practical [13], we need to use currently available algorithms [14,15] to predict the structures of viral proteins. Although the overall sequence spaces of cellular and viral proteins may overlap little, apart from some genes exchanged between the two, a fragment library still contains short “viral” sequences (such as the 9mers of the Rosetta fragment library) to some degree, even when it is assembled from a database composed mostly of cellular proteins. We assume that, because of physicochemical constraints, the same sequence in cellular and viral proteins should adopt a similar ensemble of structures as a fragment. For this reason, physics-based fragment assembly algorithms such as Rosetta should be able to produce decoys with the same fold as the native protein, although probably at low frequency.

The strategy for computational structure prediction of these viral proteins was built on a simple principle, in order to address the difficulty posed by the high diversity and uniqueness of viral sequence space. Despite genetic distance, the same protein must retain its function and therefore the same structure (similar enough to keep its function intact). Although the C-terminal domain of the human papillomavirus E7 protein shares as little as 15% sequence identity among strains (data not shown), all variants are functional proteins from isolates. In principle, genetically distant sequences should therefore share the same fold, even if each sequence generates a considerably different decoy population. By including the most distant sequence pairs from the same protein, so that the structure spaces belonging to the individual sequences overlap as little as possible, the strategy is designed to capture the common fold that is populated by the largest number of sequences.

This fold may be neither the most abundantly populated nor the lowest in energy, but it is common across the sequence space; it is the “consensus model”. Multiple sequence alignment information has been used to identify hydrophobic core-forming residues or to improve secondary/tertiary structure prediction, with reported improvements in the predicted structures [16,17]. The direct use of different sequences of the same protein has not been reported, apparently because sequence diversity is negligible for cellular proteins. Using Rosetta for decoy generation, 5 small viral proteins with solved structures were used as benchmarks to evaluate this strategy. The increase in workstation computing power makes such a “redundant” approach feasible on a manageable time scale with affordable resources. A total of 800,000 decoys (10,000 x 16 x 5) were generated and clustered by their pairwise TM-scores [18] using just two 8-core workstations. For this approach we chose the computationally expensive TM-score rather than RMSD (this is the most time-consuming step of the whole process) in order to capture the overall “fold” [19,20] rather than atomic details. In this report we first show the Rosetta-generated decoys and the relation between TM-score and energy score for the benchmark viral proteins, and then the performance of the “consensus” approach.

Methods

Selection of benchmark proteins

Five small viral proteins with known structures and varying degrees of sequence diversity were chosen to test the “consensus” approach. The selection criteria were the availability of a solved structure, a sequence length below 100 residues, and extremely to mildly diverse sequences. We chose the HIV p24 C-terminal dimerization domain, vpr and vpu cytoplasmic domain, and the HPV E2 DNA-binding and E7 C-terminal domains, which match these criteria. For each protein, 16 sequences were chosen to be the most distant both pairwise and on average. Pairwise sequence identity among the target sequences ranges from 15 to 92%, and the mean pairwise identity within a protein ranges from 33 to 79%. Table 1 summarizes the sequence statistics of the target proteins. Sequences were obtained from the Los Alamos HIV database (http://www.hiv.lanl.gov/) (HIV p24/vpr/vpu) or collected manually from GenBank [21] (HPV E2/E7) and aligned with Clustal W [22] and T-Coffee [23]. The sequences were then trimmed to the domains with structure assessments, and N- or C-terminal regions that are flexible in the solution structures were removed.

Decoy generation, calculation of pairwise TM-scores and clustering

A set of 10,000 decoys was generated for each sequence using Rosetta 2.2 without the full-atom relaxation step. Pairwise TM-scores were then calculated for all decoy combinations within each set, using an in-house program written in C (with the TM-align source code ported to C). Decoys were clustered by their decoy-decoy pairwise TM-scores with an in-house program based on the algorithm in the Cluster 3.0 source code [24,25], using complete linkage clustering with a TM-score threshold of 0.5. The clusters were then examined for the distances among all members of each pair of clusters; if all distances between members were above the threshold, the two clusters were merged. TM-scores against the solved structures (native structures used: p24, 2JYL; vpr, 1M8L; vpu, 1VPU; E2, 1A7G; E7, 2B9D) were calculated for all decoys (the 1VPU PDB file contains 9 models, and all 9 were used), and Rosetta energy scores were extracted from the Rosetta silent files. All calculations were run on two Mac Pro workstations with dual quad-core 2.66 GHz Xeon processors running Mac OS X 10.6, and all programs were compiled with the Intel compiler v11.1 with full optimization flags.
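The clustering step described above can be illustrated with a minimal Python sketch. It is not the in-house C program or the Cluster 3.0 code; it is a naive complete-linkage routine that only shows the merge rule, and it would be far too slow for 10,000 decoys. The function names, the toy similarity matrix and the definition of the cluster center (the member with the highest total similarity to the other members) are assumptions made for illustration.

```python
from itertools import combinations


def complete_linkage_clusters(sim, threshold=0.5):
    """Cluster items from a symmetric similarity matrix by complete linkage.

    sim[i][j] is the similarity (here, a TM-score) between items i and j.
    Merging stops once the best remaining merge falls below `threshold`, so
    every pair inside a returned cluster has similarity >= threshold.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best = None  # (linkage similarity, index a, index b)
        for a, b in combinations(range(len(clusters)), 2):
            # Complete linkage: score a cluster pair by its LEAST similar members.
            link = min(sim[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or link > best[0]:
                best = (link, a, b)
        if best[0] < threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: combinations() guarantees a < b
    return sorted(clusters, key=len, reverse=True)


def cluster_center(cluster, sim):
    """Assumed definition: the member with the highest total similarity to the others."""
    return max(cluster, key=lambda i: sum(sim[i][j] for j in cluster if j != i))


# Toy 5-decoy similarity matrix (made-up values, not data from this study).
toy_sim = [
    [1.0, 0.8, 0.6, 0.2, 0.1],
    [0.8, 1.0, 0.7, 0.3, 0.2],
    [0.6, 0.7, 1.0, 0.2, 0.3],
    [0.2, 0.3, 0.2, 1.0, 0.9],
    [0.1, 0.2, 0.3, 0.9, 1.0],
]
toy_clusters = complete_linkage_clusters(toy_sim, threshold=0.5)
print(toy_clusters)                               # [[0, 1, 2], [3, 4]]
print(cluster_center(toy_clusters[0], toy_sim))   # 1
```

Because a merge is accepted only when the least similar cross-cluster pair is at or above the threshold, the sketch also covers the described post-check in which two clusters are merged if all distances between their members are above the threshold.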

Table 1: Sequence statistics of target proteins

Protein   Length   Sequence ID (%)
                   Mean    Max    Min
p24       70       79.3    92     62
vpr       56       72.4    89     50
vpu       81       57.7    82     41
E2        84       40.1    90     22
E7        44-54    33.3    64     15

Calculation of consensus cluster

The 10 largest clusters for each sequence were taken, and their cluster-center decoys were pooled into a Top 10 decoy set for each protein. Pairwise TM-scores were calculated for the Top 10 decoy set, which was then clustered again with a TM-score threshold of 0.3. The largest cluster of the Top 10 decoy set was assigned as the consensus cluster, and its cluster center was taken as the consensus model. When the cluster with the highest sequence coverage was not the largest cluster, it was still taken as the consensus cluster, so that the choice reflects coverage of sequence space rather than decoy population.
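The selection rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the data layout (tuples of sequence id and decoy id), the function names and the tie-break by cluster size are all hypothetical.

```python
def top_center_set(clusters_by_sequence, n_top=10):
    """Pool the centers of the n_top largest clusters of every sequence.

    clusters_by_sequence maps a sequence id to a list of
    (center_decoy_id, cluster_size) tuples from the first clustering round.
    Returns a list of (sequence_id, center_decoy_id) pairs.
    """
    pooled = []
    for seq, clusters in clusters_by_sequence.items():
        largest = sorted(clusters, key=lambda c: c[1], reverse=True)[:n_top]
        pooled.extend((seq, center) for center, _ in largest)
    return pooled


def pick_consensus_cluster(second_round_clusters):
    """Choose the cluster covering the most distinct sequences.

    Each cluster is a list of (sequence_id, center_decoy_id) pairs produced by
    re-clustering the pooled centers. Ties in sequence coverage are broken by
    cluster size (an assumption; the text does not specify a tie-break).
    """
    def key(cluster):
        coverage = len({seq for seq, _ in cluster})
        return (coverage, len(cluster))
    return max(second_round_clusters, key=key)


# Made-up example: the second cluster is smaller but covers more sequences.
example = [
    [("s1", "d1"), ("s1", "d2"), ("s1", "d3"), ("s2", "d1")],
    [("s1", "d4"), ("s2", "d2"), ("s3", "d1")],
]
print(pick_consensus_cluster(example))
```

Ranking by sequence coverage first mirrors the stated preference for covering a larger sequence space over a larger decoy population.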

Calculation of e-values between viral proteins and their fragment libraries

The Rosetta 9mer fragment libraries for the 5 viral and 3 cellular proteins were parsed and assembled into plain FASTA-formatted files by a script. All viral sequences were also converted into overlapping 9mer fragments in FASTA format. The viral 9mer fragments and the Rosetta fragment library 9mers were then compared with ssearch version 3.5, and e-values were collected for all 16 sequences. Because the total number of fragments varies with sequence length, frequencies were calculated relative to the total number of fragments.

Results

Decoys of 5 viral proteins

The decoys generated for the viral proteins by Rosetta 2.2 did not show a good correlation between Rosetta energy and TM-score. Figure 1 shows plots of Rosetta energy score versus TM-score for the decoys of the 5 viral proteins. The bottom right panel is the same plot for cellular proteins (acyl carrier protein/ACP, ubiquitin and thioredoxin). In many, though not all, cases for such cellular proteins, Rosetta generates a decoy set with a modest to very good correlation between energy score and RMSD/TM-score (low-energy decoys are also structurally similar to the native protein). Thioredoxin shows a very good correlation between low energy and high TM-score (as high as 0.9, RMSD ~1 angstrom), and ubiquitin also performed well. Rosetta did not produce high TM-score decoys for ACP, but the decoys it did produce show a good correlation between energy and TM-score. The 5 viral proteins selected for this study, in contrast, show no such trend. Their Rosetta energy/TM-score distributions differ remarkably, ranging from completely uncorrelated (HPV E2) to negatively correlated (HIV vpu; only the plot for model 1 is shown) or sequence dependent (HPV E7). E7 in particular shows great variation in decoy populations from sequence to sequence, as expected for such remote sequence identities even within the same protein. HPV E2, on the contrary, shows a very uniform population among the 16 sequences although the sequence identities are as low as 22% (mean 40%). All of its Top 10 clusters belong to a single cluster, with an average pairwise distance of 0.500 and an average distance from the cluster center of 0.552. With such poor correlation between energy score and TM-score, choosing a low-energy decoy as the prediction would not be a good strategy for these 5 viral proteins. The overall TM-score distributions are skewed toward low scores compared with non-viral proteins. The majority of decoys have TM-scores below 0.5, which is a rough threshold for the same fold [20]. It is thus difficult to select a decoy with a TM-score above 0.5 (against the native structure) as the final model.

Figure 1: Energy vs. TM-score plots of decoys for the 5 viral proteins. Five panels show the 5 benchmark viral proteins, and the bottom right panel shows the cellular proteins thioredoxin, ubiquitin and ACP (acyl carrier protein). For the viral proteins, decoys were generated for the 16 most distant sequences and are plotted in different colors. Viral proteins show poor to reversed correlation between energy and TM-score: completely uncorrelated for HPV E2, reversed (low energy is structurally distant) for HIV vpu, and sequence dependent (relatively small overlap among different sequences) for HPV E7. Non-viral proteins, on the other hand, show very good (thioredoxin) to decent (ACP) correlation between energy and TM-score.
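The relation between energy and TM-score discussed above can be quantified with a short Python sketch. The text does not say which statistic, if any, was computed, so the Pearson coefficient used here is an assumption, as are the function names and the toy values. Note the sign convention: with energy on one axis and TM-score on the other, a well-behaved decoy set (low energy, high TM-score) gives a coefficient near -1, whereas the reversed behaviour described for HIV vpu would give a positive coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def fraction_at_or_above(tm_scores, cutoff=0.5):
    """Fraction of decoys whose TM-score to the native structure is >= cutoff."""
    return sum(1 for tm in tm_scores if tm >= cutoff) / len(tm_scores)


# Made-up values for four decoys (not data from this study).
energies = [-120.0, -110.0, -100.0, -90.0]
tm_scores = [0.8, 0.7, 0.5, 0.4]
print(pearson_r(energies, tm_scores))       # close to -1: low energy goes with high TM-score
print(fraction_at_or_above(tm_scores))      # 0.75
```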

Clustering of decoys

The decoys of these proteins were clustered with a hierarchical complete linkage algorithm, because each cluster should contain only decoys that are structurally close enough to share a fold if a consensus cluster is to be captured later. A TM-score of 0.5 was therefore set as the threshold for complete linkage clustering. The number of clusters varies greatly among proteins and sequences. At this threshold HPV E2 has far more clusters (mean 276.5; Table 2) than HIV vpr (mean 12.5). The number of clusters for HPV E7 fluctuates greatly from sequence to sequence (30-438), in the same way as its energy/TM-score plots. A similar trend is seen in the coverage of the Top 10 clusters (the sum of the members of the 10 largest clusters divided by the number of all decoys).

The threshold of 0.3 for clustering the Top 10 decoy set was chosen as a value that yields enough cluster members to choose a “consensus” cluster without including too diverse a structural ensemble. The threshold of 0.5 used for the initial clustering produced too many clusters with few members in the Top 10 decoy set, and no “consensus” could be reached because each cluster covered too few of the sequences. To determine the consensus cluster, the TM-score threshold was therefore lowered until clusters had enough members contributed by most of the sequences, while the pairwise distances within the Top 10 decoy set and the average distance from each cluster center stayed not far below 0.5. A TM-score threshold of 0.3 appears low, but because clustering uses complete linkage, even the most distant decoys in a cluster have a TM-score around or above 0.3. For all 5 viral proteins the distances are within the range of the same fold: over all Top 10 decoy sets the average pairwise TM-score is 0.49 (0.46-0.51) and the average TM-score to the cluster center is 0.55 (0.51-0.57). We therefore conclude that these clusters are representative of decoys with the same fold; the cluster with the highest sequence coverage was taken as the consensus cluster and its center as the consensus model.
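The two summary quantities used in this section, the Top 10 cluster coverage and the average pairwise TM-score within a set of decoys, can be written down as a small Python sketch. The function names and the toy numbers are hypothetical and only illustrate the definitions given in the text.

```python
from itertools import combinations


def top_cluster_coverage(cluster_sizes, n_top=10):
    """Members of the n_top largest clusters divided by the number of all decoys."""
    largest = sorted(cluster_sizes, reverse=True)[:n_top]
    return sum(largest) / sum(cluster_sizes)


def mean_pairwise_similarity(members, sim):
    """Average of sim[i][j] over all unordered pairs of the given members."""
    pairs = list(combinations(members, 2))
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


# Made-up cluster sizes: the two largest clusters hold 80 of 100 decoys.
print(top_cluster_coverage([50, 30, 10, 5, 5], n_top=2))   # 0.8

# Made-up 3x3 TM-score matrix for three cluster-center decoys.
toy_sim = [
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.7],
    [0.6, 0.7, 1.0],
]
print(mean_pairwise_similarity([0, 1, 2], toy_sim))
```

A coverage near 1, as for vpr in Table 2, means that almost all decoys fall into the 10 largest clusters; a low coverage, as for E2, means the decoys are spread over many small clusters.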

Consensus model strategy

Table 2: Number of clusters (Clust) and coverage of the Top 10 clusters (cover.) for each sequence

        p24             vpr             vpu             E2              E7
Seq#    Clust  cover.   Clust  cover.   Clust  cover.   Clust  cover.   Clust  cover.
1       65     0.816    9      0.999    97     0.701    224    0.480    432    0.165
2       62     0.800    8      0.999    60     0.803    411    0.372    54     0.893
3       41     0.802    12     0.997    74     0.810    292    0.322    226    0.610
4       50     0.657    67     0.809    93     0.770    320    0.380    57     0.866
5       30     0.924    8      0.999    164    0.495    145    0.700    46     0.921
6       44     0.866    10     0.999    150    0.533    303    0.323    113    0.767
7       32     0.868    8      0.999    103    0.671    345    0.295    56     0.916
8       51     0.584    12     0.994    135    0.547    237    0.614    43     0.923
9       38     0.843    8      0.999    112    0.663    293    0.360    31     0.927
10      43     0.831    6      0.999    123    0.654    267    0.398    318    0.391
11      147    0.307    19     0.976    110    0.702    235    0.503    30     0.956
12      48     0.836    7      0.999    18     0.987    368    0.469    291    0.339
13      47     0.719    8      0.999    77     0.804    255    0.605    64     0.896
14      29     0.945    6      0.999    161    0.456    257    0.514    37     0.969
15      23     0.965    9      0.999    94     0.693    227    0.720    399    0.349
16      29     0.902    3      0.998    63     0.917    245    0.455    438    0.242
Ave     48.7   0.792    12.5   0.985    102.1  0.700    276.5  0.469    164.7  0.696
S.D.    28.8   0.164    14.9   0.047    39.4   0.147    64.3   0.133    157.9  0.294

Figure 2: Aligned predicted models (brown) and solved structures (white). The consensus models of the 5 viral proteins were aligned to their solved structures with TMalign. In the figure, “Aligned” indicates residues within 5 angstroms of the solved structure. The solved HIV vpu structure is an NMR solution structure with 9 models; model 1, which aligned best with the consensus model, is displayed. The other 4 structures are crystal structures.

The centers of the consensus clusters of the 5 viral proteins were taken as consensus models and compared with the solved crystal/NMR structures. HIV p24 and vpr were predicted most successfully among the 5 proteins (Figure 2). These predictions are not spectacularly accurate (TM-scores of 0.56 and 0.55 for p24 and vpr, respectively), but they are good enough to represent the essence of the folds. Because the consensus model approach is designed to capture the correct overall fold rather than to reach atomic accuracy, they are quite successful. In particular, vpr has 51 of its 56 residues aligned within 5 angstroms of the native structure. The p24 prediction is mostly well aligned, but the very C-terminal end points in the wrong direction, which lowers the TM-score. HIV vpu was less successful in terms of TM-score (0.48), but the overall fold/topology is well captured. The prediction for HPV E2 was not great, but not too far off either. For HPV E7 the consensus approach did not work well: as Figure 2 shows, the α-helix connected to the two consecutive β-sheets is positioned on the wrong side. Despite cases such as HPV E7, the overall success of the consensus strategy becomes clear when these results are compared with the performance of conventional strategies for selecting the best model. Except for HPV E7, for which it failed to capture the native fold, the consensus model approach performed modestly to significantly better on the other four viral proteins (Figure 3). HPV E2 shows minimal improvement over the largest cluster center or the lowest-energy decoy, because the decoys generated for this protein do not cover a large structure space. Despite its large sequence space, HPV E2 covers only a very small, relatively similar and degenerate structure space (although with many clusters), so the three methods examined here show no major differences. HPV E7 has too diverse a sequence space (identity as low as 15%, with an average of 33%), and the structure spaces of some sequences appear not to overlap.
For the consensus cluster strategy to work, the most distant sequences need to share some degree of structure space. HPV E7 appears too diverged in both sequence (even the domain length varies from 44 to 54 residues) and structure space. In fact, the matrix representation of the pairwise TM-scores for HPV E7 shows that some pairs of sequences do not share predicted structure space (Figure S1), and these sequences behave as if they were different proteins. The consensus model approach, on the other hand, performs well with a reasonably variable sequence space. It should be noted that even on HPV E7 the consensus strategy did not underperform the other strategies (Figure 3); there was simply no improvement.

Figure 4 shows the distributions of decoy TM-scores against the native structures as histograms. Well-behaved proteins such as p24, vpr and vpu have two major peaks with different populations. The consensus approach captures the populations with “native-like” structures (the “close” peaks at higher TM-score) rather than the structures in the other, “distant” peaks. This observation is reasonable and is what we hoped for: “distant” peaks are likely to be composed of multiple populations with widely varying folds that arise from the diverse sequence space, whereas “native-like” structures are under more stringent constraints to be close to the native structure and are likely to cluster together within the “close” peak. Figure 4 also shows that, even for HPV E2, the consensus model strategy captures one of the clusters closest to the native structure among the generated decoys, although the decoys overall are not close enough to the native structure. For HPV E7, the figure makes it even clearer that some sequences produce only decoys with structures that are too distant (some sequences have an extremely small population above TM-score 0.4), so that a cluster close to the native structure cannot be captured. HPV E7 thus illustrates the limitation of this strategy, which relies on finding a structure common to different sequences.

Figure 3: The performance of different strategies for selecting the best model from the decoy sets of the 5 viral proteins. The data for HIV vpu are averages over the ensemble of 9 NMR structures; the error bars indicate the range of the 9 data points.

Figure 4: Histograms of the TM-score distributions of all decoy sets for the 5 viral proteins. Each protein has 16 decoy sets, whose distributions were calculated and plotted independently. Solid lines indicate the TM-scores of the consensus models, and broken lines indicate a TM-score of 0.5.

Discussion

Predictive performance

The consensus model approach works well for viral proteins, although not every protein can be predicted with a TM-score above 0.5. There are two major limitations, one technical and one of principle. As stated in the last paragraph of the Results section, HPV E7 reveals the fundamental limitation of the approach: when the sequence space diverges too far, current protocols cannot converge on the same structure space. This is not a limitation of the strategy per se, but if the predicted structure spaces of the sequences overlap too little or not at all, the strategy fails. This is related to the technical limitation. Rosetta is one of the best de novo structure prediction tools, yet it cannot produce native-like decoys for the HIV vpu and HPV E2 proteins. It is unclear whether the sampling space is too small, the fragment libraries are biased toward non-viral proteins, or there are other reasons. Limited sampling should not be a severe issue for this approach, because the consensus cluster approach relies on the population ensemble rather than on the rare discovery of a decoy that is both low in energy and structurally close to the native structure. If 10,000 decoys cannot capture a structurally close population, a decoy set orders of magnitude larger is unlikely to capture one either. The similarity calculation of overlapping 9mers from the 16 sequences of each of the 5 viral proteins against the Rosetta fragment library for that protein shows that the frequency of low e-value fragments is significantly lower for viral proteins than for cellular proteins, and that the available fragments are distributed evenly from low to high e-values (Figure 5). The small number of solved viral protein structures, compared with other protein classes, probably results in the small population of fragments with low e-values. Although the availability of low e-value fragments does not necessarily guarantee accurate predictions, the converse is likely to hold: fragment library sequences that are distant from the actual protein are likely to have lower structural correlation with the actual structure. These libraries had to be used for fragment assembly and probably kept the decoy structures away from the solved structures. Because the sequence space of viral proteins is generally unique compared with other protein classes, the small number of structures available for fragment library generation is probably to blame for the poor decoy generation for viral proteins. Taken together, the available sampling space is limited by the lack of solved structures corresponding to viral sequence space.

Figure 5: E-value distribution between fragment library and target sequences. The 9mer sequences from the Rosetta fragment library and from the target cellular and viral sequences were compared using ssearch v3.5. Viral sequences have significantly fewer low e-value matches and a mostly uniform distribution over the whole e-value range. Note that because the sequences are short 9mers, an e-value above 1 does not necessarily mean no correlation; with frequently observed residues, an e-value above 5 can still correspond to a 4-5 residue match. The viral curves are smoother because they have 16x more samples than the cellular proteins.

Another factor related to sequence uniqueness can affect the secondary structure prediction used for Rosetta decoy generation. The secondary structure of HIV vpu was not predicted accurately by any of the three methods supported by Rosetta (PSIPRED, JUFO and SAM), and the resulting model has an extended C-terminus instead of the actual α-helix. Given the uniqueness of viral sequence and structure spaces, it is not surprising that all three secondary structure prediction algorithms fail for some viral proteins. Had the secondary structure been predicted accurately, the result for HIV vpu would probably have been much more accurate and well above a TM-score of 0.5.

We also attempted to improve performance further by combining the “consensus” approach with sampling of low Rosetta energy decoys, but without success. In 2 cases (E2 and E7) filtering the decoys by energy score improved the results, but in the other 3 cases (p24, vpr and vpu) filtering worsened them (Figure S2).
Rosetta generated a population of lower-energy decoys for the HPV proteins that is less distant from the native structure than the overall ensemble. Unfortunately, the opposite is true for the HIV proteins, and filtering by energy score worsened their results. It is puzzling, and not known, why Rosetta energy and TM-score are uncorrelated for viral proteins. Either viral proteins do not have their lowest energy state at the native structure, or the Rosetta energy function does not represent the energy of native viral proteins; in either case, the attempt to improve the results further by taking low-energy decoys failed. Although the full-atom relaxation step was not performed, it is unlikely that the Rosetta energy scores would change dramatically with full-atom relaxation.

The frequently used largest-cluster-center strategy also did not work as well as the consensus strategy. The largest cluster does not necessarily represent the decoy population with the smallest distance from the native structure (7, 6, 5, 1 and 2 cases out of 16 for p24, vpr, vpu, E2 and E7, respectively). It is therefore also difficult to determine the correct fold by selecting the center of the largest cluster, and again there is the problem of choosing among many sequence variants. The consensus strategy, on the other hand, took advantage of sequence diversity (mean sequence identities ranging from 33 to 79%) and successfully captured the cluster structurally closest to the native fold, even when the overall decoy population was not very close to the native structure (Figure 4). The initial assumption that short viral sequence fragments overlap sufficiently with cellular protein sequence space does not appear to hold. It is probable that even 9mer fragments of viral sequences are unique compared with cellular sequences, as indicated by the significantly lower number of low e-value fragments and the abundance of high e-value fragments in the fragment library (Figure 5).

Usefulness

This approach was developed mainly for predicting the structures of small viral proteins for antigen/immunogen design. The strategy captures the same or a close fold (TM-score around or above 0.45) for 4 of the 5 proteins. For this purpose the result is a success: the predicted model can be used to evaluate properties of a designed antigen/immunogen, such as mapping epitopes and immunodominant regions. Atomic-level accuracy is desirable for these purposes but not necessary. In fact, the approach has been used to predict the structure of a small protein-based immunogen for an HIV-1 vaccine and has generated useful information for interpreting the immunological and biochemical data obtained in antigenicity and immunogenicity studies. Many biochemical data can be interpreted with a low-resolution structural model (correct fold) instead of a high-resolution atomic model, for applications such as function or protein family inference [26]. This approach can thus produce models that are useful in many areas of biology, although in principle it cannot produce models of atomic accuracy.

The limitation of the approach is clearly illustrated by the HPV E7 protein. This is a rare case, however: sequence identities among HPV E7 C-terminal domains are as low as 15%, and at this level of sequence diversity the consensus approach breaks down. The difference in sequence length (44 to 54 residues) may also have affected the results. Proteins with such sequence diversity are not common, and the E7 protein has a tandem CXXC motif as a metal coordination site, which was not included as a constraint but significantly limits the possible architectures/topologies. Its C-terminal domain may also serve only as a scaffold (for dimer formation) for the unstructured N-terminal domain, whose LXCXE motif for pRb binding is necessary and sufficient for function. It is therefore possible that the structure of the domain is not strictly conserved because it is not essential for function, and Rosetta’s inability to generate decoys with the same fold from different sequences may reflect actual structural variability.

This strategy can also use phylogenetic information (as sequence diversity) among related species instead of genetic variability within the same species. Proteins with known orthologs in diverse species (with a sequence identity threshold chosen according to sequence diversity) can be used in an identical manner to capture the fold of their protein family. How far the strategy can be extended in this way needs to be tested, since HPV E7 revealed its limitation for sequence spaces that are too diverse (and for potentially dissimilar/diverse structures among orthologs).

Processing time

Decoy generation was not the most time-consuming step in this study. Generating 10,000 decoys took less than a day with a single Rosetta thread, because the computationally expensive full-atom relaxation step was omitted. Because each workstation has 8 cores with hyperthreading, the decoy sets for one protein can be generated in a day (16 Rosetta threads running simultaneously). The most time-consuming step was structural alignment with TMalign. To speed it up, the TMalign Fortran source code was ported to C and optimized for speed. In-house code was written that calls the ported TMalign inside a loop for the pairwise calculations, because running the loop in an interpreter such as a shell script is very slow and the loop overhead takes as long as the score calculations themselves. Using the Intel compiler with the best optimization switches improved the speed, but TMalign is a computationally expensive algorithm and therefore slow. It was chosen despite its speed because it gives a better measure for judging whether two structures share a fold. Because the number of pairwise scores is of order n^2 in the number of decoys, 10,000 decoys require about 50 million TM-score calculations (10,000 x 9,999 / 2). For a single sequence this took about a week on a single thread, so each protein, with its 16 sequences, can be finished in about a week. Clustering was quick (~30 min) compared with the previous two steps, although its high memory requirement (~5 GB per calculation) limits the number of simultaneous calculations on one machine. Overall, a single protein took about a week of calculation time on a single Mac Pro with dual quad-core 2.66 GHz Xeon X5500 processors. On a modern mid-sized cluster the calculation should not take more than a day or two. The approach is “redundant” in terms of decoy generation per protein, but the computational power of current workstations already brings it down to days of calculation on a single workstation, so it can be used routinely without investing in expensive resources.
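The pair count quoted above can be checked with a few lines of Python. The per-alignment time is only a rough back-of-the-envelope figure that takes "about a week on a single thread" literally; it is an illustrative estimate, not a measured value from the study, and the function name is hypothetical.

```python
import math


def pair_count(n_decoys):
    """Number of unordered decoy pairs, n * (n - 1) / 2."""
    return math.comb(n_decoys, 2)


pairs_per_sequence = pair_count(10_000)
print(pairs_per_sequence)            # 49995000, i.e. about 50 million

# 16 sequences per protein, each with its own 10,000-decoy set.
print(16 * pairs_per_sequence)       # 799920000 alignments per protein

# Rough cost per alignment if one sequence takes exactly 7 days on one thread.
week_seconds = 7 * 24 * 3600
print(round(1000 * week_seconds / pairs_per_sequence, 1))   # about 12.1 ms
```

The quadratic growth is the reason the pairwise TM-score step dominates the run time: doubling the number of decoys per sequence would roughly quadruple the number of alignments.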

Figure S1: Similarity maps of the sequences used in the study. The sequences shown are the maximum-distance pairs from the multiple alignments and were used for consensus selection. In the maps, Red > Green > Blue indicates the similarity of sequence pairs and of their decoy cluster centers. Some sequences share very low similarity with the others although they are the same protein from the same species (but different strains). In these cases selecting a consensus is not easy, because the clusters do not share folds across their sequence spaces (cluster centers of populous clusters do not overlap).

Figure S2: The effect of low energy score filtering. Filtering decoys by low energy score does not work very well, as shown in the figure. In some cases it works, and in others it actually worsens the prediction results. For the HPV proteins (E2/E7), high-energy decoys actually occupy more ‘native-like’ structures, so filtering out this population worsens the results. This is a problem of the decoy generation step, and the consensus approach cannot solve it because it relies on Rosetta to generate decoys. The main cause appears to be poor sampling of structures for viral proteins (HPV sequences are very unique and unlikely to have a good representation of sequence/structure mapping in the library). A further question is whether the sequence/structure correlation reflects “biophysical features” or “evolutionary features”, since a sparsely populated sequence space does not necessarily mean a poor match to the library. In this study, HPV and HIV proteins were used as test cases. A diverse sequence space with little overlap between sequences (even within the same protein in the same species) complicates the situation. The sequence/structure matching may be partly an “evolutionary memory” rather than completely “biophysically” determined; more likely, fragments with independent sequence spaces share the same structure space in some cases (HIV appears to, but HPV does not).
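To make the filtering step concrete, here is a minimal sketch of a low energy score filter, assuming it keeps a fixed fraction of the lowest-scoring decoys before clustering. The function name, the (name, energy) data layout, and the 10% default are hypothetical; the threshold actually used is not stated here.

```python
def filter_low_energy(decoys, keep_fraction=0.1):
    """Return the lowest-energy fraction of decoys.

    decoys: list of (name, energy) tuples; lower energy is better.
    keep_fraction: hypothetical cutoff, must be in (0, 1].
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(decoys, key=lambda d: d[1])          # ascending energy
    n_keep = max(1, int(len(ranked) * keep_fraction))    # always keep at least one
    return ranked[:n_keep]


# Toy data: if the native-like decoy ("c") has a high energy score,
# the filter discards it, which is the failure mode described for HPV E2/E7.
toy = [("a", -10.0), ("b", -5.0), ("c", 3.0), ("d", -1.0)]
kept = filter_low_energy(toy, keep_fraction=0.5)
```

A filter of this kind can only help when native-like decoys are also the low-energy ones; it cannot recover structures that the scoring ranks poorly.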

References

1. Ho J, Uger RA, Zwick MB, Luscher MA, Barber BH, et al. (2005) Conformational constraints imposed on a pan-neutralizing HIV-1 antibody epitope result in increased antigenicity but not neutralizing response. Vaccine 23: 1559-1573.
2. Ofek G, Guenaga FJ, Schief WR, Skinner J, Baker D, et al. (2010) Elicitation of structure-specific antibodies by epitope scaffolds. Proc Natl Acad Sci U S A 107: 17880-17887.
3. Penn-Nicholson A, Han DP, Kim SJ, Park H, Ansari R, et al. (2008) Assessment of antibody responses against gp41 in HIV-1-infected patients using soluble gp41 fusion proteins and peptides derived from M group consensus envelope. Virology 372: 442-456.
4. Brussow H (2009) The not so universal tree of life or the place of viruses in the living world. Philos Trans R Soc Lond B Biol Sci 364: 2263-2274.
5. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504-510.
6. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368.
7. McBurney SP, Ross TM (2008) Viral sequence diversity: challenges for AIDS vaccine designs. Expert Rev Vaccines 7: 1405-1417.
8. Palmenberg AC, Rathe JA, Liggett SB (2010) Analysis of the complete genome sequences of human rhinovirus. J Allergy Clin Immunol 125: 1190-1199; quiz 1200-1191.
9. Chung SY, Subbiah S (1996) A structural explanation for the twilight zone of protein sequence homology. Structure 4: 1123-1127.
10. Bamford DH (2003) Do viruses form lineages across different domains of life? Res Microbiol 154: 231-236.
11. Bystroff C, Baker D (1998) Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281: 565-577.
12. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779-815.
13. Ozkan SB, Wu GA, Chodera JD, Dill KA (2007) Protein folding by zipping and assembly. Proc Natl Acad Sci U S A 104: 11987-11992.
14. Das R, Baker D (2008) Macromolecular Modeling with Rosetta. Annu Rev Biochem.
15. Roy A, Kucukural A, Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 5: 725-738.
16. Bonneau R, Strauss CE, Baker D (2001) Improving the performance of Rosetta using multiple sequence alignment information and global measures of hydrophobic core formation. Proteins 43: 1-11.
17. DeBartolo J, Hocky G, Wilde M, Xu J, Freed KF, et al. (2010) Protein structure prediction enhanced with evolutionary diversity: SPEED. Protein Sci 19: 520-534.
18. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57: 702-710.
19. Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33: 2302-2309.
20. Xu J, Zhang Y (2010) How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26: 889-895.
21. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2008) GenBank. Nucleic Acids Res 36: D25-30.
22. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947-2948.
23. Notredame C, Higgins DG, Heringa J (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302: 205-217.
24. de Hoon MJ, Imoto S, Nolan J, Miyano S (2004) Open source clustering software. Bioinformatics 20: 1453-1454.
25. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95: 14863-14868.
26. Zhang Y (2009) Protein structure prediction: when is it useful? Curr Opin Struct Biol 19: 145-155.
