Bridging the Information Gap: Computational Tools for ......NMR, electron cryomicroscopy and modeling, the current structural database, the Protein Data Bank (PDB),1 which has greater

doi:10.1006/jmbi.2001.4633 available online at http://www.idealibrary.com on J. Mol. Biol. (2001) 308, 1033±1044

Bridging the Information Gap: Computational Toolsfor Intermediate Resolution Structure Interpretation

Wen Jiang1{, Matthew L. Baker1{, Steven J. Ludtke2 and Wah Chiu1,2*

1Program in Structural andComputational Biology andMolecular Biophysics2Verna and Marrs McLeanDepartment of Biochemistryand Molecular Biology, BaylorCollege of Medicine, HoustonTX 77030, USA

{Contributed equally to this workAbbreviations used: PDB, ProteinE-mail address of the correspond

[email protected]

0022-2836/01/051033±12 $35.00/0

Due to large sizes and complex nature, few large macromolecular com-plexes have been solved to atomic resolution. This has lead to an under-representation of these structures, which are composed of novel and/orhomologous folds, in the library of known structures and folds. While itis often dif®cult to achieve a high-resolution model for these structures,X-ray crystallography and electron cryomicroscopy are capable of deter-mining structures of large assemblies at low to intermediate resolutions.To aid in the interpretation and analysis of such structures, we havedeveloped two programs: helixhunter and foldhunter. Helixhunter iscapable of reliably identifying helix position, orientation and length usinga ®ve-dimensional cross-correlation search of a three-dimensional densitymap followed by feature extraction. Helixhunter's results can in turn beused to probe a library of secondary structure elements derived from thestructures in the Protein Data Bank (PDB). From this analysis, it is thenpossible to identify potential homologous folds or suggest novel foldsbased on the arrangement of alpha helix elements, resulting in a struc-ture-based recognition of folds containing alpha helices. Foldhunter uses asix-dimensional cross-correlation search allowing a probe structure to be®tted within a region or component of a target structure. The structural®tting therefore provides a quantitative means to further examine thearchitecture and organization of large, complex assemblies. These twomethods have been successfully tested with simulated structures modeledfrom the PDB at resolutions between 6 and 12 AÊ . With the integration ofhelixhunter and foldhunter into sequence and structural informatics tech-niques, we have the potential to deduce or con®rm known or novel foldsin domains or components within large complexes.

# 2001 Academic Press

Keywords: macromolecular assemblies; structure; fold recognition; electroncryomicroscopy; bioinformatics
*Corresponding author
Introduction

Through the efforts of X-ray crystallography,NMR, electron cryomicroscopy and modeling, thecurrent structural database, the Protein Data Bank(PDB),1 which has greater than 14,000 structures,represents only a portion of all existing folds.2,3

Efforts in structural genomics have focused onestablishing a highly representative library ofunique folds, encompassing the vast majority ofknown atomic resolution models. One of theunderrepresented areas in the current structurallibrary is folds derived from large macromolecular

.Data Bank.

ing author:

assemblies. Thus, folds obtained from these assem-blies may provide additional insight to knownfolds, as well as a source for novel structuralarrangement.

Often, the initial structural analysis of a largecomplex is limited to intermediate resolution.4 ± 10

In the absence of atomic models, structuralinterpretation of large complexes at intermediateresolution is a formidable task, although structuralinformation is still present. However, if individualdomains, components and structural elements canbe identi®ed, pieces of the ``jigsaw puzzle'' may beput together to yield a tentative atomic model ofthe entire complex. With this in mind, the task athand becomes the mining of structural and func-tional information of these complexes.

We have developed two computational methodswhich will facilitate the quantitative identi®cation

# 2001 Academic Press

1034 Helixhunter and Foldhunter

of structural features in terms of known folds inthree-dimensional (3-D) density maps at low tointermediate resolutions (Figure 1). This provides ameans to identify structural homologs of com-ponents in large macromolecular complexescontaining alpha helices. The ®rst method,implemented in helixhunter, is used to analyze a3-D map for alpha helix content at intermediateresolutions, 6-10 AÊ , the resolution at which suchsecondary structure elements become discernable.Correlation of secondary structure elements usingalpha helix location and orientation, with knownprotein structures from the PDB, can then be usedto identify similar folds. The second method,implemented in foldhunter, can be used to localizeknown or predicted folds/domains within largermacromolecular assemblies at lower resolutions(<20 AÊ ). This could in turn lead to high-resolutioninterpretation of pieces within the macromolecularpuzzle.

The development of helixhunter and foldhunter isjust the ®rst step in the multi-level structuralinterpretation process. Localization and identi®-cation of structural features provides the necessarytemplate for the integration of these tools withsequence, structure and computation based anal-ysis. Further development and integration couldallow for the generation of structural and/or func-tional models based on low to intermediate resol-ution structures of macromolecules. The end resultwould represent a much more thorough under-standing of macromolecular complexes.

Figure 1. Helixhunter and foldhunter algorithm for iden-tifying structural elements in intermediate resolutionstructure. The general ¯ow of helixhunter and foldhunteralgorithms are shown. The left side represents the basichelixhunter search, while the right-hand side representsthe cross-correlation search of foldhunter. The centralarrow indicates that the cross-correlation method canserve as a pre-processing step in helixhunter. The probestructure in the right side of the diagram can representa prototypical helix in the case of helixhunter or a givensub-structure in foldhunter. Also, in the case of helixhun-ter, a ®ve-dimensional search is done rather than a six-dimensional search in foldhunter.

Results

As seen in the PDB, protein structures vary con-siderably in alpha helical and beta sheet content, aswell as adopting many types of folds and aggrega-tion states. There have been many efforts to classifythese structures, biochemically and structurally,into related groups. One of the most widelyaccepted classi®cations is SCOP, a StructuralClassi®cation of Proteins.11 The majority of knownstructures can be broken-down into four majorSCOP families: all alpha (a), all beta (b), alpha andbeta (a/b), and alpha plus beta (a � b). To compre-hensively sample these classes, we have examinedover two dozen proteins with varying structuralcon®gurations. Here, we present four typical pro-teins from our larger test data set: bacteriorhodop-sin (1C3W, a),12 triose phosphate isomerase (1TIM,a/b),13 tyrosine kinase domain from the insulinreceptor (1IRK, a � b),14 and Bluetongue Virusouter shell coat protein (1BVP, a b upper domainand an a lower domain).15,16 These four proteinsrepresent the major SCOP classes and are used tovalidate the results of helixhunter and foldhunter.

The helixhunter program, which identi®es puta-tive alpha helices, incorporates a multi-step processincluding cross-correlation, density segmentation,segment quanti®cation, helix identi®cation, andexplicit description of the identi®ed helices. The®nal helices are represented as cylinders, eachspeci®ed by six parameters (three for center, twofor orientation, and one for length). This abstracthelix representation makes visualization of heliceseasy, as well as allowing for subsequent spatialfold recognition.

In many instances, the structures of fragments,domains or subunits of a larger complex areknown by X-ray crystallography. The most popu-lar method for ®tting these parts into the complexstructure is still manual ®tting using visualizationtools,17 for which accuracy of the ®tting issubjective. Here we provide a template basedcross-correlation tool called foldhunter to ®t theknown structures objectively employing all sixpossible degrees of freedoms (three rotations andthree translations).

Testing helixhunter

Each of the aforementioned proteins was mod-eled at 8 AÊ resolution, representing intermediateresolution structure determination by electroncryomicroscopy or X-ray crystallography (Figure 2).In all four cases, helixhunter correctly localizes bet-ter than 88 % of the helices. Typically, helicalassignment was within ®ve degrees of the trueorientation and within one turn of the correctlength. This level of accuracy was largely unaf-fected by composition or size of the test model,which ranges from 24 to 116 kDa. Additionally,the misidenti®ed helices were kept to a minimum.Out of the 54 total helices in the four proteins, onlytwo non-helical segments were falsely identi®ed as

Figure 2. Testing helixhunter on model data. The four model structures: 1C3W (a), 1TIM (b), 1IRK (c) and 1BVP (d)are displayed as ribbons for their atomic structure in column 1, and as surface renderings of the simulated 8 AÊ struc-tures, contoured at approximately the molecular mass, in column 2 (purple). Column 3 shows the found helices fromhelixhunter as cylinders (grey). Column 4 shows the superimposition of the found helices (cylinders) onto the ribbondiagram of the known structures. The red triangles show helices not identi®ed by helixhunter, while the yellow tri-angles represent false positives. The number of correctly identi®ed helices versus the total number of helices is listedjust after column 4. In the simplest case, bacteriorhodopsin (1C3W), a seven transmembrane helix protein, helixhunterwas able to identify the location and orientation, within ®ve degrees, of all the helices. Additionally, the length of thehelixhunter assigned helices was in agreement to within one turn of the known helices. In the triose phosphate isomer-ase (1TIM) model, an alpha/beta protein, a very high degree of agreement was seen between the real and predictedhelices using helixhunter. All but one short helix, smaller in length than the prototypical helix in the correlation rou-tine and immediately adjacent to a long helix, was not identi®ed. The insulin receptor tyrosine kinase structure(1IRK), an a � b protein, also showed a good correlation with the helixhunter helices and actual helices, missing onlya short 1.5 turn helix. In the ®nal test model, a trimer of BTV VP7 containing an all-helical lower domain and an allbeta sheet upper domain, 24 of the 27 helices were correctly identi®ed, agreeing in location and length. A total ofthree short helices were missed and a single turn in the beta sheet region was misidenti®ed.

Helixhunter and Foldhunter 1035


a helix, while four helical segments were not ident-i®ed. Typically, these missed or incorrectlyassigned segments were either short helices orturns.

Structure identification

Due to the fact that helixhunter ultimately pro-vides the user with spatial identi®cation of helices,we have an explicit representation of all helicalparameters. Thus, the information encoded by theparameters as well as the relative position andorientation of the helices provides an extremelycoherent descriptor for protein structural features.Using this information, we used the structuralmatching programs, DejaVu18 and COSEC19,20 toidentify homologous structures based on spatialarrangement of secondary structural elements of aprotein using a library based on protein structuresin the PDB.

The assigned helices from the helixhunter resultson the 8 AÊ resolution models were compared to amodi®ed DejaVu library of secondary structureelements generated from the structures in the PDB.Since the Ca positions of the polypeptide are notdetermined in an intermediate resolution structure,we modi®ed the operational de®nition of alphahelices from the original library of secondary struc-ture elements in DejaVu, minimizing the potentialcenter and orientation mismatch as described laterin Materials and Methods. Table 1 shows theassignment of the top three putative homologousstructures from DejaVu to each of the four helixhun-ter results for the model structures. In all cases, thestructural assignment from DejaVu represents thesame structure (same PDB identi®er) or related iso-forms (same protein, different PDB identi®er) to

Table 1. DejaVu output using helixhunter results

ModelIdentifiedstructure

1C3W 1BM1 BacteriorBacteriorhodopsin 1AT9

1BRX Bacte

1TIM 8TIM TrTriose Phosphate 1TCD TriosephospIsomerase 7TIM Triosep

1BVP 1BVPBluetongue Virus 1FIY PhospVP7 Monomer 1AXG A

1IRK 1IRK Insulin rInsulin Receptor 2PHKTyrosine Kinase 1PTK ChDomain 1AQ1 Huma

1PHK1FGI Tyrosine kina1AD5 Sr1CMK Camp-depend

Listed are the results of the DejaVu program with the modi®ed hthe four structures modeled at 8 AÊ resolution (®rst column). The sestructure found by the DejaVu search, respectively. The results were

the original proteins tested in helixhunter. The cen-ter-to-center distance between the correspondingalpha helices of the probe and the putative struc-tural homolog is reported as root mean squared(RMS) deviation. The scores of the top DejaVumatches for the helixhunter results are similar tothose of the scores from matching the secondarystructure elements directly de®ned by the PDB ®le(data not shown). Of further interest, illustrated in1IRK, the top identi®ed structures found by DejaVucorresponded not only to itself but also with a var-iety of other serine/threonine kinases, which allshare a common fold. Similar results were alsoobtained using the COSEC program and helixhun-ter results (data not shown), verifying the accuracyof the helical assignment in helixhunter.

Testing foldhunter

Each of the four model structures and theirrelated sub-structures were modeled at 8 AÊ . Thesesub-structures were centered in a cube of the samesize as the complete model data, rotated and trans-lated from the original position, and then used asprobes against the intact model structure usingfoldhunter. In each of the four models, the corre-sponding sub-structure was correctly localized andoriented with respect to the intact structure withintwo degrees. As with helixhunter, composition andsize of the target and the probe sub-structure didnot affect the accuracy of localization signi®cantly.

1C3W was probed with a segment containingthree alpha helices and two small strands. Thissub-structure was placed back into the intact den-sity at the correct position and orientation. Thesub-structures for 1TIM, two helices and twostrands, and 1IRK, the beta sheet portion, were

Name RMSD (score)

hodopsin in light adapted state 0.87 (0.73)Bacteriorhodopsin 1.02 (0.87)

riorhodopsin/lipid complex 1.66 (1.35)

iosephosphate isomerase 0.91 (0.81)hate isomerase-Trypanosoma cruzi 1.24 (1.10)hosphate isomerase complex 1.59 (1.43)

Bluetongue virus vp7 1.61 (1.37)hoenolpyruvate carboxylase 3.27 (2.95)lcohol dehydrogenase 3.59 (3.18)

eceptor tyrosine kinase domain 0.95 (0.86)Phosphorylase kinase 1.41 (1.21)icken src tyrosine kinase 1.45 (1.27)n cyclin dependant kinase 2 1.63 (1.40)Phosphorylase kinase 1.64 (1.46)se domain (fibroblast growth factor) 1.71 (1.55)c family kinase complex 1.75 (1.56)ent protein kinase catalytic subunit 1.82 (1.58)

elix library and the helices de®ned by helixhunter (Figure 2) forcond and third columns denote the PDB identi®er and name ofordered by RMSD (in AÊ ), which is found in the last column.


also ®tted to the correct location within theirrespective intact structures. The beta sheet regionof the BTV VP7 trimer was also used to probe theintact structure. As with the other three structures,the correct location of the sub-structure was ident-i®ed (Figure 3).

Integrating fold recognition with foldhunter

In the case of large macromolecules, often onlysequence-derived homologous sub-structures areknown. To better represent this type of situation infoldhunter, we tested one of the four model datasets, 1BVP, against several related proteins(Figure 4). A number of putatively homologousfolds with varying degrees of homology to the b

sheet domain of the 1BVP protein were tested todetermine whether foldhunter was capable of cor-rectly localizing these homologs. The VP7 mono-mer contains both a b sheet region in the upperdomain and a helical region in the lower domain.Using the UCLA-DOE Fold-Recognition Server,21

several candidate folds were identi®ed as potentialstructural homologs to the b sheet domain ofthe 1BVP. The best matches were to itself and theAfrican Horse Sickness Virus capsid protein,22

while tumor necrosis factor alpha23 had a border-line score for structure homology.24 Figure 4(c)illustrates foldhunter`s ability to correctly place thecapsid protein from African Horse Sickness Virusinto the domain of the 8 AÊ map of the 1BVP. How-ever, the lesser-valued candidate structure, tumor

Figure 3. Testing foldhunter onmodel data. The four model struc-tures: 1C3W (a), 1TIM (b), 1IRK(c) and 1BVP (d) are displayed asribbons in column 1. Portions fromthese structures, 1C3W (5-99), 1TIM(1-64), 1IRK (981-1078) and 1BVPtrimer (117-261), were extracted(column 1, red) and modeled at8 AÊ resolution (column 2). Resultsof foldhunter using these extractedportions to map on the entire struc-ture are shown in column 3.

Figure 4. Sensitivity of foldhunter with homologousstructures. The sensitivity to which foldhunter can detecta homologs fold is shown using 1BVP (a) against thebeta barrel segment of 1BVP (b), a close homolog(c) (1AHS), and a remote homolog (d) (2TNF) as rep-resented by the ribbon diagrams in column 1, respect-ively. Results of foldhunter are seen in column 2.


necrosis factor alpha (Figure 4(d)), was misplacedin the target structure as seen by visual inspection.Though foldhunter can always yield an answer, thematch has to be judged by both the correlationcoef®cient and also visual matching of the twostructures. This result suggests that a correctly

chosen fold can be placed in the correct region ofthe target structure by foldhunter.

Resolution dependency of helixhunterand foldhunter

The resolution dependency of helixhunter wasassayed using 1C3W, modeled at 6, 8, 10, and 12 AÊ

resolution (data shown at http://ncmi.bcm.tmc.edu/� wjiang/hhfh). Helixhunter was able to veryaccurately identify the helices in both the 6 AÊ and8 AÊ models. In the 10 AÊ model of 1C3W, helix pos-ition and orientation retains high ®delity, but helixlength is not as accurate. At 12 AÊ , similar helixpositioning to that of the 10 AÊ model can be seen,however false helices are assigned to the small betastrand region and a turn. Along with the intactstructure, the sub-structures for 1C3W were alsomodeled at the four aforementioned resolutions totest foldhunter. Assignment of the correct locationand orientation of the sub-structure occurs at allresolutions. Additional testing of foldhunter wasdone using a lower resolution model (20 AÊ ) of bac-teriorhodopsin. While the sub-structure was loca-lized correctly, a smaller angular search step wasrequired.

Discussion

The trend in structural biology is to study largerand larger macromolecular assemblies in differentfunctional states. These assemblies generally con-tain multiple subunits, with a variety of structuralelements. The divide and conquer approach hasbeen used to determine the atomic structure of thecomponents or domains, as well as the molecularblueprints for the entire assembly at a lowerresolution.25 While this approach works for somesystems, it may not universally apply. In somecases, it may be dif®cult to obtain an atomic modelof the components for various biochemical reasons,as would be the case if a component was in a mol-ten globule state while expressed in the absence ofother components.26 Electron cryomicroscopy andX-ray crystallography have offered opportunitiesto study large assemblies as a whole at low tointermediate resolutions. As we approach higherresolution, information regarding the structuralelements becomes discernable. These elements,from secondary structure to entire domains, mustbe assembled in the context of the molecular mapin order to obtain an atomic model. In this study,we have developed a set of computational toolswith which we can mine the data to infer a proteinfold or domain from intermediate resolution struc-tures of a large complex. Such analysis could leadto the structural and functional characterization ofnot only single domains of a large complex, butalso the complex itself.


Alpha helix identification

The most de®nable structural elements in inter-mediate resolution structures are alpha helices, asthey typically have regular, identi®able features.This has allowed for the interpretation of helicesbased on simple visual inspection and assignmentwithin the map.4,5,10 However, experience, bias andinterpretation lead to a fairly subjective, non-quan-titative and possibly con¯icting assignment ofhelices. We have thus implemented helixhunter andtested it on a variety of different structural models,varying in resolution and structural con®guration(Figure 2).

From the results of the model data, helixhunterhas demonstrated the ability to accurately identifyhelix position, orientation and length at intermedi-ate resolutions. As the resolution of the model dataapproaches atomic resolution, the reliability ofhelix assignment increases. At the lower resol-utions, short false helical assignments werenoticed, while only well de®ned, long helices werecorrectly identi®ed in length and location. This res-olution dependency is not unexpected. Measure-ments on the closely packed helices inbacteriorhodopsin reveal minimal helix-helix dis-tance between 10 to 12 AÊ . This is in agreementwith our helical assignments in the resolutiondependency test as helix assignment began todeteriorate in accuracy on the 12 AÊ 1C3W model(http://ncmi.bcm.tmc.edu/�wjiang/hhfh).

The ability to identify the alpha helices is sensi-tive to the thresholding of the density map. Weapply a segmentation procedure using a neighbor-grouping algorithm, which relies on the ability toseparate features into distinct objects. Differencesin density between helices and adjacent features inthe density map can be subtle. To decrease thethreshold sensitivity of the segmentation process,we have adopted a pre-process ®ltering step usingcross-correlation. By correlating the density mapwith a prototypical helix, we up-weight the helicaldensity and thus suppress the non-helical density.Our segmentation process is thus carried out withthe normalized correlation coef®cient map ratherthan the original density map. The improvementresults in considerably less threshold dependency,for which successful segmentation of helical den-sities can be done.

At intermediate resolutions, helices most closelyresemble cylinders with a relatively ®xed radius of�2.5 AÊ and variable length. Thus, by using thebasic character of a cylinder, we can evaluate thecandidate segments for potential helical assign-ment. However, non-equivalency of side-chainsand bends along the helical axis cause deviationfrom the idealized cylinder model. Therefore, across-section about the helical axis could bedescribed as approximately as an ellipse, with asemi-major and a semi-minor axis (see Materialsand Methods). We can thus use the relative valuesof these two axes as criterion during helix identi®-cation, as signi®cant deviation would indicate a

non-cylindrical, and therefore non-helical density.As the backbone of a helix has a fairly constantdiameter at 5 AÊ , the two semi-axis values shouldbe close to 2.5 AÊ . Additionally, helix length canalso be used as a criterion in helix identi®cation byspecifying minimum and/or maximum values forhelix length.

Fold recognition based on alpha helices

In the classi®cation of proteins, families can bede®ned by the spatial arrangement of secondarystructure elements, which in turn determine struc-turally homologous proteins. Thus, in addition tovisual interpretation of structural elements, helix-hunter allows for structure-based fold recognitionwhen used in conjunction with DejaVu andCOSEC. The spatial relationship of helixhunteridenti®ed helices can be compared to a large data-base of known secondary structure elements frompublished PDB ®les. This assumes that the ident-i®ed alpha helices accurately represent nearly com-plete architectural information capable ofdistinguishing different folds. As an example, truehomologous structures for 1IRK were found onlywhen four or more of the longest helices wereused. Four helices, half of the helices in 1IRK, thusrepresent enough unique information regardingthe fold. However, the minimum structural infor-mation suf®cient for fold prediction is likely tovary amongst different protein folds. Thus, by pro-viding a reasonable set of helixhunter generatedhelices as input, we are able to screen all knownprotein structures for similar occurrences of thesehelical elements. This serves two purposes: (1) veri-®cation of our approach, and (2) possibility ofintermediate resolution structure-based recognitionof alpha helical folds. While the ®rst of these twoobjectives is primarily used in testing the modeldata, it is the second that is of primary interestwhen examining real data (M.L.B. et al., unpub-lished). Identi®cation of homologous folds ordomains of proteins may then be done in theabsence of any sequence-derived information.While in some cases a homologous fold may beidenti®ed, the absence of a structural relative maysuggest the presence of a novel fold.

In Table 1, which summarizes the DejaVu resultsfor each of the four test models, we see that theRMS deviation and score are not zero for itself,which would indicate a less than perfect match.While the segmentation step, as mentioned earlier,has been corrected for some of the thresholdingdependency, helix length and orientation may stillbe affected by the threshold. The ends of thehelices are less well de®ned than other regions ofthe helices, making them more sensitive tothreshold effects.

Based on these test data, it appears that theintermediate resolution structure-based fold recog-nition is feasible and reliable. However, while deal-ing with the experimental data, the reliability ofthis method is limited by the accuracy of the map.


Inaccuracies in helix identi®cation and thus foldrecognition are considerably higher in a poorlyde®ned map. Furthermore, while two structuresmay appear visually similar, DejaVu and COSECmay not identify the structures as being strictlyhomologous. The quantitative identi®cation forthese two programs relies on a strict matching ofcorresponding helical elements. Local differencesin the alpha helices may thus cause insuf®cientspatial overlap of the secondary structuralelements. Therefore, remote homologs may not bedetectable by DejaVu and/or COSEC.

Fold identification and localization

In the ideal case, results obtained from helixhun-ter and DejaVu/COSEC will produce a correctlyidenti®ed homologous structure that can bemapped directly to the low or intermediate resol-ution structure. As this fold identi®cation is basedsolely on alpha helices, this method can work witha variety of folds containing a signi®cant amountof helices, such as TIM barrels (1TIM) and proteinkinase-like (1IRK). However, recognition might notalways be possible, as would be the case for beta-sheet rich proteins. Thus, using structural infor-mation about domains or sub-structures, obtainedfrom either sequence-based fold identi®cation orstructural homology searches, the task becomesassigning the correct position and orientation to amore general sub-structure within the largerstructure.24 The visual assignment of sub-structureposition and orientation to a larger complex issomewhat subjective. Current methods of aligningstructures, such as Situs,27 require that the relativelocation of the probe structure is known within thetarget structure. In this regard, the correspondingtarget volume needs to be excised from the largetarget structure. Thus, this type of method relieson prior knowledge of the corresponding regionbetween the probe and target structures. Foldhunterdoes not have this limitation, as a relatively smallportion or sub-structure will be localized in amuch larger complex using the cross-correlation.As demonstrated by the ®tting of homologousstructures, identi®ed by fold recognition, foldhunterhas shown the ability to correctly localize relativelysimilar structures. Thus, foldhunter serves not onlyas a method to localize known folds or domains,but also con®rms if a predicted homologous fold isindeed truly homologous to the intermediate resol-ution structure.

Foldhunter, which performed well at intermediateresolutions, also has the ability to work on muchlower resolution structures. Since it docks asub-structure into a larger structure based on thedistribution of density, the major requirements ofthe target and probe structures are their relativestructural features. Probe and target structureswith relatively well-de®ned features at low resol-utions provide suitable ``landmarks'' for foldhunter,making it possible to localize sub-structures at20 AÊ or lower resolution.28

Building structural and functional models

In intermediate resolution structures, an explicitbackbone trace of the protein is not available,resulting in the inability to correlate the amino acidsequence and key functional residues in the contextof the structure. However, with the informationobtained from the DejaVu and COSEC searches inconjunction with additional computational analysis(homology modeling, secondary structure predic-tion, sequence alignments, etc.), one may conceiva-bly produce directionality for each helix andconnections between the helical elements. Throughfurther developments and integration, the resultinginformation may eventually allow for the assign-ment of sequence to the structural elements, yield-ing a rough backbone trace.

With a backbone model and the correlation ofbiochemically and/or genetically relevant residuesto the structural features, it may be possible tosuggest a putative function of the molecule, pro-viding additional guidance for future biochemicalexperiments. As an example, the helixhunter resultsfor 1IRK, the tyrosine kinase domain from theinsulin receptor, indicate the presence of eighthelices. When submitted to DejaVu, all of the topstructures identi®ed belonged to tyrosine kinasesor serine/threonine kinases (Table 1). If this struc-ture had been unknown or was an unrelatedsequence through convergent evolution, theseidenti®ed structures would have accurately pro-vided both functional and structural information.The related homologous structures, when alignedto the ``unknown'' structure's structural elementswould have also provided a template for furtherstructural modeling and biochemical analysis.Extending this approach beyond a single domainprotein, it may be possible to identify structuraland functional domains of large complexes. Thesedomains, often composed of non-contiguoussequence segments, might otherwise be impossibleto differentiate based on a purely sequence-basedanalysis, as seen in proteins from a large virus(M.L.B. et al., unpublished results).

In large multi-protein complexes, the atomic res-olution models of the individual proteins mayhave been determined. While this provides us withinteresting structural and functional informationfor the individual proteins, less information isgiven about the localization, association, and func-tion within the larger complex. By using foldhunterto accurately localize and orient a protein (or sub-structure) within the complex, we can obtain pre-viously unavailable information regarding thepositioning of the protein within the complex. Thepotential information from the localization is two-fold: (1) positioning of individual proteins within acomplex and (2) relative contacts and associationsof proteins within the complex. Thus, the inte-gration of structure and function with positionalinformation provides a much clearer picture of theoverall mechanism of macromolecule complexfunction.


Here we demonstrate the validity of two pro-grams primarily targeted at the interpretation oflow to intermediate resolution structures, obtainedby electron cryomicroscopy or by X-ray crystallo-graphy at an initial stage of analysis.4 ± 10 It hasbeen demonstrated here that both helixhunter andfoldhunter provide a basic means of bridging andintegrating multiple sources of structural and func-tional information from both known or potentiallynovel folds. Instead of being viewed solely asmethods for interpretation of structure, these pro-grams may be viewed as two nodes in the struc-ture/function prediction network. Not only arethey capable of correlating existing information,but they also provide insight into the function,composition and interaction of sub-structureswithin a larger complex, seen in the 2.4 MDa cal-cium release channel.28 Furthermore, such structureanalysis may prove valuable in structural geno-mics, by quickly screening large assemblies forpotentially unique or homologous structures.

Materials and Methods

Density segmentation

At intermediate resolutions, the densities for helicalregions appear to be higher than other regions of thedensity map, with the highest density located along thecentral axis of the helix. By masking out densities belowan appropriate threshold value, most of the non-helicaldensity in the map can be eliminated. The extracted den-sities can then be segmented into separate objects, con-sisting only of connected voxels. To enhance the helicalregions, while suppressing the other non-helical regions,we have implemented a pre-processing step to re-scalethe density map. This pre-processing step involves thecommonly used template-based cross-correlation tech-nique, by which a centered prototypical helix, with aGaussian-like radial density distribution and width of5 AÊ oriented at different angles, is correlated to the den-sity map. The highest correlation coef®cient among allthe prototypical helix orientations is recorded for eachvoxel. The helical regions would therefore have largercorrelation coef®cients than other regions, resulting in acoef®cient map with greater differences in valuesbetween helical regions and non-helical regions than theoriginal density map. Two variants of the correlationfunction,29 the cross-correlation function (CCF) andmutual correlation function (MCF), were implemented asuser selectable options in the pre-processing step. Fromthis ``enhanced'' map, density segmentation, as describedabove, may result in cleaner identi®cation of helices.

Segment quantification

Critical to segment characterization is the de®nition ofsegment shape and related helical axis, assuming the cri-teria for helix shape have been met. After segmentationof the density map, the attributes (center, mass, volume,second moments tensor, principal axes, length, andwidth) for each segment are quanti®ed (equations (1)-(3)):

Mass:

m �Z

f �~x�d~x �1�

Center:

~X � 1

m

Zf �~x�~xd~x �2�

Second moment tensor:

Mij � 1

m

Zf �~x��xi ÿ �xi��xj ÿ �xj�d~x; �i; j � 0; 1; 2� �3�

Here, ~x is the voxel, f(~x) is the density value and i,j aredimensional indices.

Among all the attributes, the 3 � 3 second momentstensor (equation (3)) is the most valuable attribute, as itdescribes the nine essential shape parameters of the seg-ment density. This attribute is used in computing thenine parameters (two angles representing each of thethree principal axis directions and three de®ning seg-ment dimensions) by eigen-analysis.30,31 The tensormatrix is symmetric and can be diagonalized using theJacobi transformation.31 The resulting three eigen-vectorsthen become direction vectors of the three principal axes.The three eigen-values represent segment dimensions,with segment length represented by twice the largest ofthe three eigen-values. The remaining two eigen-valuesrepresent segment radii about the segment axis.

Helix identification

The relationship among the three eigen-valuesdescribes the shape of the object, which for helices atintermediate resolutions is cylindrical. This thereforerequires the density segment to satisfy conditions basedon cylindrical parameters. The largest eigen-value mustbe signi®cantly larger than the two smaller eigen-values,while the two smaller eigen-values must be similar invalue. Additionally, the two smaller eigen-values shouldbe less than 3 AÊ , representing the alpha helix radius.Eigen-values outside these conditions indicate that thesegment does not correspond to an alpha helix. Inaddition to the eigen-values, minimal segment lengthcan also be included as a criterion for identifying a helix.By specifying a ``short-helix'' length, small non-cylindri-cal densities that still meet the initial eigen-value criteriaare excluded, decreasing the likelihood of false peaks.The default short-helix value was set to 9 AÊ , representingjust under two complete turns.

Representing helices

As stated previously, at intermediate resolution analpha helix can be represented as a cylindrical object.This can now be de®ned by seven parameters: three forthe center position, two angles for the axis direction, onefor the helix length, and a constant diameter of 5 AÊ .Once these parameters are determined, the assignedhelices can then be visualized and subjected to furtherprocessing. Currently, four output formats are sup-ported: Open Inventor, VRML, SSE (DejaVu) and COSECformat. The Open Inventor format output can be dis-played using 3-D visualization tools, such as Iris Explorer(NAG, Downers Grove, Illinois), where the helices areshown as stylized cylinders. This format was the pri-mary method used for visualization, as the cylinders

Figure 5. Comparison of helix representations. Shownin (a) is a ribbon model for 1C3W, bacteriorhodopsin,and its corresponding helices using the original DejaVurepresentation method. In (b), the new central axis rep-resentation of helices is shown, in addition to the same1C3W ribbon model. There are noticeable differences inthe helical axis orientation. Artifacts in the originalDejaVu representation method include tilting and lateraltranslation of the helix center. No such artifacts can beseen in the central axis representation.


(helices) and the density map can be viewed simul-taneously. Apart from helping to visualize helices, theexplicit description of the starting and ending position ofeach helix can be represented in a secondary structureelement ®le. The topological structural informationencoded by the relative positions and orientationsamong the helices make it possible to search for structur-al homologs, even in the absence of an atomic model.This is accomplished by using the ``bones'' search in theDejaVu package, which ignores directionality andsequential order of structural elements. Additionally,representing helices in a vectorized format, we can do asimilar search for structural homologs using COSEC.19,20

Rebuilding helix specific DejaVu library

The operational de®nition of alpha helices differsbetween DejaVu and helixhunter. In helixhunter, the helixis represented by starting and ending positions along thecentral helical axis. In contrast, the original DejaVulibrary uses only the ®rst and last Ca as the starting andending coordinates in the spatial de®nition of the helix.Therefore, when both termini occur on a single side ofthe helix, the positioning of helix could be off byapproximately the radial distance of a helix, 2.5 AÊ . How-ever, when the termini of the helices reside on oppositesides, the orientation of the helices could be subject to afew degrees of error. To minimize this effect, the second-ary structure library was rebuilt using the projections ofthe starting and ending Ca onto a centrally ®tted helicalaxis. Additionally, this new library excludes non-helicalstructures making it more suitable for analysis of helix-hunter results.

De®nitions of helical segments were extracted directlyfrom the PDB headers. No additional parameterizationof helical character was done, contrary to the DejaVumethod, which uses its own criteria for helix de®nition.Fitting of the central helical axis was done using therotational least-squares (rot®t) algorithm.32 The Ca end-points were then mapped to the central axis, yielding anaccurate representation of the helix. The library of SSE®les were then constructed using this technique with arelatively recent version of the PDB (July 1999, releaseno. 89). Shown in Figure 5, is bacteriorhodopsin, 1C3W,represented using the original DejaVu method and thecentral axis method. Helical positioning in comparisonwith the atomic model suggests an increase in general®delity using the central axis description of helices.

Structure fitting

The technique used by foldhunter is also a correlation-based method. The fragment or sub-structure being loca-lized within the reconstructed model is rotated to allpossible orientations (three degrees of freedom). A 3-Dcross-correlation map is then calculated and the maximalcorrelation coef®cient and the corresponding orientationsare recorded for each voxel. A sorted list of best sol-utions (rotation and translation) is then produced. Thetop solutions, rotational and translational parameters,represent the likely ®t of the probe structure to the targetstructure. Thus, the foldhunter results re¯ect only themost probable solutions relative to the accuracy of theprobe, target map, and angular search sampling. There-fore, it is necessary to visually inspect the resulting ®ttedstructures.

For a 3-D structure alignment, the exhaustive six-dimensional search is computationally demanding, how-

ever, with current computational resources, this processfalls within acceptable time constraints for most searches.For instance, just under three hours was required toprobe a structure of 116 kDa on a single Origin 2000R10000 processor (195 MhZ). Foldhunter and helixhunterhave been parallelized for use on shared-memory super-computers. Thus, even for large-scale problems, compu-tational time requirements should not be prohibitive.

Generating model data

Density maps used for testing were generated usingpdb2mrc, a program included in the EMAN suite.33 Foreach atom in the PDB model, a Gaussian, nee

ÿ rk� �2 is gen-

erated in the corresponding position in the density map,where ne is the atomic number and r is the distance fromthe center of the atom. The Gaussian width, k, is constantfor all atoms to approximately simulate a 3-D structureat moderate resolution. For an 8 AÊ resolution 3-D struc-ture, k is equal to 8=

��2p

. This corresponds to startingwith an atomic resolution model and applying a Gaus-sian Fourier ®lter with a half-width of 1/8 AÊ ÿ1. The res-olution is suf®ciently low that the atomic form factor isnegligible, and thus excluded. While this is not a perfecttest of how the algorithm will operate on real data, it is areasonable simulation for evaluation of the helixhunter/foldhunter algorithms.

Four representative proteins, comprising four distinctsuper-families in the SCOP classi®cation, were down-loaded from the PDB. These structures were then con-verted into cubic electron density maps at 8 AÊ resolutionusing pdb2mrc. All structures were generated using asampling of 1 AÊ 3 per voxel and centered in a 643 map,except for 1BVP, which was centered in a 1283 map.Additionally, individual sub-structures were extractedfrom the four test PDB sequences and generated underthe previously mentioned conditions. Thus, a standardtest set of four test models and four sub-structures weregenerated for use in helixhunter and foldhunter.


Two additional data sets were generated, one for test-ing the resolution dependence of helixhunter and foldhun-ter, the other for testing the robustness of foldhunter. Thebacteriorhodopsin structure (1C3W) and its correspond-ing sub-structure were generated at 6, 8, 10, and 12 AÊ

resolution, representing current intermediate resolutionrange of electron cryo-microscopy. The third data setconsisted of the Bluetongue Virus VP7 timer (1BVP), thebeta-sheet trimer region from BTV, a trimer from theAfrican Horse Sickness Virus major capsid protein(1AHS), and tumor necrosis factor alpha (2TNF). Thesestructures were generated at 8 AÊ resolution in a 1283

map as mentioned above.

Testing with model data

For each of the four intact structures from the ®rstdata set and the four intact different resolution structuresin the second data set, two correlation maps, CCF andMCF, were calculated using a prototypical helix 15 AÊ inlength and an angular search stepsize of ®ve degrees inhelixhunter. Both the correlation maps and the originaldensity were then further analyzed for helical contentwith helixhunter`s feature extraction routine, with a mini-mum helical length of 7 AÊ . The ``percen'' option wasused to de®ne the appropriate threshold for helix identi-®cation. Additionally, the ``sse'' option was enabled,generating a list of helices in the DejaVu SSE ®le for-mat.18

The generated SSE ®les from the four MCF ®lteredhelixhunter results were then used as input into DejaVu,in conjunction with the new helix library in order toidentify potential structural homology. DejaVu was runwith the ``bones'' search option using the default valuesfor the most part, however since no directionality of thehelices is assigned in helixhunter, no directionality wasspeci®ed in the DejaVu options.

Each of the four intact structures was correlated withthe corresponding sub-structures as probes in foldhunter.An initial search was done using a stepsize of 30degrees, which was then systematically re®ned using the``smart'' option. The ``smart'' option allows for a sys-tematic re®nement of fold orientation using increasingly®ner angular searches. For the second data set, the var-ious resolution 1C3W structures were probed against thecorresponding sub-structure using the aforementionedparameters. The ®nal data set, under the sameparameters, used the beta sheet trimer of VP7, 1AHS,and 2TNF to probe the intact BTV VP7 trimer in foldhun-ter for the highest correlation ®tting. Additional data sets(not shown, available at http://ncmi.bcm.tmc.edu/� wjiang/hhfh) were analyzed, including structuresacross SCOP folds, superfamilies, families, proteins andspecies.

Visualization

While default output of the helices is done in theOpen Inventor format, which can be viewed in the IRISExplorer package, we have provided an additional out-put in the form of the popular 3-D virtual reality markuplanguage (VRML). To visualize an atomic structuremodel, the PDB ®les were ®rst rendered in Ribbons34 andsubsequently exported as Open Inventor ®les. The orig-inal density, correlation maps, the ribbon diagram andthe foldhunter/helixhunter results were then visualized inIRIS Explorer on SGI Onyx dual-processor system.

Software availability

We have integrated both helixhunter and foldhunterinto the newly published single particle image proces-sing package, EMAN, which contains additional ®lehandling features. The EMAN package33 is freely avail-able at http://ncmi.bcm.tmc.edu/�stevel/EMAN/doc.

Acknowledgments

We thank Jon A. Christopher for providing the sourcecode for the helical axis ®tting and Michael F. Schmidfor his comments. This research has been supported bygrants from National Institutes of Health (P41RR02250,R01AI43656, R01AI38469), Robert Welch Foundation andthe National Partnership for Advanced ComputationalInfrastructure (NPACI), through the support of theNational Science Foundation. M.L.B. was supported inpart by BRASS and the W.M. Keck Center for Compu-tational Biology through a training grant from theNational Library of Medicine (2T15LM07093).

References

1. Sussman, J. L., Lin, D., Jiang, J., Manning, N. O.,Prilusky, J., Ritter, O. & Abola, E. E. (1998). ProteinData Bank (PDB): database of three-dimensionalstructural information of biological macromolecules.Acta Crystallog. sect. D, 54, 1078-1084.

2. Holm, L. & Sander, C. (1996). Mapping the proteinuniverse. Science, 273, 595-603.

3. Burley, S. K. (2000). An overview of structural geno-mics. Nature Struct. Biol. 7, 932-934.

4. BoÈ ttcher, B., Wynne, S. A. & Crowther, R. A. (1997).Determination of the fold of the core protein ofhepatitis B virus by electron cryomicroscopy. Nature,386, 88-91.

5. Conway, J. F., Cheng, N., Zlotnick, A., Wing®eld,P. T., Stahl, S. J. & Steven, A. C. (1997). Visualizationof a 4-helix bundle in the hepatitis B virus capsid bycryo-electron microscopy. Nature, 386, 91-94.

6. Ban, N., Freeborn, B., Nissen, P., Penczek, P.,Grassucci, R. A., Sweet, R., Frank, J., Moore, P. B. &Steitz, T. A. (1998). A 9 AÊ resolution X-ray crystallo-graphic map of the large ribosomal subunit. Cell, 93,1105-1115.

7. Matadeen, R., Patwardhan, A., Gowen, B., Orlova,B. V., Pape, T., Cuff, M., Mueller, F., Brimacombe,R. & van Heel, M. (1999). The Escherichia coli largeribosomal subunit at 7.5 AÊ resolution. Structure Fold.Des. 7, 1575-1583.

8. Unger, V. M., Kumar, N. M., Gilula, N. B. & Yeager,M. (1999). Three-dimensional structure of a recombi-nant gap junction membrane channel. Science, 283,1176-1180.

9. Mancini, E. J., Clarke, M., Gowen, B. E., Rutten, T.& Fuller, S. D. (2000). Cryo-electron microscopyreveals the functional organization of an envelopedvirus, Semliki Forest virus. Mol. Cell, 5, 255-266.

10. Zhou, Z. H., Dougherty, M., Jakana, J., He, J., Rixon,F. J. & Chiu, W. (2000). Seeing the herpesviruscapsid at 8.5 AÊ . Science, 288, 877-880.

11. Lo Conte, L., Ailey, B., Hubbard, T. J., Brenner, S. E.,Murzin, A. G. & Chothia, C. (2000). SCOP: a struc-tural classi®cation of proteins database. Nucl. AcidsRes. 28, 257-259.


12. Luecke, H., Schobert, B., Richter, H. T., Cartailler,J. P. & Lanyi, J. K. (1999). Structure of bacterio-rhodopsin at 1.55 AÊ resolution. J. Mol. Biol. 291,899-911.

13. Banner, D. W., Bloomer, A., Petsko, G. A., Phillips,D. C. & Wilson, I. A. (1976). Atomic coordinates fortriose phosphate isomerase from chicken muscle.Biochem. Biophys. Res. Commun. 72, 146-155.

14. Hubbard, S. R., Wei, L., Ellis, L. & Hendrickson,W. A. (1994). Crystal structure of the tyrosine kinasedomain of the human insulin receptor. Nature, 372,746-754.

15. Grimes, J. M., Basak, A. K., Roy, P. & Stuart, D. I.(1995). The crystal structure of bluetongue virusVP7. Nature, 373, 167-170.

16. Grimes, J. M., Burroughs, J. N., Gouet, P., Diprose,J. M., Malby, R., Zientara, S., Mertens, P. P. C. &Stuart, D. O. (1998). The atomic structure of thebluetongue virus core. Nature, 395, 470-477.

17. Gabashvili, I. S., Agrawal, R. K., Spahn, C. M. T.,Grassucci, R. A., Svergun, D. I., Frank, J. & Penczek,P. (2000). Solution structure of the E. coli 70 S ribo-some at 11.5 AÊ resolution. Cell, 100, 537-549.

18. Kleywegt, G. J. & Jones, T. A. (1997) Detecting fold-ing motifs and similarities in protein structures. InMethods in Enzymol. (Carter, C. W., Jr & Sweet,R. M., eds), vol. 277, pp. 525-545, Academic Press,London, England.

19. Mizuguchi, K. & Go, N. (1995). Comparison ofspatial arrangements of secondary structuralelements in proteins. Protein Eng. 8, 353-362.

20. Kinoshita, K., Kidera, A. & Go, N. (1999). Diversityof functions of proteins with internal symmetry inspatial arrangement of secondary structuralelements. Protein Sci. 8, 1210-1217.

21. Fischer, D. & Eisenberg, D. (1996). Protein fold rec-ognition using sequence-derived predictions. ProteinSci. 5, 947-955.

22. Basak, A. K., Gouet, P., Grimes, J., Roy, P. & Stuart,D. (1996). Crystal structure of the top domain ofAfrican horse sickness virus VP7: comparisons withbluetongue virus VP7. J. Virol. 70, 3797-3806.

23. Baeyens, K. J., De Bondt, H. L., Raeymaekers, A.,Fiers, W. & De Ranter, C. J. (1999). The structure of

mouse tumour-necrosis factor at 1.4 AÊ resolution:towards modulation of its selectivity and trimeriza-tion. Acta Crystallog. sect. D, 55, 772-778.

24. Lu, G., Zhou, Z. H., Baker, M. L., Jakana, J., Cai, D.,Wei, X., Chen, S., Gu, X. & Chiu, W. (1998). Struc-ture of double-shelled rice dwarf virus. J. Virol. 72,8541-8549.

25. DeRosier, D. J. & Harrison, S. C. (1997). Macromol-ecular assemblages. Sizing things up. Curr. Opin.Struct. Biol. 7, 237-238.

26. Kirkitadze, M. D., Barlow, P. N., Price, N. C., Kelly,S. M., Boutell, C. J., Rixon, F. J. & McClelland, D. A.(1998). The herpes simplex virus triplex protein,VP23, exists as a molten globule. J. Virol. 72, 10066-10072.

27. Wriggers, W., Milligan, R. A. & McCammon, J. A.(1999). Situs: a package for docking crystal struc-tures into low-resolution maps from electronmicroscopy. J. Struct. Biol. 125, 185-195.

28. Baker, M. L., Seryskeva, I. I., Wu, Y., Sencer, S.,Tung, W., Pate, P., Zhang, J. Z., Ludtke, S., Jiang,W., Chiu, W. & Hamilton, S. (2001). Identi®cation ofan N-terminal redox sensor in RYR1. Biophysical J.80, 330a-331a.

29. van Heel, M., Schatz, M. & Orlova, E. V. (1992).Correlation functions revisited. Ultramicroscopy, 46,307-316.

30. Gonzalez, R. C. & Woods, R. C. (1992). Digital ImageProcessing, Addison-Wesley, Reading, Mass.

31. Press, W. H., Teukolsky, S. A., Vetterling, W. T. &Flannery, B. P. (1997). Numerical Recipes in C, Cam-bridge University Press, New York, NY.

32. Christopher, J. A., Swanson, R. & Baldwin, T. O.(1996). Algorithms for ®nding the axis of a helix:fast rotational and parametric least-squares methods.Comput. Chem. 20, 339-345.

33. Ludtke, S. J., Baldwin, P. R. & Chiu, W. (1999).EMAN: semi-automated software for high resolutionsingle particle reconstructions. J. Struct. Biol. 128, 82-97.

34. Carson, M. (1997). Ribbons. Methods Enzymol. 493-505.

Edited by W. Baumeister

(Received 5 December 2000; received in revised form 26 February 2001; accepted 14 March 2001)

Documents

Bridging the Information Gap: Computational Tools for ......NMR, electron cryomicroscopy and modeling, the current structural database, the Protein Data Bank (PDB),1 which has greater