BIOINFORMATICS Vol. 20 no. 12 2004, pages 1822–1835
doi:10.1093/bioinformatics/bth143

A knowledge-based scale for the analysis and prediction of buried and exposed faces of transmembrane domain proteins

Thijs Beuming1 and Harel Weinstein1,2,∗

1Department of Physiology and Biophysics, Mount Sinai School of Medicine, New York, NY 10029, USA and 2Department of Physiology and Biophysics, and Institute for Computational Biomedicine, Weill Medical College of Cornell University, New York, NY 10021, USA

Received on August 20, 2003; revised on January 10, 2004; accepted on January 11, 2004

Advance Access publication February 26, 2004

ABSTRACT

Motivation: The dearth of structural data on α-helical membrane proteins (MPs) has thus far hampered the development of reliable knowledge-based potentials that can be used for automatic prediction of transmembrane (TM) protein structure. While algorithms for identifying TM segments are available, modeling of the TM domains of α-helical MPs involves assembling the segments into a bundle. This requires the correct assignment of the buried and lipid-exposed faces of the TM domains.

Results: A recent increase in the number of crystal structures of α-helical MPs has enabled an analysis of the lipid-exposed surfaces and the interiors of such molecules on the basis of structure, rather than sequence alone. Together with a conservation criterion that is based on previous observations that conserved residues are mostly found in the interior of MPs, the bias of certain residue types to be preferably buried or exposed is proposed as a criterion for predicting the lipid-exposed and interior faces of TMs. Application to known structures demonstrates 80% accuracy of this prediction algorithm.

Availability: The algorithm used for the predictions is implemented in the ProperTM Web server (http://icb.med.cornell.edu/services/propertm/start).

Contact: [email protected]

INTRODUCTION

Integral membrane proteins (MPs) play important roles in signal transduction, the transport of substrates across the membrane, the maintenance of ionic and proton gradients, photosynthesis, light harvesting and other biological processes. It has been estimated that 25–30% of the genes in the genomes of several organisms encode proteins with a transmembrane (TM) domain (Wallin and von Heijne, 1998; Stevens and Arkin, 2000a). Additionally, most of the currently employed therapeutics have MPs as targets [predominantly G-protein coupled receptors (GPCRs) (Dahl et al., 2002)].

∗To whom correspondence should be addressed.

Despite the central role of these proteins in biology and medicine, our understanding of their structure and function is still very limited. This is largely due to problems associated with over-expression, purification and availability in stable forms suitable for X-ray crystallography and electron microscopy (EM) studies. To date, only about 50 structures for (mostly bacterial) MPs have been solved at high resolution (for an overview, see http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html and http://www.mpibp-frankfurt.mpg.de/michel/public/memprotstruct.html), and a limited number of low-resolution structures solved by cryo-EM are available. This contrasts strongly with the total number of structures in the Protein Data Bank (PDB) (Berman et al., 2000), which currently exceeds 20 000.

This difficulty has engendered a large variety of studies probing specific structural aspects of MPs by means other than direct structure determination, using approaches such as the substituted cysteine accessibility method (SCAM) (Karlin and Akabas, 1998), metal-binding site engineering (Norregaard et al., 2000), spin labeling studies (Farrens et al., 1996), cross-linking (Kaback et al., 2001) as well as molecular modeling and computational simulation methods (Visiers et al., 2002). The interpretation of results from these methods in a structural context is facilitated by the relatively simple architecture of MPs. All MP structures solved to date have TM domains that fold as either single α-helices, bundles of α-helices or β-barrels. For recent reviews on MP folding and structure, see Bowie (1997), Garavito and White (1997), White and Wimley (1999), Popot and Engelman (2000), Ubarretxena-Belandia and Engelman (2001), and Liang (2002). The present study is focused exclusively on MPs with α-helical TM domains.

Bioinformatics 20(12) © Oxford University Press 2004; all rights reserved.

The folding of MPs has been described as a two-stage process (Popot and Engelman, 1990). First, TM helices are inserted in the endoplasmic reticulum (ER) membrane, where they form independent stable structures that can be considered domains. Second, these TM helices associate to form the final structure. At this stage, there is no hydrophobic effect to drive protein folding. It has been postulated that in order to achieve stability, the internal packing of MPs is very tight (Eilers et al., 2000) and helix–helix interaction is mainly through van der Waals forces. However, polar interactions in TM domains also have a critical role in maintaining MP stability, as inter-helical hydrogen bonds have been shown to cause dimerization of TM helices in model systems (Choma et al., 2000; Zhou et al., 2001). Residue–residue interactions between helices in MPs have been extensively studied (Adamian and Liang, 2001; Eilers et al., 2002; Adamian et al., 2003) and appear to be much more diverse than those in globular proteins.

After migration to the cellular membrane, MPs are surrounded by a complicated, heterogeneous environment formed by water, charged phospholipid headgroups and a hydrophobic lipid phase. Considerable accuracy (up to 95%) has been achieved in predicting, based on sequence alone, which parts of an MP form either the TM domains or the solvent-exposed regions (Krogh et al., 2001; Bertaccini and Trudell, 2002; Chen et al., 2002). Prediction methods for this step have depended on the development of amino acid propensity scales for residues to be located within the membrane region or in the water-exposed regions of the protein (von Heijne, 1992; Monne et al., 1999). However, successful modeling of MPs involves assembling these predicted TMs into a bundle, representing the tertiary structure of the TM region. As reviewed in detail (Visiers et al., 2002), this requires knowledge not only of the start- and end-points of the TMs, but also of the correct assignment of the TM faces that are exposed to the interior of the protein and of those located on the protein's surface that faces the phospholipid membrane. If the MP is part of an oligomeric assembly, then the situation is further complicated by the requirements of the oligomeric or multi-subunit interface.

It has been shown previously that conserved residues in MPs are located mostly within the interior of the protein (Stevens and Arkin, 2001). However, non-conserved residues do occur within the protein interior, and, although rarely, conserved residues that serve an architectural role can be found on the membrane-facing protein surface [e.g. prolines, glycines and arginines/lysines that interact with the phospholipid headgroups (Ballesteros and Weinstein, 1992; Sansom and Weinstein, 2000)].

Various attempts have been made to determine the propensity of amino acids to face the phospholipid bilayer. An early suggestion about MPs is that they are ‘inside-out’ soluble proteins with hydrophobic exteriors and polar cores (Engelman and Zaccai, 1980; Rees et al., 1989). However, based on the analysis of several MP structures, this paradigm has been challenged recently (Stevens and Arkin, 1999, 2000b; Rees and Eisenberg, 2000; Silverman, 2003). Several groups have calculated lipid-facing propensities without any available structural data. Samatey et al. (1995) used correlation matrices and Fourier transforms on a large set of MP sequences to determine propensities of pairs of residues to lie on the same or on opposite helical faces. They found polar/aromatic residues to lie on one side (presumably the interior), and aliphatic residues to lie on another side (probably facing lipid). Others analyzed the occurrence of residue types in multi- versus single-spanning MPs, and assumed that residues that prefer to be exposed to the lipid are found more frequently in single-span MPs, while amino acids that have a high propensity to be buried in MPs would be over-represented in multi-span MPs (Pilpel et al., 1999). Recently, Ulmschneider and Sansom (2001) analyzed amino acid distributions in 14 α-helical MP structures. They found small differences in the propensity for hydrophobic residues (F, L, I, V) to be located inside versus on the protein's surface, but large preferences for A and G to be located inside. Results on other residue types were not reported, probably due to a limited data set.

With this background, the motivation of the present work included two main goals. First is the need to re-evaluate the residue distribution between the surface and the interior of MPs, because the number of available α-helical MP structures has doubled since the last report (Ulmschneider and Sansom, 2001). Thus, 28 α-helical MPs in the PDB were evaluated in this work. The inside/outside distribution was also determined for specified regions in the TMs (i.e. the intra- and extra-cellular parts and the central regions). The results are compared with those from an alternative method, where we have used the crystal structure of bovine rhodopsin and an alignment of 328 rhodopsin-like GPCRs considered to share a common structure, in order to determine the inside/outside distribution. GPCRs constitute the largest superfamily of MPs, with more than 1000 family members in the human genome.

The second goal was to use the information obtained from the analysis to develop an amino acid property scale that corresponds to the propensity of residues to be located on the TM surface. The ability of this knowledge-based scale to refine predictions based on conservation criteria alone was tested for MPs with known structures, and the combination of conservation criteria and the knowledge-based scale was subsequently employed to predict the residue orientation in the TMs of several MPs for which the structures are known. The prediction method described here was incorporated in a Web-accessible server, named ProperTM.

METHODS AND RESULTS

Statistical analysis of amino acid distributions in MPs

Generation of the database

An MP database was generated containing all TM domains extracted from α-helical MPs with available structures solved to a resolution of <4 Å (see http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html).


Table 1. Membrane protein structures used in generating the database

| Protein | Species | Resolution (Å) | PDB | Unique chains with TM domain | TM |
|---|---|---|---|---|---|
| Bacteriorhodopsin | Halobacterium salinarium | 1.6 | 1AP9 | A | 7 |
| Halorhodopsin | H.salinarium | 1.8 | 1E12 | A | 7 |
| Sensory-rhodopsin+transducer | Natronomonas pharaonis | 1.9 | 1H2S | A, B | 9 |
| Rhodopsin | Bos taurus | 2.8 | 1F88 | A | 7 |
| Photosynthetic reaction center | Rhodopseudomonas viridis | 2.3 | 1PRC | L, M, H | 11 |
| Light harvesting complex | Rhodopseudomonas acidophila | 2.5 | 1KZU | A, B | 2 |
| Light harvesting complex | Rhodospirillum molischianum | 2.4 | 1LGH | A, B | 2 |
| Photosystem I | Synechococcus elongatus | 2.5 | 1JB0 | A, B, F, I, J, L, M, X | 30 |
| KcsA K+ channel | Streptomyces lividans | 3.2 | 1BL8 | A | 2 |
| MscL mechanosensitive channel | Mycobacterium tuberculosis | 3.5 | 1MSL | A | 2 |
| MscS mechanosensitive channel | E.coli | 3.9 | 1MXM | A | 3 |
| ClC chloride channel | Salmonella typhimurium | 3.0 | 1KPL | A | 14 |
| Aquaporin | B.taurus | 2.2 | 1J4N | A | 6 |
| Glycerol channel | E.coli | 2.2 | 1FX8 | A | 6 |
| BtuCD vitamin B12 transporter | E.coli | 3.2 | 1L7V | A | 10 |
| ACRB transporter | E.coli | 3.5 | 1IWG | A | 12 |
| LacY permease | E.coli | 3.6 | 1PV7 | A | 12 |
| Glucose-6-phosphate transporter | E.coli | 3.3 | 1PW4 | A | 12 |
| Calcium ATPase (E1 state) | Oryctolagus cuniculus | 2.6 | 1EUL | A | 10 |
| Fumarate reductase complex | E.coli | 3.3 | 1FUM | C, D | 6 |
| Fumarate reductase complex | Wolinella succinogenes | 2.2 | 1QLA | C | 5 |
| Formate dehydrogenase | E.coli | 1.6 | 1KQF | B, C | 5 |
| Succinate dehydrogenase | E.coli | 2.6 | 1NEK | C, D | 6 |
| Cytochrome c oxidase | B.taurus | 2.8 | 1OCC | A–M | 28 |
| Cytochrome c oxidase | Paracoccus denitrificans | 2.8 | 1AR1 | A | 12 |
| Cytochrome c oxidase | Thermus thermophilus | 2.4 | 1EHK | A–C | 14 |
| Cytochrome bc1 complex | B.taurus | 3.0 | 1BGY | A–K | 13 |
| Glycophorin | H.sapiens | NMR | 1AFO | A | 1 |

The TM domain of glycophorin, which has been solved by nuclear magnetic resonance (NMR) spectroscopy, was also included. To ensure non-redundancy, proteins with >30% sequence identity to other proteins in the database were excluded. Table 1 lists the 28 structures that have been used to generate the database.

TM boundaries were assigned by visual inspection of the structure of each protein. TMs typically start and end with exposed and charged residues (arginine, lysine, histidine, aspartate, glutamate) that may interact with the phospholipid headgroups (in exposed TMs), or with the solvent (in buried TMs). Thus, the region of the MP located within the hydrocarbon core of the membrane was assigned to include the amino acids located in between these terminal residues. For the purpose of the analysis, the terminal residues on either side of the TM were not included in the database, and only residues within the hydrocarbon core were considered.

Frequency of amino acid occurrence in MPs

The average amino acid composition of the TM domains was determined for the 28 proteins in Table 1. As shown in Table 2, hydrophobic residues (A, I, L, V) make up the bulk of the amino acids, accounting for 48.7% of all residues. Charged residues (D, E, H, K, R) constitute only 5.5% of the total. Among these types, the higher occurrence of histidine (2.3%) relative to the other charged residues is largely due to its abundance in Photosystem I. Glycine is also relatively abundant, as are methionine, serine and threonine. Together, small residues (A, C, S, T) form 30.6% of the total. Aromatic residues (F, W, Y) represent 15.8% of the total, with phenylalanine by far the most common. β-Branched residues (T, I, V) form 24.9% of the total. Proline can be considered to be a helix breaker and may have special structural and functional properties (Sansom and Weinstein, 2000). It is underrepresented in TM helices (2.8%). Cysteine and the polar residues glutamine and asparagine are also rarely found in TM domains.
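The composition figures quoted above are simple residue counts converted to percentages. The sketch below illustrates that arithmetic only; the two segment strings are made-up placeholders rather than sequences from the database, and the function name `composition` is introduced here for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(segments):
    """Return the percentage of each of the 20 residue types over all segments."""
    counts = Counter("".join(segments))
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

# Hypothetical TM segments (illustrative only, not taken from Table 1).
tm_segments = ["LLAVIGLLFA", "AVLIWGTLLV"]
comp = composition(tm_segments)

# Group totals such as "hydrophobic (A, I, L, V)" are sums over residue types.
hydrophobic = sum(comp[aa] for aa in "AILV")
```

Group percentages such as those for charged or aromatic residues follow by summing the relevant entries in the same way.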

The overall amino acid composition of MPs deviates significantly from that of the whole genome, as evidenced by the comparison of residues in TM segments with residue distribution in whole genomes [the human and Escherichia coli genomes, shown in Table 2 (Tekaia et al., 2002)]. As expected, hydrophobic residues A, F, G, I, L, M, V and W occur more frequently in MPs than in whole genomes. Conversely, residues C, D, E, K, N, P, Q, R are underrepresented in MPs, while H, S, T and Y have equal distribution in MPs and whole genomes.

Table 2. Average amino acid compositions

| Type | MPs (a): overall | MPs (a): interior | MPs (a): surface | Whole genomes (b): human | Whole genomes (b): E.coli | Globular proteins (c): interior | Globular proteins (c): surface |
|---|---|---|---|---|---|---|---|
| A | 12.1 | 13.8 | 9.9 | 7.0 | 9.5 | 11.0 | 7.9 |
| C | 1.2 | 1.3 | 1.1 | 2.2 | 1.2 | 5.4 | 1.8 |
| D | 0.7 | 0.9 | 0.5 | 4.9 | 5.1 | 2.2 | 7.4 |
| E | 0.9 | 1.1 | 0.6 | 7.0 | 5.7 | 1.0 | 6.2 |
| F | 8.8 | 7.8 | 10.0 | 3.7 | 3.9 | 7.7 | 2.5 |
| G | 9.0 | 12.0 | 5.1 | 6.6 | 7.4 | 9.7 | 8.8 |
| H | 2.3 | 3.3 | 1.1 | 2.5 | 2.3 | 2.4 | 2.2 |
| I | 9.8 | 8.1 | 11.9 | 4.4 | 6.0 | 10.5 | 3.0 |
| K | 0.7 | 0.5 | 0.8 | 5.7 | 4.4 | 0.3 | 8.9 |
| L | 16.7 | 13.0 | 21.4 | 9.8 | 10.6 | 12.8 | 4.3 |
| M | 4.4 | 4.4 | 4.4 | 2.2 | 2.9 | 3.0 | 0.9 |
| N | 1.2 | 1.5 | 1.0 | 3.7 | 4.0 | 2.0 | 6.3 |
| P | 2.8 | 3.2 | 2.3 | 6.1 | 4.4 | 2.2 | 4.7 |
| Q | 0.9 | 1.1 | 0.7 | 4.7 | 4.4 | 1.3 | 4.5 |
| R | 1.1 | 1.2 | 1.0 | 5.6 | 5.5 | 0.4 | 4.0 |
| S | 4.9 | 6.4 | 3.0 | 8.0 | 5.8 | 5.0 | 8.9 |
| T | 5.3 | 6.0 | 4.5 | 5.3 | 5.4 | 4.6 | 7.1 |
| V | 10.6 | 8.9 | 12.7 | 6.1 | 7.0 | 12.7 | 4.6 |
| W | 3.3 | 2.4 | 4.5 | 1.2 | 1.5 | 2.7 | 1.3 |
| Y | 3.4 | 3.2 | 3.6 | 2.8 | 2.8 | 3.3 | 4.8 |

(a) Percentage of the TM domains of 28 MPs, their interior or their surface.
(b) Percentages of human and E.coli genomes (Tekaia et al., 2002).
(c) Percentage of the surface or interior of globular proteins (Miller et al., 1987).

Assignment of residues to the interior or exterior of structures based on surface accessibility criteria

To assess the propensity of residues in TMs for being buried or exposed, the relative solvent accessible surface area (SASA) of the side chain of each residue was calculated with the program Gepol (Silla et al., 1990), using a probe with radius 1.4 Å (H2O) or 2.0 Å (CH2). The choice of CH2 to describe the interaction between phospholipid molecules and the residues on the protein surface is motivated by the exposure of the residues to the hydrocarbon core of the membrane (rather than water).

The relative SASA was obtained by dividing the SASA of the residue calculated in the environment of the protein by a SASA reference value. The reference value was calculated for any residue side chain, X, in a Gly–X–Gly tripeptide in extended conformation (Miller et al., 1987). Residue side chains with a relative SASA >0.10 (with the H2O probe) or >0.07 (with the CH2 probe) were defined as located on the surface of the protein. The results for the H2O probe were also collected for alternative cut-offs of >0.07 and >0.13. However, because the composition of the interior and the surface appears to be relatively insensitive to the choice of probe and the value of the cut-off, only the results for the 0.10 cut-off with an H2O probe are discussed here. In some cases, residues with a high solvent accessible surface are not located on the lipid-facing surface but in water-filled pores in the interior of the protein (for instance, in channels). Removal of these residues from the calculation had a negligible effect on the propensity values for being buried or exposed.
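The surface/interior assignment described above reduces to a ratio and a threshold. The following minimal sketch shows that step under stated assumptions: the SASA values are hypothetical numbers in Å², not Gepol output or the Miller et al. reference values, and the function names are introduced here for illustration only.

```python
# Cut-offs on the relative side-chain SASA, as quoted in the text.
CUTOFFS = {"H2O": 0.10, "CH2": 0.07}

def relative_sasa(sasa_in_protein, sasa_reference):
    """SASA of the side chain in the protein divided by its Gly-X-Gly reference."""
    return sasa_in_protein / sasa_reference

def is_surface(sasa_in_protein, sasa_reference, probe="H2O"):
    """True if the relative SASA exceeds the cut-off for the given probe."""
    return relative_sasa(sasa_in_protein, sasa_reference) > CUTOFFS[probe]

# Hypothetical side chain with a reference SASA of 150 Å^2.
exposed = is_surface(30.0, 150.0)           # relative SASA 0.20, H2O probe
buried = not is_surface(9.0, 150.0, "CH2")  # relative SASA 0.06, CH2 probe
```

Because the cut-off is the only tunable parameter, repeating the classification with 0.07 or 0.13 is a one-line change to `CUTOFFS`.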

Assignment of residues to the interior or exterior of structures based on a multiple sequence alignment (MSA) of GPCRs

The crystal structure of bovine rhodopsin and an MSA of 328 rhodopsin-like GPCRs were used to determine amino acid propensities to face the interior or the lipid. Sequences were selected from an alignment of 1033 rhodopsin-like GPCRs available at the GPCRDB (http://www.gpcr.org) (Horn et al., 1998). All sequences with >25% identity in the TM domains were included in the alignment. Since these 328 sequences consist of more than 58 000 amino acids, this approach has the advantage that the statistics of the results are better than those of the structure-based method, especially for the rare polar and charged residues. On the other hand, this method assumes that all GPCRs in the alignment have the same structure in the TM domains, which might not be the case, and thus could lower the reliability of the data. Amino acid compositions of all positions, the interior positions and the exterior positions were determined from the MSA. The interior and exterior positions in the alignment were determined from their correspondence to the residues in the rhodopsin structure.
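The alignment-based counting can be pictured as follows: each alignment column inherits an interior or exterior label from the corresponding residue in the reference structure, and residue types are then tallied over the columns in each class. The toy alignment and column labels below are hypothetical, and ignoring gap characters is an assumption of this sketch.

```python
from collections import Counter

def column_composition(alignment, columns):
    """Percentage of each residue type over the selected alignment columns.

    Gap characters ('-') are ignored (an assumption of this sketch).
    """
    counts = Counter(
        seq[c] for seq in alignment for c in columns if seq[c] != "-"
    )
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}

# Toy alignment of four sequences and four columns (illustrative only).
alignment = ["LNAF", "LNGF", "INAW", "VN-F"]
interior_columns = [1, 2]  # hypothetical labels taken from a reference structure
exterior_columns = [0, 3]

interior = column_composition(alignment, interior_columns)
exterior = column_composition(alignment, exterior_columns)
```

In this toy case the fully conserved asparagine column dominates the interior composition, which mirrors how a few conserved positions can drive a large apparent interior preference.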

Amino acid composition of interior and surface of MPs

General analysis of all TMs in the database

The residue composition of the interior and the surface of the TMs, determined with an H2O probe and a cut-off of 0.10, is shown in Table 2 and Figure 1. The surface of the protein contains more hydrophobic residues (A, I, L, V) than the interior (55.9% versus 43.8%), but clearly the interior of the protein is still very hydrophobic. Similarly, there is an enrichment of aromatic residues (18.1 versus 13.4%) on the protein surface. The interior of the protein is enriched in small residues (39.5 versus 23.6%), consistent with their proposed role in helix–helix interactions (Senes et al., 2000). β-Branched residues (I, T, V) have been proposed to be suitable for helix–helix interactions due to a limited loss in conformational entropy between free and buried states (Liu et al., 2003). This might be reflected in the decreased propensities for I and V to be exposed, relative to (the non-β-branched) L. However, no such difference is seen between S and T. Finally, the number of charged residues on the TM surfaces is small (4.0 versus 7.2% for the interior), but not negligible.
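The group percentages in this paragraph can be checked by summing the interior and surface columns of Table 2. The sketch below does this for three of the residue groups; the values are copied from Table 2, while the dictionary and function names are introduced here. Totals recomputed from the rounded table entries may differ slightly from the figures quoted in the text.

```python
# Interior and surface composition (%) of TM domains, copied from Table 2.
INTERIOR = {
    "A": 13.8, "C": 1.3, "D": 0.9, "E": 1.1, "F": 7.8, "G": 12.0, "H": 3.3,
    "I": 8.1, "K": 0.5, "L": 13.0, "M": 4.4, "N": 1.5, "P": 3.2, "Q": 1.1,
    "R": 1.2, "S": 6.4, "T": 6.0, "V": 8.9, "W": 2.4, "Y": 3.2,
}
SURFACE = {
    "A": 9.9, "C": 1.1, "D": 0.5, "E": 0.6, "F": 10.0, "G": 5.1, "H": 1.1,
    "I": 11.9, "K": 0.8, "L": 21.4, "M": 4.4, "N": 1.0, "P": 2.3, "Q": 0.7,
    "R": 1.0, "S": 3.0, "T": 4.5, "V": 12.7, "W": 4.5, "Y": 3.6,
}
GROUPS = {"hydrophobic": "AILV", "aromatic": "FWY", "charged": "DEHKR"}

def group_total(table, residues):
    """Sum the percentages of the given residue types, rounded to one decimal."""
    return round(sum(table[aa] for aa in residues), 1)

for name, residues in GROUPS.items():
    print(name, group_total(INTERIOR, residues), group_total(SURFACE, residues))
```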

Fig. 1. Amino acid composition (%) of 28 MPs (white), their interiors (gray) and their surfaces (black).

Because propensities for amino acids to be buried or exposed might vary in different regions of the TM (i.e. at the intra- and extra-cellular boundaries or in the central core), the amino acid compositions of the central part and the terminal parts of the TMs were investigated separately (Table 3). Most of the exposed (lipid facing), charged residues (D, E, H, K, R) that are found in TMs are located in the terminal regions (4.4%) rather than in the central parts (2.7%). Similarly, the exposed terminal parts are very rich in aromatic residues (21.3%), whereas the central parts have fewer exposed aromatics (16.1%). This effect is largely due to differences in distribution of W and Y in the lipid-facing side of the TM. These aromatic residues can form hydrogen bonds, and are often observed in interactions with the phospholipid headgroups [see Sankararamakrishnan and Weinstein (2000) and references therein]. In the central parts of the TM, hydrophobic residues make up >60% of the exposed residues. Notably, the interior preference of glycine in the central region is much more pronounced than for the terminal region, in agreement with the proposed role of glycine in helix–helix interactions. Conversely, interior prolines are rare in the central region, but more common at the ends of the helices, in accordance with the role of proline as helix breaker. The difference in amino acid composition of the surface and the interior between the central and terminal parts of a TM is significant, but more structures will be required to increase the robustness of this conclusion.

Table 3. Region-specific distribution of amino acids (%) in the TM domains of MPs

| Type | Central: interior | Central: lipid facing | Terminal: interior | Terminal: lipid facing |
|---|---|---|---|---|
| A | 13.2 | 10.2 | 15.8 | 9.5 |
| C | 1.7 | 1.1 | 1.2 | 1.0 |
| D | 0.9 | 0.4 | 0.6 | 0.7 |
| E | 0.9 | 0.3 | 1.5 | 0.8 |
| F | 8.3 | 10.6 | 7.0 | 9.8 |
| G | 14.1 | 4.9 | 10.2 | 6.0 |
| H | 2.7 | 1.0 | 2.4 | 0.9 |
| I | 8.1 | 13.4 | 8.6 | 9.5 |
| K | 0.5 | 0.5 | 0.7 | 0.7 |
| L | 12.6 | 22.0 | 13.4 | 21.3 |
| M | 4.8 | 3.6 | 3.9 | 5.0 |
| N | 1.6 | 1.2 | 1.1 | 0.7 |
| P | 2.6 | 2.0 | 4.7 | 2.7 |
| Q | 0.9 | 0.6 | 1.4 | 0.6 |
| R | 0.6 | 0.6 | 2.3 | 1.3 |
| S | 6.7 | 2.6 | 5.5 | 3.6 |
| T | 5.8 | 5.0 | 5.7 | 3.1 |
| V | 8.5 | 14.4 | 9.2 | 11.4 |
| W | 2.6 | 2.8 | 1.7 | 6.9 |
| Y | 3.0 | 2.7 | 3.2 | 4.6 |

The intra- and extra-cellular parts of the TMs were taken to be the first five residues. The combined intra- and extra-cellular parts are defined as the terminal region. The remaining part of the TM domain is defined as the central region.
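The terminal/central partition used for Table 3 can be sketched as a simple slicing operation. This sketch assumes that "the first five residues" means five residues counted inward from each end of the TM segment; the function name and the handling of very short segments are choices made here, not details given in the text.

```python
def split_regions(tm, n_terminal=5):
    """Split a TM segment into (terminal, central) residue lists.

    The first and last n_terminal residues together form the terminal region;
    everything in between is the central region. Segments too short to have a
    central part are treated as entirely terminal (an assumption of this sketch).
    """
    if len(tm) <= 2 * n_terminal:
        return list(tm), []
    terminal = list(tm[:n_terminal]) + list(tm[-n_terminal:])
    central = list(tm[n_terminal:-n_terminal])
    return terminal, central

# Placeholder 20-residue string (letters used only to show the positions).
terminal, central = split_regions("ABCDEFGHIJKLMNOPQRST")
```

The region-specific compositions then follow by applying the same counting as for the whole TM to each residue list separately.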

Analysis of the MSA of GPCRs

The GPCR-specific analysis yields the same qualitative results as that of all the TMs in the database (Fig. 2). Thus, as determined by the structure-based method, residues F, I, L, V, W and Y are enriched on the TM surface. When determined with the rhodopsin-alignment-based method, residues F, I, L, V, W, Y and K occur more predominantly on the TM surface. However, the exact magnitude of the enrichment varies between the two methods. Most notably, glycine has a clear preference for being buried, as determined from the structures. In the rhodopsin-alignment analysis, this preference appears less pronounced. In contrast, asparagine has a small preference for being buried based on the structural analysis, but in the rhodopsin alignment, the interior preference is very large. Almost all asparagines in the rhodopsin alignment occur at five highly conserved positions. The fact that these conserved positions are all buried results in the observed large interior preference. No conserved glutamines are found in the rhodopsin TMs, and hence glutamine has no extreme preference for the interior or the surface.

Fig. 2. Amino acid composition (%) of 328 GPCRs (white), their predicted interiors (gray) and their predicted surfaces (black).

The surface propensity scale: development, validation and application

Development of the prediction method

The results obtained above make possible the development of a surface propensity (SP) scale. For the purpose of developing this knowledge-based scale, it is more useful to describe the inside/outside propensity of amino acids by determining the surface fraction (SF), i.e. the fraction of an amino acid type that is located on the protein's surface. SFs calculated for an H2O probe and a cut-off of 0.10 are given in Table 4 (column 2). The SP scale was developed based on the SFs. The scale reflects the probability of finding a residue on the surface of the TM protein. To enable the use of the scale in combination with a method developed previously for determining the extent of conservation (Visiers et al., 2002), the SP values were normalized by setting the SP of the residue type with the lowest SF (His) to 0, and that with the highest SF (Trp) to 1. On this scale, the SP values for all other residue types were then calculated as:

$$
SP_X = \frac{SF_X - SF_{HIS}}{SF_{TRP} - SF_{HIS}} \tag{1}
$$
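As a worked illustration of Equation (1), the sketch below applies the normalization to the structure-derived surface fractions listed in Table 4 (column 2). The function name `sp_scale` is introduced here. Because the SP values reported in Table 4 are jack-knife averages, they need not match this direct calculation in every last digit.

```python
# Structure-derived surface fractions (%), copied from Table 4, column 2.
SF = {
    "A": 36.25, "C": 39.13, "D": 29.27, "E": 31.37, "F": 50.49,
    "G": 25.15, "H": 20.30, "I": 53.81, "K": 55.26, "L": 56.68,
    "M": 44.14, "N": 34.72, "P": 35.80, "Q": 33.33, "R": 39.06,
    "S": 27.05, "T": 37.34, "V": 52.94, "W": 60.00, "Y": 47.42,
}

def sp_scale(sf, low="H", high="W"):
    """Equation (1): rescale surface fractions so that `low` -> 0 and `high` -> 1."""
    span = sf[high] - sf[low]
    return {aa: (value - sf[low]) / span for aa, value in sf.items()}

SP = sp_scale(SF)
# Example: alanine gives (36.25 - 20.30) / (60.00 - 20.30), roughly 0.40.
```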

A ‘jack-knife’ approach was used to determine SDs for the SP scale, i.e. 28 different scales were calculated by removing one MP structure at a time from the database. The resulting values for the SP are an average of the 28 scales and are shown in the third column of Table 4. The SDs are small (<0.06) for all residue types, indicating the robustness of the data set. The correlation of the SP scale with various other amino acid propensity scales is shown in Table 5. The correlation coefficient c is calculated according to Tomii and Kanehisa (1996) as

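$$
c = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2}} \tag{2}
$$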
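Equation (2) is read here as the standard Pearson correlation coefficient between two scales, with x and y holding the values of the two scales for the same n residue types. The following minimal sketch implements that reading; the function name and the small numeric vectors are illustrative and are not values from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator

# Illustrative vectors: perfectly correlated, and perfectly anti-correlated.
same_trend = pearson([1, 2, 3], [2, 4, 6])
opposite_trend = pearson([1, 2, 3], [3, 2, 1])
```

To compare two propensity scales, x and y would be the 20 scale values listed in the same residue order.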

The correlation between the scales developed from the structures or from the rhodopsin alignments is 0.73. High correlations with the structure-derived propensity scale are also found for several scales representing hydrophobicity or volume, and with the kPROT SP scale. Notably, there is no correlation with a SP scale derived from periodicity analysis of MP sequences (Samatey et al., 1995).

Table 4. Surface fractions and the SP scales

| Type | Structure SF | Structure SP scale | Rhodopsin SP scale |
|---|---|---|---|
| A | 36.25 | 0.40 ± 0.03 | 0.58 |
| C | 39.13 | 0.47 ± 0.04 | 0.38 |
| D | 29.27 | 0.23 ± 0.06 | 0.00 |
| E | 31.37 | 0.28 ± 0.05 | 0.08 |
| F | 50.49 | 0.76 ± 0.02 | 0.99 |
| G | 25.15 | 0.12 ± 0.03 | 0.61 |
| H | 20.30 | 0.00 ± 0.00 | 0.32 |
| I | 53.81 | 0.84 ± 0.02 | 1.00 |
| K | 55.26 | 0.88 ± 0.05 | 0.57 |
| L | 56.68 | 0.92 ± 0.02 | 0.98 |
| M | 44.14 | 0.60 ± 0.02 | 0.50 |
| N | 34.72 | 0.36 ± 0.03 | 0.01 |
| P | 35.80 | 0.39 ± 0.04 | 0.70 |
| Q | 33.33 | 0.33 ± 0.05 | 0.47 |
| R | 39.06 | 0.47 ± 0.04 | 0.47 |
| S | 27.05 | 0.17 ± 0.03 | 0.40 |
| T | 37.34 | 0.43 ± 0.03 | 0.58 |
| V | 52.94 | 0.82 ± 0.03 | 0.92 |
| W | 60.00 | 1.00 ± 0.00 | 0.77 |
| Y | 47.42 | 0.68 ± 0.03 | 0.84 |

SDs in column 3 are determined by a ‘jack-knife’ approach, i.e. 28 different scales were calculated by removing one MP structure at a time from the database.
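The jack-knife procedure behind the SDs in Table 4 can be sketched as a leave-one-out loop over the proteins in the database. The version below rests on several assumptions that the text does not spell out: surface fractions are obtained by pooling residue counts over the remaining proteins, the population standard deviation is used, and the per-protein counts are invented toy numbers with only three residue types.

```python
from statistics import mean, pstdev

def pooled_surface_fractions(proteins):
    """Pool (n_surface, n_total) counts over proteins; return surface fractions."""
    surface, total = {}, {}
    for counts in proteins:
        for aa, (n_surf, n_all) in counts.items():
            surface[aa] = surface.get(aa, 0) + n_surf
            total[aa] = total.get(aa, 0) + n_all
    return {aa: surface[aa] / total[aa] for aa in total}

def sp_scale(sf, low="H", high="W"):
    """Normalize surface fractions so that `low` maps to 0 and `high` to 1."""
    span = sf[high] - sf[low]
    return {aa: (value - sf[low]) / span for aa, value in sf.items()}

def jackknife_sp(proteins):
    """Leave one protein out at a time; return {residue: (mean SP, SD)}."""
    scales = [
        sp_scale(pooled_surface_fractions(proteins[:i] + proteins[i + 1:]))
        for i in range(len(proteins))
    ]
    return {
        aa: (mean([s[aa] for s in scales]), pstdev([s[aa] for s in scales]))
        for aa in scales[0]
    }

# Hypothetical per-protein counts: residue -> (surface count, total count).
proteins = [
    {"H": (1, 10), "A": (4, 10), "W": (6, 10)},
    {"H": (2, 10), "A": (3, 10), "W": (7, 10)},
    {"H": (3, 10), "A": (5, 10), "W": (5, 10)},
]
result = jackknife_sp(proteins)
```

Because every leave-one-out scale is pinned to 0 at His and 1 at Trp, those two residue types have an SD of exactly zero, as in Table 4.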

Table 5. Correlation of the structure-derived SP scale with other propensity scales

| Scale | Correlation |
|---|---|
| Apparent partial specific volume (BULH740102) (Bull and Breese, 1974) | 0.84 |
| Hydrophobicity factor (GOLD730101) (Goldsack and Chalifoux, 1973) | 0.84 |
| Bulkiness (ZIMJ680102) (Zimmerman et al., 1968) | 0.81 |
| Transfer free energy to surface (BULH740101) (Bull and Breese, 1974) | 0.77 |
| Partial specific volume (COHE430101) (Cohn and Edsall, 1943) | 0.75 |
| Hydrophobicity index (ARGP820101) (Argos et al., 1982) | 0.75 |
| Hydrophobicity (JOND750101) (Jones, 1975) | 0.75 |
| Rhodopsin-alignment scale | 0.73 |
| kPROT (Pilpel et al., 1999) | 0.70 |
| Hydropathy index (KYTJ820101) (Kyte and Doolittle, 1982) | 0.52 |
| Consensus normalized hydrophobicity scale (EISD840101) (Eisenberg et al., 1982) | 0.41 |
| (Samatey et al., 1995) propensity scale | 0.02 |

Accession codes from the AAindex database are given in parentheses (Tomii and Kanehisa, 1996; Kawashima and Kanehisa, 2000).

The method for determining the conservation index (CI) has been described previously (Gorodkin et al., 1997; Shi et al., 2001; Visiers et al., 2002). In brief, the calculation of a CI requires the estimation of the probability for the presence of a set of N different amino acids from a set of pairwise distribution probabilities (Overington et al., 1992). There is a well-defined mathematical formula for such a calculation in the theory of polytopes. A polytope is defined as a closed geometrical object in an N − 1 dimensional space defined by [N(N − 1)]/2 distances. The volume of this object, V, estimates the CI at the position in the alignment. The volume of the polytope is given by

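$$V\{AA_1, AA_2, \ldots, AA_N\} = \sqrt{\frac{(-1)^N}{2^{N-1}\,[(N-1)!]^2}\begin{vmatrix} 0 & 1 & 1 & \cdots & 1 \\ 1 & 1 & P_{1,2}^2 & \cdots & P_{1,N}^2 \\ 1 & P_{2,1}^2 & 1 & \cdots & P_{2,N}^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & P_{N,1}^2 & P_{N,2}^2 & \cdots & 1 \end{vmatrix}}. \tag{3}$$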
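The bordered-determinant form of Equation (3) can be evaluated numerically as in the sketch below. It is a minimal illustration, not the ProperTM implementation: the function names are hypothetical, the caller supplies the full N × N matrix of squared pairwise terms (including its diagonal), and the derivation of the pairwise terms from substitution tables is not attempted. The usage example assumes ordinary Euclidean distances with a zero diagonal, for which the expression is the standard Cayley–Menger volume of a simplex.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def polytope_volume(squared_terms):
    """Volume from the bordered determinant, given an N x N matrix of P_ij^2."""
    n = len(squared_terms)
    if any(len(row) != n for row in squared_terms):
        raise ValueError("expected a square matrix")
    # Border the matrix with a leading row and column of ones and a zero corner.
    bordered = [[0.0] + [1.0] * n]
    bordered += [[1.0] + list(row) for row in squared_terms]
    prefactor = (-1) ** n / (2 ** (n - 1) * math.factorial(n - 1) ** 2)
    v_squared = prefactor * determinant(bordered)
    # Negative values (round-off or inputs that cannot be embedded) are clipped.
    return math.sqrt(max(v_squared, 0.0))

# Assumed example: equilateral triangle with unit sides (squared distances = 1).
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
area = polytope_volume(triangle)  # equals sqrt(3)/4 for this input
```

For N = 2 the expression reduces to the distance between the two points, which gives a quick sanity check of the prefactor.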

To differentiate among cases where the total number of amino acids that appear at a given position is the same, but the frequencies are different, a second term [the information content (IC)] is added to V. This term takes into account the difference in conservation for various distributions, e.g. that a distribution of 98:1:1 is more conserved than a distribution of 40:30:30 even though three different residues are present in both cases. The IC for each position is calculated as

$$IC_i = \sum_{a \in A} f_{ia} \log_2 \frac{f_{ia}}{p_a}, \tag{4}$$

where A is the set of residues present at position i, $f_{ia}$ is the frequency of residue a at position i, and $p_a$ is the a priori distribution of the residues for the environmental and structural context (taken to be 0.05 for each of the 20 amino acids). To integrate these two scales, we take the average of V and IC to be the conservation index (CI):

$$CI = \frac{V + IC}{2}. \tag{5}$$
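Equations (4) and (5) translate directly into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, gaps are assumed to be written as '-' and are ignored when computing frequencies, and V is passed in as a number because its calculation is a separate step.

```python
import math

BACKGROUND = 0.05  # a priori probability of each of the 20 amino acids

def information_content(column, background=BACKGROUND):
    """IC of one alignment column (Equation 4); gaps ('-') are ignored."""
    residues = [r for r in column if r != "-"]
    if not residues:
        raise ValueError("column contains no residues")
    counts = {}
    for r in residues:
        counts[r] = counts.get(r, 0) + 1
    total = len(residues)
    return sum((n / total) * math.log2((n / total) / background)
               for n in counts.values())

def conservation_index(volume, ic):
    """CI as the average of the polytope term V and IC (Equation 5)."""
    return (volume + ic) / 2.0

skewed = "A" * 98 + "G" + "S"            # 98:1:1 distribution
spread = "A" * 40 + "G" * 30 + "S" * 30  # 40:30:30 distribution
ic_skewed = information_content(skewed)
ic_spread = information_content(spread)
```

For the two distributions mentioned in the text, the IC is roughly 4.16 for 98:1:1 and 2.75 for 40:30:30, so the term separates the two cases even though both contain three residue types. A fully conserved column reaches the maximum of log2(20).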

Prediction of the lipid-facing probability of residues in a protein sequence based on the SP scale requires an MSA. The prediction method involves calculation of the CI and the average value of the SP ($SP_{\text{av}}$) for each position in the MSA. The probability of finding a residue in the protein interior ($P_{\text{inside}}$) is then calculated as

$$P_{\text{inside}} = 0.5\,(CI - SP_{\text{av}}). \tag{6}$$

To assign positions in the protein sequence to the interior or to the surface of the TMs, a cut-off value for the $P_{\text{inside}}$ property has to be specified, so that:

$$\begin{aligned} &\text{if } P_{\text{inside}} > \text{cut-off: interior prediction,} \\ &\text{if } P_{\text{inside}} < \text{cut-off: surface prediction.} \end{aligned} \tag{7}$$

The dependence of the optimal choice of cut-off on the choice of alignment is discussed below. To determine the effect of combining conservation criteria and the SP scale in the prediction method, $P_{\text{inside}}$ was also calculated using only conservation criteria [Equation (8)], or using only the SP scale [Equation (9)]:

$$P^{CI}_{\text{inside}} = CI. \tag{8}$$

$$P^{SP}_{\text{inside}} = SP_{\text{av}}. \tag{9}$$
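The scoring and assignment steps of Equations (6) and (7) can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes that the position average of SP is the mean of the structure-derived SP values (Table 4, column 3) over the residues in an MSA column, that gaps are skipped, and that a score exactly equal to the cut-off is assigned to the surface, a case Equation (7) leaves open. The CI values in the usage lines are made up.

```python
# Structure-derived SP scale (Table 4, column 3).
SP_SCALE = {
    "A": 0.40, "C": 0.47, "D": 0.23, "E": 0.28, "F": 0.76,
    "G": 0.12, "H": 0.00, "I": 0.84, "K": 0.88, "L": 0.92,
    "M": 0.60, "N": 0.36, "P": 0.39, "Q": 0.33, "R": 0.47,
    "S": 0.17, "T": 0.43, "V": 0.82, "W": 1.00, "Y": 0.68,
}

def sp_average(column, scale=SP_SCALE):
    """Mean SP over the residues in one MSA column (gaps/unknowns skipped)."""
    values = [scale[r] for r in column if r in scale]
    if not values:
        raise ValueError("column contains no scorable residues")
    return sum(values) / len(values)

def p_inside(ci, sp_av):
    """Combined score of Equation (6)."""
    return 0.5 * (ci - sp_av)

def classify(score, cutoff):
    """Equation (7): above the cut-off is interior, otherwise surface."""
    return "interior" if score > cutoff else "surface"

column = "LLIVL"                  # one hypothetical MSA column
sp_av = sp_average(column)        # (3 * 0.92 + 0.84 + 0.82) / 5
label_conserved = classify(p_inside(1.9, sp_av), cutoff=0.40)  # assumed CI = 1.9
label_variable = classify(p_inside(0.9, sp_av), cutoff=0.40)   # assumed CI = 0.9
```

The single-criterion variants of Equations (8) and (9) correspond to passing CI or the SP average directly to `classify` in place of the combined score.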

Incorporation of the algorithm in ProperTM. The prediction method has been incorporated in a suite of programs named ProperTM (http://icb.med.cornell.edu/services/propertm/start), which allows for user-driven sequential applications of various algorithms that have been applied broadly and have been validated repeatedly, e.g. by subsequently determined structures (for a review, see Visiers et al., 2002). In brief, these methods include the calculation of properties (SP, conservation, hydrophobicity, etc.) associated with positions in an MSA. A Fourier transform (FT) can be calculated for each of these properties, and used to predict secondary structure elements (Komiya et al., 1988; Donnelly and Cogdell, 1993). The methods in ProperTM have been designed to analyze MPs, but some of the applications (e.g. the calculation of a conservation index) should be useful for the analysis of non-MPs as well.
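The Fourier analysis of a per-position property can be illustrated with a short, generic sketch. It is not ProperTM's implementation: the function names are hypothetical, the profile is synthetic, and the expectation that an ideal α-helix shows a peak near 100° per residue (3.6 residues per turn) is a textbook assumption rather than something taken from this section.

```python
import cmath
import math

def power_spectrum(profile, step_deg=1):
    """Power of a per-position property profile at angles 0..180 degrees."""
    mean = sum(profile) / len(profile)
    centered = [p - mean for p in profile]  # remove the constant offset
    spectrum = {}
    for angle in range(0, 181, step_deg):
        omega = math.radians(angle)
        total = sum(v * cmath.exp(1j * omega * k)
                    for k, v in enumerate(centered))
        spectrum[angle] = abs(total) ** 2
    return spectrum

def dominant_angle(profile):
    """Angle (degrees per residue) with the highest power."""
    spectrum = power_spectrum(profile)
    return max(spectrum, key=spectrum.get)

# Synthetic profile that repeats every 3.6 positions (100 degrees per residue).
profile = [math.cos(math.radians(100 * k)) for k in range(25)]
peak = dominant_angle(profile)
```

A dominant angle close to 100° suggests helical periodicity in the property, while a β-strand-like alternation would peak much closer to 180°.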

Validation of the method. The method has been tested on 11 different proteins, each time using an SP scale that has been developed without knowledge of the specific protein for which the predictions are made. Results from two measures of performance, accuracy and receiver-operating characteristic (ROC) area, are given in Tables 6 and 7. The ROC curve is a plot of the sensitivity versus (1 − specificity), and the area under the ROC curve is generally considered to be a measure of the overall accuracy of the method, and has the advantage of being independent of a choice of cut-off (Hanley and McNeil, 1982; Rosner, 2000).

Table 6. Prediction results for different alignments of rhodopsin

| Minimum similarity to rhodopsin (%) | No. of sequences in alignment | Optimal cut-off [Equation (7)] | Accuracy (%) | Sensitivity (%) | Specificity (%) | Area under ROC curve |
|------|------|------|------|------|------|------|
| >15 | 139 | 0.34 | 76 | 71 | 81 | 0.81 |
| >20 | 99 | 0.34 | 80 | 79 | 81 | 0.83 |
| >25 | 83 | 0.36 | 81 | 79 | 80 | 0.83 |
| >30 | 73 | 0.40 | 81 | 79 | 81 | 0.85 |
| >35 | 68 | 0.40 | 79 | 82 | 76 | 0.83 |
| >40 | 59 | 0.41 | 78 | 84 | 71 | 0.81 |

The accuracy is calculated as (TP + TN)/(TP + TN + FP + FN), the specificity as TP/(TP + FN) and the sensitivity as TN/(TN + FN), where TP is a correctly predicted interior position and TN is a correctly predicted lipid-facing position. The accuracy, sensitivity and specificity values are calculated for the optimal cut-off for Equation (7) (column 3, see text). The ROC curve is a plot of the sensitivity versus (1 − specificity), where different points of the curve correspond to cut-off points used to designate interior or lipid-facing residues (see text).
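The two performance measures can be sketched in a few lines. This is an illustration with hypothetical names and made-up scores, not the evaluation code used for the tables. The ROC area is computed with the pairwise ranking formulation (the probability that a randomly chosen interior position scores higher than a randomly chosen lipid-facing one), which is equivalent to the area under the curve and is substituted here for tracing the curve point by point.

```python
def accuracy(scores, is_interior, cutoff):
    """(TP + TN) / (TP + TN + FP + FN) at a given cut-off."""
    correct = sum((s > cutoff) == truth
                  for s, truth in zip(scores, is_interior))
    return correct / len(scores)

def roc_area(scores, is_interior):
    """Area under the ROC curve via pairwise ranking (ties count one half)."""
    pos = [s for s, t in zip(scores, is_interior) if t]
    neg = [s for s, t in zip(scores, is_interior) if not t]
    if not pos or not neg:
        raise ValueError("need at least one interior and one surface position")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores and observed orientations (True = interior).
scores = [0.62, 0.55, 0.41, 0.38, 0.30, 0.12]
observed = [True, True, False, True, False, False]
acc = accuracy(scores, observed, cutoff=0.40)  # 4 of 6 positions correct
auc = roc_area(scores, observed)               # 8 of 9 pairs ranked correctly
```

Unlike the accuracy, the ROC area does not change when the cut-off is moved, which is the property the text highlights.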

The effect of the choice of sequences for the MSA was evaluated for each of the proteins, and alignments with a similarity threshold of 20, 25, 30, 35 and 40% (no highly similar sequences, no fragments) were used for the prediction. The prediction results for rhodopsin using different alignments are shown in Table 6. For rhodopsin the optimal similarity threshold is 30%, and this optimum is similar for most other MPs (data not shown). However, some of the MPs studied have few homologs (bacteriorhodopsin, halorhodopsin, sensory rhodopsin, LacY permease) and for these MPs the similarity threshold has to be as low as 25% to include sufficient sequences (>20) for accurate predictions. The choice of cut-off for Equation (7) depends strongly on the choice of MSA. The optimal cut-off for MSAs with similarity thresholds ≥30% is ∼0.40. For alignments with more remote sequences (>25%), the optimal cut-off is higher (∼0.50).
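How an optimal cut-off might be located for a given alignment can be sketched as a simple scan. The criterion used here, maximizing accuracy over a grid of candidate cut-offs, is an assumption, since the procedure is not spelled out in this section. The function name, the grid and the example scores are likewise hypothetical.

```python
def best_cutoff(scores, is_interior, candidates=None):
    """Return (cutoff, accuracy) maximizing accuracy; ties keep the lowest cut-off."""
    if candidates is None:
        candidates = [i / 100 for i in range(0, 101)]  # 0.00, 0.01, ..., 1.00
    best = None
    for cutoff in candidates:
        correct = sum((s > cutoff) == t for s, t in zip(scores, is_interior))
        acc = correct / len(scores)
        if best is None or acc > best[1]:
            best = (cutoff, acc)
    return best

# Hypothetical scores and observed orientations (True = interior).
scores = [0.62, 0.55, 0.45, 0.38, 0.30, 0.12]
observed = [True, True, True, False, False, False]
cutoff, acc = best_cutoff(scores, observed)
```

Because the scores depend on the alignment, repeating the scan for alignments built with different similarity thresholds would be expected to give different optima, in line with the dependence described above.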

The accuracy values for all 11 MPs range between 64 and 81%, and for 9 out of 11 MPs they are >75% (Table 7). There is a strong dependence of the performance of the method on the number of available homologs. Thus, for the protein with the fewest available homologs (LacY permease) the accuracy is lowest (64%), while for rhodopsin, where many homologs are available, the accuracy is high (81%). The method performs very well for rhodopsin, aquaporin, the glycerol transporter, the Ca2+–ATPase and the ACRB transporter, for which the ROC areas are all >0.81 and the maximum accuracy lies between 75 and 81%. The prediction quality for bacteriorhodopsin is also high (ROC area 0.86), but this value is biased due to the relatively small number of buried residues (assigning all positions to be lipid-exposed yields an accuracy of ∼60%). This bias becomes apparent in comparing the results with the more tightly packed halorhodopsin and sensory rhodopsin, which have similar numbers of sequence homologs, but lower-quality predictions. The results can be considered poor for the LacY permease, most likely because of the low number of available sequence homologs, while predictions for another family member, GlpT (for which more homologs are known), are much better (both with the CI and the knowledge-based scale). The poor predictions for the vitamin-B12 transporter, for which many sequence homologs are available, cannot be attributed to the same deficiency, and the reason for the relatively lower accuracy must lie elsewhere.
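The ∼60% baseline quoted for bacteriorhodopsin follows from the accuracy definition. As a worked check, take the interior fraction f = 0.41 listed for bacteriorhodopsin in Table 7. A predictor that labels every position lipid-exposed makes no interior predictions, so TP = FP = 0, every lipid-facing position is a TN and every interior position is an FN:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TN}{TN + FN} = 1 - f = 1 - 0.41 = 0.59 \approx 60\%.$$

For a protein with about half of its positions in the interior, the same trivial predictor reaches only about 50%, which is why equal accuracies are not directly comparable between proteins with different interior fractions.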

For the great majority of proteins, introduction of the SP scale according to Equation (6) improves the quality of prediction over the use of CI alone. Thus, in 10 out of 11 cases the ROC area is higher for the combination of CI and SP [Equation (6)] than for conservation alone [Equation (8)]. Large increases (>0.05) in ROC values due to the combination of SP and CI are observed for rhodopsin, aquaporin, the glycerol transporter and the glucose-6-phosphate transporter. In other cases, the predictions using the combination of conservation and the knowledge-based scale are only marginally better, or similar to those using only conservation.

The ROC curves for rhodopsin are shown in Figure 3, while Figure 4 puts the prediction in the context of the secondary structure by presenting the details for rhodopsin on a helical net. The quality of predictions for rhodopsin using either the combination of conservation and SP [Equation (6)] or only conservation [Equation (8)] is similar in six of the seven helices. However, for TM2, the use of the combination leads to a dramatic improvement in the quality of prediction, and instead of 10, only three residues are found to be predicted incorrectly.

Table 7. Prediction results for 11 MPs

| Protein | PDB code | Similarity threshold (%) | No. of sequences | Percentage of interior residues | Optimal cut-off [Equation (7)] | Accuracy (%) for (CI + SP)/2 [Equation (6)] | Accuracy (%) for CI [Equation (8)] | Accuracy (%) for SP [Equation (9)] | ROC area for (CI + SP)/2 [Equation (6)] | ROC area for CI [Equation (8)] | ROC area for SP [Equation (9)] |
|------|------|------|------|------|------|------|------|------|------|------|------|
| 1. Rhodopsin | 1f88 | >30 | 73 | 50 | 0.4 | 81 | 74 | 71 | 0.85 | 0.8 | 0.67 |
| 2. Aquaporin | 1j4n | >30 | 82 | 54 | 0.39 | 75 | 72 | 70 | 0.81 | 0.75 | 0.75 |
| 3. Glycerol transporter | 1fx8 | >30 | 44 | 51 | 0.4 | 80 | 79 | 71 | 0.86 | 0.81 | 0.67 |
| 4. Ca2+–ATPase | 1eul | >30 | 61 | 51 | 0.34 | 77 | 74 | 63 | 0.81 | 0.79 | 0.6 |
| 5. ACRB transporter | 1iwg | >30 | 96 | 56 | 0.38 | 79 | 76 | 64 | 0.82 | 0.81 | 0.6 |
| 6. Glucose-6-phosphate transporter | 1pw4 | >30 | 37 | 52 | 0.34 | 73 | 67 | 69 | 0.75 | 0.68 | 0.68 |
| 7. Vitamin-B12 transporter | 1l7v | >30 | 54 | 58 | 0.31 | 69 | 68 | 64 | 0.76 | 0.74 | 0.66 |
| 8. LacY permease | 1pv7 | >25 | 11 | 46 | 0.6 | 64 | 61 | 57 | 0.65 | 0.64 | 0.55 |
| 9. Bacteriorhodopsin | 1ap9 | >25 | 26 | 41 | 0.8 | 80 | 81 | 66 | 0.86 | 0.87 | 0.56 |
| 10. Halorhodopsin | 1e12 | >25 | 22 | 44 | 0.48 | 75 | 74 | 66 | 0.78 | 0.76 | 0.57 |
| 11. Sensory-rhodopsin | 1h2s | >25 | 24 | 50 | 0.49 | 77 | 75 | 62 | 0.81 | 0.81 | 0.56 |

For each of the 11 proteins an alignment was generated using similarity thresholds of 30% (1–7) or 25% (8–11). The fraction of residues with a relative SASA <0.10 is reported in column 5. See the caption of Table 6 and the text for a description of accuracy and the ROC curve.

Fig. 3. ROC curves for prediction of rhodopsin residues (horizontal axis: 1 − specificity). The blue line corresponds to the prediction using both conservation and the knowledge-based scale [Equation (6)], the red line corresponds to the predictions using only conservation [Equation (8)] and the green line corresponds to the predictions using only the knowledge-based scale [Equation (9)]. The black data-point corresponds to the 0.40 cut-off.

DISCUSSION

The method presented here for predicting which residues of a TM helix are exposed to the lipid or buried inside the protein is based on the use of a knowledge-based scale. The scale of residue propensities for facing the interior or the surface in MPs is of great significance for successful modeling of the proteins from sequence data and for understanding the structural principles of such proteins, because structure determination of MPs is still problematic. The general conclusions we present about the satisfactory quality of the predictions rest on detailed analysis of accuracy, specificity and sensitivity, which have been used to interpret the data, and the use of ROC curves to evaluate the overall performance of the method.

For globular proteins, the abundance of structural data has enabled the development of knowledge-based potentials that have been successfully applied to protein structure prediction (Lazaridis and Karplus, 2000). Application of such methods to the structure prediction of MPs has been slowed by the lack of sufficient structural information on these proteins. However, the determination of increasingly large numbers of MP structures, bringing the total of available structures of (α-helical) MPs to over 30, enabled the present derivation of a knowledge-based scale that assists in the prediction of exposed and buried residues in TM domains. The results show (Table 2 and Fig. 1) that not only the hydrophobic and aromatic residues F, I, L, V, W and Y, but also K, are enriched on the TM surface. In contrast, small and charged residues A, D, E, G, H, N, Q and S are preferentially buried, while C, R, M, P and T have intermediate properties.

These results from our method differ somewhat from earlier reports (Pilpel et al., 1999; Ulmschneider and Sansom, 2001). Ulmschneider and Sansom found no clear preference for large hydrophobic residues F, L, I, V to be exposed. They analyzed a smaller set of MPs (15), which might explain the observed differences. Pilpel et al. (1999) (kPROT scale) found a preference for aromatic residues to be buried rather than exposed, while A was found to be exposed rather than buried. Otherwise, their results correspond with those presented here (the correlation between the kPROT scale and the one presented here is 0.70).

The propensity of residue types to be buried or exposed in globular proteins was documented a long time ago (Chothia, 1976; Janin and Wodak, 1978; Wertz and Scheraga, 1978; Miller et al., 1987). We compared the residue composition of the surface and the interior of a set of 45 such proteins (Miller et al., 1987) to the values for 24 MPs obtained in this study (Table 2). It appears that the interiors of the two types of proteins are rather similar in composition. In contrast, the characteristics of the surfaces are completely different. As expected, the surface of MPs is very hydrophobic, whereas in globular proteins it is mostly polar. The (rare) charged residues that occur on the surface of MPs might be involved in functional properties of the structure, such as protein–protein interaction (oligomerization), similar to the enrichment in R, W and Y at the interface of globular proteins (Chakrabarti and Janin, 2002). In addition, R and K residues occurring at the terminal regions of the TM have been shown to extend their side chains towards the phospholipid headgroups (Ballesteros and Weinstein, 1992).

Globular proteins are considered to achieve their three-dimensional structure by burying hydrophobic residues in the protein core, while exposing polar residues on the protein surface. In contrast, the interior and exterior amino acid compositions of MPs appear to be relatively similar. The folding of an MP is probably less dependent on positioning specific amino acid types into their preferred physicochemical environments. On the other hand, the tight, specific packing of helices probably requires good steric complementarity to compensate for the lack of a hydrophobic driving force. The abundance of small amino acids in the MP interior might be important for this steric complementarity.

Fig. 4. Interior and lipid-facing residues in rhodopsin. The TM bundle is shown schematically on a helical net. The residue coloring is according to the observed or predicted interior (yellow) or lipid-facing (purple) residue orientation. In the upper panel, all purple residues have relative SASA >0.10, and are facing the lipid. Yellow residues have a relative SASA <0.10 and are thus defined interior. The TMs in the middle panel indicate the predictions using both conservation and the knowledge-based scale [Equation (6)]. Lower panel TMs are the predictions using only conservation [Equation (8)]. Incorrectly predicted positions are indicated with a thick blue circle.

Previous attempts to predict the interior and exposed faces of TMs involved the use of amino acid propensity scales to calculate helical moments. Hydrophobic moments were found to correlate with a solvent accessible surface for some, but not all, TMs in bacteriorhodopsin (Eisenberg et al., 1982, 1984). The use of a scale representing lipid-exposure propensities (parameterized from an analysis of sequences of single- and multi-spanning MPs) led to a higher accuracy prediction of the lipid-exposed face of a TM (Pilpel et al., 1999). On the other hand, the conservation moment was also found to correlate well with the buried face of a TM (Stevens and Arkin, 2001). Taken together, these observations suggest that the present method, combining conservation criteria with an amino acid propensity scale, improves the accuracy of prediction of exposed and buried faces of TMs. The improvement produced by the use of the conservation criterion in addition to the SP scale is due to the similarity in amino acid composition between the interior and exterior of MPs. A scale that accurately describes the amphipathicity of a helix in one case might fail in another. This explains not only the need for the added conservation criterion in the prediction algorithm [Equation (6)], but also the success of interpretation of the predictions in a three-dimensional context (α-helix). In some cases, conservation as a sole criterion is insufficient; one specific reason is the functional role of the protein interior (e.g. ligand-binding pockets in GPCRs). A prediction method taking into account only conservation cannot discriminate between exposed variable positions and variable positions in the interior that have evolved for subtype specificity.
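The helical-moment calculation used in those earlier approaches can be sketched as follows. The code follows the general form of the helical hydrophobic moment of Eisenberg et al. (1982), a vector sum of per-residue values placed at successive angles around the helix axis. The function name, the 100° step for an ideal α-helix and the synthetic input are illustrative assumptions, and any per-position property (hydrophobicity, SP or conservation) could be supplied as the values.

```python
import cmath
import math

def helical_moment(values, delta_deg=100.0):
    """Magnitude and direction (degrees, 0-360) of the helical moment."""
    delta = math.radians(delta_deg)
    # Each residue contributes a vector of length `value` at angle k * delta.
    vector = sum(v * cmath.exp(1j * delta * k) for k, v in enumerate(values))
    direction = math.degrees(cmath.phase(vector)) % 360.0
    return abs(vector), direction

# Synthetic 18-residue profile whose high values face 40 degrees on the wheel.
values = [math.cos(math.radians(100 * k - 40)) for k in range(18)]
magnitude, direction = helical_moment(values)
```

A large magnitude indicates that high values cluster on one face of the helix, and the direction identifies that face. A profile with no periodicity gives a moment close to zero, for which the direction carries no information.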

This new method requires an MSA of sequences similar to the target sequence. As with all such methods based on an MSA, the particular choice of sequences (both their number and the level of similarity to the protein of interest) can greatly influence the results. The results presented here for rhodopsin (Table 6) indicate that the MSA should include a large number of sequences (40–100), but with a high enough similarity to ensure that they have a similar structure (>30%).
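Selecting sequences for the MSA by a similarity threshold can be sketched as below. This is a rough illustration under stated assumptions: percent identity over columns where both sequences have a residue is used as a stand-in for the similarity measure, which is not defined in this section. The function names and sequences are hypothetical, and the exclusion of fragments and of highly similar sequences is not modeled.

```python
def percent_identity(a, b):
    """Identity (%) over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def select_sequences(target, aligned, min_identity=30.0):
    """Keep aligned sequences whose identity to the target exceeds the threshold."""
    return [s for s in aligned if percent_identity(target, s) > min_identity]

# Hypothetical aligned sequences (same length as the target).
target = "ACDEFGHIKL"
aligned = [
    "ACDEFGHIKL",  # 100% identical
    "ACDQQQQQQQ",  # 30% identical: not above the threshold
    "ACDEQQQQQQ",  # 40% identical
    "MMMMMMMMMM",  # 0% identical
]
selected = select_sequences(target, aligned, min_identity=30.0)
```

After filtering, the number of retained sequences can be compared with the range suggested in the text (40–100) before the alignment is used for prediction.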

While we have used in the present study 28 MPs for which structures were available, comprising 256 TMs and 5776 residues, the size of this dataset could still be insufficient for an accurate description of the amino acid composition of the TM surface and interior (perhaps especially for charged residues, which do not occur frequently in MPs). To achieve an alternative perspective, we also analyzed the structure of rhodopsin in combination with an MSA, thus probing the surface and interior properties with a larger data set (>50 000 residues). Notably, there is quantitative correspondence between the results from the structural analysis and from the rhodopsin alignment. The absolute values of the propensities vary somewhat between the two analyses, but the trends are the same. The rhodopsin analysis might be biased by the occurrence of highly conserved (polar) residues, but this bias might be eliminated by including a large number of sequences in the MSA. Because the use of a large number of sequences will decrease the conservation in the set, it will increase the error due to the different structural characteristics these proteins might have. When more structures of MPs in the same class become available, further increasing the size of the structural dataset will decrease the bias due to conservation.

A more significant increase in prediction accuracy using an empirically derived scale could come from the development of scales that take into account the position of residues within the membrane. This is suggested by our results showing that inside/outside distributions differ between regions (i.e. central versus terminal) in the TM domain. However, such a refinement of the scale must await further expansion of the number of solved MP structures, to improve the statistical significance of the observations. The importance of such improvements from scale-based methods stems from the expectation that prediction methods based on sequence alone (i.e. conservation) will not increase further in accuracy, given the already large number of available sequences.

ACKNOWLEDGEMENTS

We thank Drs Jonathan Javitch, Marc Ceruso and Fabien Campagne for critical reading of this manuscript. We thank Dr Lucy Skrabanek and Piali Mukherjee for help with the development of ProperTM. The work is supported by NIH grants DA-124080 and DA-12923.

REFERENCES

Adamian,L., Jackups,R., Binkowski,T.A. and Liang,J. (2003) Higher-order interhelical spatial interactions in membrane proteins. J. Mol. Biol., 327, 251–272.

Adamian,L. and Liang,J. (2001) Helix–helix packing and interfacial pairwise interactions of residues in membrane proteins. J. Mol. Biol., 311, 891–907.

Argos,P., Rao,J.K. and Hargrave,P.A. (1982) Structural prediction of membrane-bound proteins. Eur. J. Biochem., 128, 565–575.

Ballesteros,J.A. and Weinstein,H. (1992) Analysis and refinement of criteria for predicting the structure and relative orientations of transmembranal helical domains. Biophys. J., 62, 107–109.

Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

Bertaccini,E. and Trudell,J.R. (2002) Predicting the transmembrane secondary structure of ligand-gated ion channels. Protein Eng., 15, 443–454.

Bowie,J.U. (1997) Helix packing in membrane proteins. J. Mol. Biol., 272, 780–789.

Bull,H.B. and Breese,K. (1974) Surface tension of amino acid solutions: a hydrophobicity scale of the amino acid residues. Arch. Biochem. Biophys., 161, 665–670.

Chakrabarti,P. and Janin,J. (2002) Dissecting protein–protein recognition sites. Proteins, 47, 334–343.

Chen,C.P., Kernytsky,A. and Rost,B. (2002) Transmembrane helix predictions revisited. Protein Sci., 11, 2774–2791.

Choma,C., Gratkowski,H., Lear,J.D. and DeGrado,W.F. (2000) Asparagine-mediated self-association of a model transmembrane helix. Nat. Struct. Biol., 7, 161–166.

Chothia,C. (1976) The nature of the accessible and buried surfaces in proteins. J. Mol. Biol., 105, 1–12.

Cohn,E.J. and Edsall,J.T. (1943) Protein, Amino Acid, and Peptides. Reinhold, New York.

Dahl,S.G., Kristiansen,K. and Sylte,I. (2002) Bioinformatics: from genome to drug targets. Ann. Med., 34, 306–312.

Donnelly,D. and Cogdell,R.J. (1993) Predicting the point at which transmembrane helices protrude from the bilayer: a model of the antenna complexes from photosynthetic bacteria. Protein Eng., 6, 629–635.

Eilers,M., Patel,A.B., Liu,W. and Smith,S.O. (2002) Comparison of helix interactions in membrane and soluble alpha-bundle proteins. Biophys. J., 82, 2720–2736.

Eilers,M., Shekar,S.C., Shieh,T., Smith,S.O. and Fleming,P.J. (2000) Internal packing of helical membrane proteins. Proc. Natl Acad. Sci., USA, 97, 5796–5801.

Eisenberg,D., Weiss,R.M. and Terwilliger,T.C. (1982) The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature, 299, 371–374.

Eisenberg,D., Weiss,R.M. and Terwilliger,T.C. (1984) The hydrophobic moment detects periodicity in protein hydrophobicity. Proc. Natl Acad. Sci., USA, 81, 140–144.

Engelman,D.M. and Zaccai,G. (1980) Bacteriorhodopsin is an inside-out protein. Proc. Natl Acad. Sci., USA, 77, 5894–5898.

Farrens,D.L., Altenbach,C., Yang,K., Hubbell,W.L. and Khorana,H.G. (1996) Requirement of rigid-body motion of transmembrane helices for light activation of rhodopsin. Science, 274, 768–770.

Garavito,R.M. and White,S.H. (1997) Membrane proteins. Structure, assembly, and function: a panoply of progress. Curr. Opin. Struct. Biol., 7, 533–536.

Goldsack,D.E. and Chalifoux,R.C. (1973) Contribution of the free energy of mixing of hydrophobic side chains to the stability of the tertiary structure of proteins. J. Theor. Biol., 39, 645–651.

Gorodkin,J., Heyer,L.J., Brunak,S. and Stormo,G.D. (1997) Displaying the information contents of structural RNA alignments: the structure logos. Comput. Appl. Biosci., 13, 583–586.

Hanley,J.A. and McNeil,B.J. (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.


Horn,F., Weare,J., Beukers,M.W., Horsch,S., Bairoch,A., Chen,W., Edvardsen,O., Campagne,F. and Vriend,G. (1998) GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res., 26, 275–279.

Janin,J. and Wodak,S. (1978) Conformation of amino acid side-chains in proteins. J. Mol. Biol., 125, 357–386.

Jones,D.D. (1975) Amino acid properties and side-chain orientation in proteins: a cross correlation approach. J. Theor. Biol., 50, 167–183.

Kaback,H.R., Sahin-Toth,M. and Weinglass,A.B. (2001) The kamikaze approach to membrane transport. Nat. Rev. Mol. Cell. Biol., 2, 610–620.

Karlin,A. and Akabas,M.H. (1998) Substituted-cysteine accessibility method. Meth. Enzymol., 293, 123–145.

Kawashima,S. and Kanehisa,M. (2000) AAindex: amino acid index database. Nucleic Acids Res., 28, 374.

Komiya,H., Yeates,T.O., Rees,D.C., Allen,J.P. and Feher,G. (1988) Structure of the reaction center from Rhodobacter sphaeroides R-26 and 2.4.1: symmetry relations and sequence comparisons between different species. Proc. Natl Acad. Sci., USA, 85, 9012–9016.

Krogh,A., Larsson,B., von Heijne,G. and Sonnhammer,E.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol., 305, 567–580.

Kyte,J. and Doolittle,R.F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105–132.

Lazaridis,T. and Karplus,M. (2000) Effective energy functions for protein structure prediction. Curr. Opin. Struct. Biol., 10, 139–145.

Liang,J. (2002) Experimental and computational studies of determinants of membrane-protein folding. Curr. Opin. Chem. Biol., 6, 878–884.

Liu,W., Crocker,E., Siminovitch,D.J. and Smith,S.O. (2003) Role of side-chain conformational entropy in transmembrane helix dimerization of glycophorin A. Biophys. J., 84, 1263–1271.

Miller,S., Janin,J., Lesk,A.M. and Chothia,C. (1987) Interior and surface of monomeric proteins. J. Mol. Biol., 196, 641–656.

Monne,M., Hermansson,M. and von Heijne,G. (1999) A turn propensity scale for transmembrane helices. J. Mol. Biol., 288, 141–145.

Norregaard,L., Visiers,I., Loland,C.J., Ballesteros,J., Weinstein,H. and Gether,U. (2000) Structural probing of a microdomain in the dopamine transporter by engineering of artificial Zn2+ binding sites. Biochemistry, 39, 15836–15846.

Overington,J., Donnelly,D., Johnson,M.S., Sali,A. and Blundell,T.L. (1992) Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci., 1, 216–226.

Pilpel,Y., Ben-Tal,N. and Lancet,D. (1999) kPROT: a knowledge-based scale for the propensity of residue orientation in transmembrane segments. Application to membrane protein structure prediction. J. Mol. Biol., 294, 921–935.

Popot,J.L. and Engelman,D.M. (1990) Membrane protein folding and oligomerization: the two-stage model. Biochemistry, 29, 4031–4037.

Popot,J.L. and Engelman,D.M. (2000) Helical membrane protein folding, stability, and evolution. Annu. Rev. Biochem., 69, 881–922.

Rees,D.C., DeAntonio,L. and Eisenberg,D. (1989) Hydrophobic organization of membrane proteins. Science, 245, 510–513.

Rees,D.C. and Eisenberg,D. (2000) Turning a reference inside-out: commentary on an article by Stevens and Arkin entitled: “Are membrane proteins ‘inside-out’ proteins?” Erratum (1999) [Proteins, 36, 135–143.] Proteins, 38, 121–122.

Rosner,B. (2000) Fundamentals of Biostatistics. Pacific Grove, Duxbury.

Samatey,F.A., Xu,C. and Popot,J.L. (1995) On the distribution of amino acid residues in transmembrane alpha-helix bundles. Proc. Natl Acad. Sci., USA, 92, 4577–4581.

Sankararamakrishnan,R. and Weinstein,H. (2000) Molecular dynamics simulations predict a tilted orientation for the helical region of dynorphin A(1–17) in dimyristoylphosphatidylcholine bilayers. Biophys. J., 79, 2331–2344.

Sansom,M.S. and Weinstein,H. (2000) Hinges, swivels and switches: the role of prolines in signalling via transmembrane alpha-helices. Trends Pharmacol. Sci., 21, 445–451.

Senes,A., Gerstein,M. and Engelman,D.M. (2000) Statistical analysis of amino acid patterns in transmembrane helices: the GxxxG motif occurs frequently and in association with beta-branched residues at neighboring positions. J. Mol. Biol., 296, 921–936.

Shi,L., Simpson,M.M., Ballesteros,J.A. and Javitch,J.A. (2001) The first transmembrane segment of the dopamine D2 receptor: accessibility in the binding-site crevice and position in the transmembrane bundle. Biochemistry, 40, 12339–12348.

Silla,E., Villar,F., Nilsson,O., Pascual-Ahuir,J.L. and Tapia,O. (1990) Molecular volumes and surfaces of biomacromolecules via GEPOL: a fast and efficient algorithm. J. Mol. Graph., 8, 168–172, 151.

Silverman,B.D. (2003) Hydrophobicity of transmembrane proteins: spatially profiling the distribution. Protein Sci., 12, 586–599.

Stevens,T.J. and Arkin,I.T. (1999) Are membrane proteins “inside-out” proteins? Proteins, 36, 135–143.

Stevens,T.J. and Arkin,I.T. (2000a) Do more complex organisms have a greater proportion of membrane proteins in their genomes? Proteins, 39, 417–420.

Stevens,T.J. and Arkin,I.T. (2000b) Turning an opinion inside-out: Rees and Eisenberg’s commentary [Erratum (2000) Proteins, 38, 121–122.] on “Are membrane proteins ‘inside-out’ proteins?” [Erratum (1999) Proteins, 36, 135–143.] Proteins, 40, 463–464.

Stevens,T.J. and Arkin,I.T. (2001) Substitution rates in alpha-helical transmembrane proteins. Protein Sci., 10, 2507–2517.

Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

Tomii,K. and Kanehisa,M. (1996) Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng., 9, 27–36.

Ubarretxena-Belandia,I. and Engelman,D.M. (2001) Helical membrane proteins: diversity of functions in the context of simple architecture. Curr. Opin. Struct. Biol., 11, 370–376.

Ulmschneider,M.B. and Sansom,M.S. (2001) Amino acid distributions in integral membrane protein structures. Biochim. Biophys. Acta., 1512, 1–14.

Visiers,I., Ballesteros,J.A. and Weinstein,H. (2002) Three-dimensional representations of G protein-coupled receptor structures and mechanisms. Meth. Enzymol., 343, 329–371.


von Heijne,G. (1992) Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol., 225, 487–494.

Wallin,E. and von Heijne,G. (1998) Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci., 7, 1029–1038.

Wertz,D.H. and Scheraga,H.A. (1978) Influence of water on protein structure. An analysis of the preferences of amino acid residues for the inside or outside and for specific conformations in a protein molecule. Macromolecules, 11, 9–15.

White,S.H. and Wimley,W.C. (1999) Membrane protein folding and stability: physical principles. Annu. Rev. Biophys. Biomol. Struct., 28, 319–365.

Zhou,F.X., Merianos,H.J., Brunger,A.T. and Engelman,D.M. (2001) Polar residues drive association of polyleucine transmembrane helices. Proc. Natl Acad. Sci., USA, 98, 2250–2255.

Zimmerman,J.M., Eliezer,N. and Simha,R. (1968) The characterization of amino acid sequences in proteins by statistical methods. J. Theor. Biol., 21, 170–201.
