
Bioinformatics methods for topology prediction of membrane proteins

Konstantinos Tsirigos


©Konstantinos Tsirigos, Stockholm University, 2017 The cover picture shows my favourite sea in Greece, in Methoni, right next to the castle. The picture is kindly provided by Menelaos Meletzis. ISBN print: 978-91-7649-648-0 ISBN PDF: 978-91-7649-649-7 Printed in Sweden by US-AB, Stockholm, 2017

“There is nothing impossible to him who will try”

Alexander the Great

Contents

List of papers
Abstract
1. Introduction
1.1 Proteins
1.2 Biological membranes and membrane proteins
1.3 Transmembrane proteins
1.4 Structural classes of transmembrane proteins
1.4.1 Alpha-helical transmembrane proteins
1.4.2 Beta-barrels
1.5 Membrane targeting and insertion
1.6 The signal peptide
1.7 Transmembrane protein topology and topology prediction methods
1.8 Multiple Sequence Alignments and their applications in topology prediction
2. Materials and Methods
2.1 Algorithms
2.2 Markov Models
2.3 Hidden Markov Models
2.3.1 Hidden Markov Models in membrane protein topology prediction
2.4 Profile Hidden Markov Models
2.5 Databases
2.5.1 UniProt
2.5.2 PFAM
2.5.3 OMPdb
2.5.4 PDBTM
2.6 Reliability measures
3. Results and Discussion
3.1 Developing effective strategies for the prediction of alpha-helical membrane proteins at genome scale (paper I)
3.2 Improved topology prediction and discrimination of alpha-helical membrane proteins with TOPCONS2 (paper II)
3.3 Creating PSI-BLAST and jackhmmer profiles accurately and fast using PRODRES (paper III)
3.4 Improved topology prediction and discrimination of beta-barrel membrane proteins with PRED-TMBB2 (paper IV)
Sammanfattning på svenska (Summary in Swedish)
Acknowledgments
References

Abbreviations

ANN        Artificial Neural Network
BLAST      Basic Local Alignment Search Tool
BLOSUM     Blocks Substitution Matrix
DBN        Dynamic Bayesian Network
GPCR       G-protein-Coupled Receptor
HMM        Hidden Markov Model
MSA        Multiple Sequence Alignment
ORF        Open Reading Frame
pHMM       Profile Hidden Markov Model
PSI-BLAST  Position-Specific-Iterated BLAST
PSSM       Position-Specific Scoring Matrix
SVM        Support Vector Machine
SP         Signal Peptide
TM         Transmembrane


List of papers

The following papers are included in this thesis and are referred to by their roman numerals:

PAPER I: A guideline to proteome-wide alpha-helical membrane protein topology predictions. Tsirigos K.D.*, Hennerdal A.* & Elofsson A. Proteomics 14, 2282-2294, (2012).

PAPER II: The TOPCONS web server for consensus prediction of membrane protein topology and signal peptides. Tsirigos K.D.*, Peters C.*, Shu N.* & Elofsson A. Nucleic Acids Res 43(W1), W401-W407, (2015).

PAPER III: PRODRES: Fast protein searches using a protein domain reduced database. Pascarelli S.*, Tsirigos K.D.*, Shu N.*, Peters C.* & Elofsson A. Manuscript.

PAPER IV: PRED-TMBB2: Improved topology prediction and detection of beta-barrel outer membrane proteins. Tsirigos K.D., Elofsson A. & Bagos P.G. Bioinformatics 32, i665-i671, (2016).


The following publications are not included in this thesis:

PAPER V: Large tilts in transmembrane helices can be induced during tertiary structure formation. Virkki M., Boekel C., Illergård K., Peters C., Shu N., Tsirigos K.D., Elofsson A., von Heijne G. & Nilsson I. J Mol Biol 426, 2529-2538, (2014).

PAPER VI: Creation, Management and Expansion of Specialized Protein Databases. Babbitt P.C., Bagos P.G., Bairoch A., Bateman A., Chatonnet A., Chen M.J., Craik D., Finn R.D., Gloriam D., Haft D.H., Henrissat B., Holliday G.L., Isberg V., Kaas Q., Landsman D., Lenfant N., Manning G., Nagano N., Srinivasan N., O’Donovan C., Pruitt K.D., Sowdhamini R., Rawlings N.D., Saier M.H. Jr, Sharman J.L., Spedding M., Tsirigos K.D., Vastermark A. & Vriend G. Database eCollection 2015, bav063, (2015).

PAPER VII: Improved topology prediction using the terminal hydrophobic helices rule. Peters C., Tsirigos K.D., Shu N. & Elofsson A. Bioinformatics 32, 1158-1162, (2016).

PAPER VIII: Inclusion of dyad-repeat pattern improves topology prediction of transmembrane beta-barrel proteins. Hayat S., Peters C., Shu N., Tsirigos K.D. & Elofsson A. Bioinformatics 32, 1571-1573, (2016).

PAPER IX: GWAR: Robust Analysis and Meta-Analysis of Genome-Wide Association Studies. Dimou N.L., Tsirigos K.D., Elofsson A. & Bagos P.G. Bioinformatics, in press (2017).

PAPER X: Disprot 7.0: a major update in the database of disordered proteins. Piovesan D.*, Tabaro F.*, Micetic I., Necci M., Quaglia F., Oldfield C., Aspromonte M.C., Davey N.E., Davidovic R., Dosztanyi Z., Elofsson A., Gasparini A., Hatos A., Kajava A.V., Kalmar L., Leonardi E., Lazar T., Macedo-Ribeiro S., Castillo M.M., Meszaros A., Minervini G., Murvai N., Pujols J., Roche D.B., Salladini E., Schad E., Schramm A., Szabo B., Tantos A., Tonello F., Tsirigos K.D., Veljkovic N., Ventura S., Vranken W., Warholm P., Uversky V.N., Dunker K.A., Longhi S.†, Tompa P.†, Tosatto S.C.E.†. Nucleic Acids Res 45 (D1), D219-D227, (2017).

* Denotes joint-first authors. † Denotes joint co-corresponding authors. Reprints were made with permissions from publishers.


Abstract

Membrane proteins are key elements of the cell, since they are associated with a variety of biological functions crucial to its survival. They are implicated in cellular recognition and adhesion, act as molecular receptors, transport substrates through membranes and exhibit specific enzymatic activity.

This thesis is focused on integral membrane proteins, most of which contain transmembrane segments that form an alpha helix and are composed of mainly hydrophobic residues, spanning the lipid bilayer. A more specialized and less well-studied case is that of integral membrane proteins found in the outer membrane of Gram-negative bacteria and (presumably) in the outer envelope of mitochondria and chloroplasts, whose transmembrane segments are formed by amphipathic beta strands that create a closed barrel (beta-barrels).

The importance of transmembrane proteins, as well as the inherent difficulties in crystallizing them and obtaining their three-dimensional structures, dictates the need for computational algorithms and tools that allow reliable and fast prediction of their structural and functional features. In order to elucidate their function, we must acquire knowledge about their structure and topology in relation to the membrane. Therefore, a large number of computational methods have been developed to predict the transmembrane segments and the overall topology of transmembrane proteins.

In this thesis, I initially describe a large-scale benchmark of many topology prediction tools in order to devise a strategy that will allow for better detection of alpha-helical membrane proteins in a proteome. Then, I describe the construction of improved machine-learning algorithms and computer software for accurate topology prediction of transmembrane proteins and discrimination of such proteins from non-transmembrane proteins. Finally, I introduce a fast way to obtain a position-specific scoring matrix, which is essential for modern topology prediction methods.


1. Introduction

1.1 Proteins

Proteins consist of amino acids that are connected by peptide bonds and form a linear sequence. According to the central dogma of molecular biology [1, 2], they are synthesized based on the genetic information that is embedded in the DNA and is transferred to the ribosomal machinery through the RNA (Figure 1.1).

Figure 1.1. The central dogma of molecular biology. Continuous arrows show the flow of genetic information, while the dashed arrow refers to the special case of generating DNA from RNA that is observed in retroviruses.

Since the 1950s, when the pioneering work of Frederick Sanger opened up new horizons in protein sequencing [3, 4], information regarding protein sequences has been accumulating at a continually increasing pace. The numerous genome-sequencing programs of the last three decades have made it possible to deduce the amino acid sequences of millions of proteins, and the number continues to grow rapidly [5] (Figure 1.2). Despite that, a large fraction of the proteins determined through Open Reading Frame (ORF) identification still has an unknown function.

The biological functionality of a protein is primarily determined by its conformation, in other words the way in which the linear amino acid sequence is folded in space [6]. The three-dimensional (3D) structures of some thousands of proteins have been determined at atomic or subatomic resolution using X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy and, more recently, cryo-electron microscopy [7] (Figure 1.3).

Figure 1.2. The exponential growth of protein sequences that are deposited in UniProt, from its early releases up to the end of 2016 [5].

Experimental evidence clearly suggests that all the necessary information for the folding of a protein into its native state is entirely encoded in its sequence [8]. Consequently, several attempts have been made towards protein structure prediction directly from the amino acid sequence, but with limited success [9, 10].

Figure 1.3. The growth of determined three-dimensional structures of proteins that are deposited in the PDB database, from 1972 up to the end of 2016 [7].


1.2 Biological membranes and membrane proteins

Biological membranes are structures that can be seen as mechanisms of isolation, protection and partitioning of the cells, in addition to providing an interface for the communication and interaction between the cells and their surrounding environment. According to the prevailing theories, biological membranes are constructed based on the fluid mosaic model [11]. In general terms, membranes consist of a lipid bilayer in and around which several kinds of proteins are constantly interacting (Figure 1.4).

Lipids can be of various kinds (e.g. phospholipids, saccharolipids, sphingolipids) but exhibit a common behaviour in aqueous systems: the polar heads of the lipids align towards the polar, aqueous environment, while the contact between the hydrophobic tails and the water is minimized and the tails tend to cluster together. As a consequence, the membrane is impermeable to most polar molecules and many macromolecules, such as proteins. The specific characteristics of each biological membrane (thickness, permeability, hydrophobicity) are determined by its lipid composition as well as by the type and quantity of proteins that it contains. For example, the myelin of some neurons primarily consists of lipids, whereas the plasma membranes of bacteria, mitochondria and chloroplasts have more protein than lipid.

Figure 1.4. Graphical illustration of a typical lipid bilayer, along with the different types of membrane proteins. We can see the phospholipid polar heads and the non-polar fatty acyl chains, cholesterol molecules, integral and peripheral proteins and oligosaccharide chains in the extracellular side (figure created by Mariana Ruiz Villarreal and deposited in Wikipedia).

Membrane proteins can be divided into two groups, namely the transmembrane (TM) proteins (integral proteins), which span the lipid bilayer, and the extrinsic ones, which form weak interactions with the membrane surface (peripheral proteins) or with lipids (lipid-anchored proteins). The amino acid composition of TM proteins is such that it facilitates integration into the lipid bilayer. In contrast, lipid-anchored proteins are bound through recognition of a specific pattern in their sequence by specialized enzymes. Peripheral proteins attach through non-covalent interactions to the membrane surface, similarly to the interactions that are observed in globular proteins [12].

1.3 Transmembrane proteins

TM proteins constitute the most important and the most widely studied category of membrane proteins. They typically form about 25-30% of all proteins encoded in a eukaryotic genome and carry out a series of functions crucial to the life of the cells [13]. These include cellular recognition, acting as molecular receptors, passive and active transport of substances across the membrane, signal transduction, protein secretion and enzymatic activity (Figure 1.5). They also aid in the regulation of membrane lipid composition as well as the conservation of membrane and cell shape [14]. Malfunction of TM proteins can therefore result in several kinds of diseases. For example, depression, schizophrenia and other neurological diseases are associated with mutations in genes encoding ion channels [15], whereas cystic fibrosis is the result of a mutation in the gene encoding an ABC transporter protein [16].

Figure 1.5. Some examples of TM proteins’ roles (copyright: Nature Education).

TM proteins are also of great pharmacological importance, since they are targets for drugs that modulate their functions. Nowadays, more than 50% of all prescribed small-molecule drugs target membrane proteins [17-19]. One of the pharmacologically most interesting and studied families of membrane proteins are the G-protein-Coupled Receptors (GPCRs), which are targeted by antipsychotic drugs such as olanzapine, antihistamines like loratadine, losartan to treat hypertension and many others [20]. Understanding the actual molecular structure of TM proteins opens the door for rational drug design and is thus a crucial step towards the development of new and improved drugs.


1.4 Structural classes of transmembrane proteins

Integral proteins can be broadly classified based on the secondary structure of their TM segments. There are proteins that span the membrane in the form of alpha helices (in a bundle form or not) and proteins whose TM regions are composed of beta strands in the form of anti-parallel closed barrels (Figure 1.6).

Proteins of each group possess distinct characteristics, related to the 3D-structure of their TM segments and the respective folding process that occurs in each case. Some of these characteristics reflect the biogenesis of membrane proteins and the respective membranes, as well as the features of the cell-transport machinery and the environmental limitations that are imposed by the physicochemical properties of the numerous types of lipid bilayers.

Figure 1.6. Various types of TM proteins. From left to right: (a) a protein whose polypeptide chain spans the membrane once as an alpha helix, (b) a protein which forms multiple transmembrane alpha helices and (c) a protein with several beta strands that form a channel through the membrane (copyright: Dharmesh Patel).

1.4.1 Alpha-helical transmembrane proteins

Alpha-helical TM proteins can be primarily classified based on the number and orientation of their TM segments (Figure 1.7). Type I TM proteins possess one TM helix and their N-terminus is located in the extracellular space. Type II TM proteins also have one TM helix, but their N-terminus is in the cytosol. The rest of the proteins, which contain more than one TM segment, belong to the multi-spanning class, which is further divided into subcategories that reflect not only structural but also functional similarities. For instance, GPCRs constitute a heterogeneous group of receptors [21], which show both structural (number of TM segments and topology) and functional similarities (signal transduction through heterotrimeric G-proteins).
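The classification above can be illustrated with a minimal sketch that assigns a class to a per-residue topology string. The labelling convention ('i' for cytosolic, 'o' for extracellular, 'M' for membrane-embedded residues) and the function name classify_alpha_helical are assumptions made for this example only; the sketch does not attempt the finer subdivision of the multi-spanning class.

```python
import re


def classify_alpha_helical(topology):
    """Classify a topology string by number and orientation of TM helices.

    Assumed labels (hypothetical convention for this sketch):
    'i' = cytosolic, 'o' = extracellular, 'M' = membrane-embedded.
    """
    if not topology or set(topology) - set("ioM"):
        raise ValueError("topology must be a non-empty string of i, o and M")
    # Each uninterrupted run of 'M' is counted as one TM helix.
    n_tm = len(re.findall(r"M+", topology))
    if n_tm == 0:
        return "no TM helix"
    if n_tm > 1:
        return "multi-spanning"
    if topology[0] == "M":
        raise ValueError("N-terminal side is undefined")
    # Single helix: the location of the N-terminus decides the type.
    return "type I" if topology[0] == "o" else "type II"


print(classify_alpha_helical("ooooMMMMMMiiii"))  # N-terminus outside
print(classify_alpha_helical("iiiiMMMMMMoooo"))  # N-terminus in the cytosol
```

In this sketch a signal anchor or a cleaved signal peptide is not modelled; only the final membrane-embedded helices are counted.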


Figure 1.7. Alpha helices are right-handed and are stabilized by hydrogen bonds that run almost parallel to the helix axis. These bonds are formed between the hydrogen of the imino group (-NH) of residue i and the oxygen of the carbonyl group of residue i-4. The helix has a characteristic pitch of 5.4 Å and 3.6 amino acid residues per turn (figure on the left from Wikipedia). On the right, a cartoon image of bovine rhodopsin [22] is shown (PDB ID: 1f88) (figure created by Jawahar Swaminathan and deposited in Wikipedia).

1.4.2 Beta-barrels

A beta-barrel can be defined as a beta sheet that coils and loops, forming a closed structure in the shape of a barrel (Figure 1.8). TM beta-barrel proteins are further divided into several groups, mainly based on their structural similarity (the number of TM segments and their inclination relative to the membrane plane) which, in most cases, also reflects functional similarities.

In contrast to alpha-helical membrane proteins that are abundant in virtually all cellular membranes [23], beta-barrels have only been experimentally observed in the outer membranes of Gram-negative bacteria so far [24]. Weak similarity at the sequence level and computational analyses, often accompanied by low-resolution experimental data, suggest the presence of beta-barrel proteins in the outer membranes of semi-autonomous eukaryotic organelles (mitochondria and chloroplasts) as well. These findings are in agreement with the endosymbiotic theory [25], according to which some primitive alpha-proteobacteria were the ancestors of mitochondria, whereas some primitive cyanobacteria were the ancestors of chloroplasts [26].


Figure 1.8. In the case of beta-barrels, the barrel that spans the membrane is formed by a beta-pleated sheet that coils and loops accordingly. The beta sheet is a form of secondary structure, just like the alpha helix. In this case, hydrogen bonds are formed between the carbonyl and imino groups of different polypeptide chains or between different parts of the same chain that are called beta strands (figure on the left from Wikipedia). On the right, a graphical illustration of the structure of Outer membrane protein G (OmpG - PDB ID: 2iwv) is shown (figure created by Andrei Lomize and deposited in Wikipedia) [27, 28].

1.5 Membrane targeting and insertion

The exceptionally efficient system of storing, encoding and inheriting the genetic information (nucleic acids), as well as the precision of decoding this information through protein synthesis, would not contribute much to the survival of the cell if these processes were not combined with a follow-up system of recognition, sorting and transport of the products (proteins) to the position where they are functional.

The two structural types of TM proteins are handled differently shortly after their synthesis on ribosomes (Figure 1.9). Alpha-helical TM proteins are assembled upon ribosomes that are co-translationally bound to the Sec translocon in the membrane and they move laterally from the translocon channel into the surrounding lipid bilayer [29, 30].

Beta-barrel proteins follow a different path; in both Gram-negative bacteria and mitochondria, the assembly of these proteins is facilitated by a complex of periplasmic chaperone proteins, as well as a highly conserved beta-barrel protein, called Omp85/YaeT/BamA [31], and its homolog Tob55/Sam50 in mitochondria [32, 33]. Initially, beta-barrel proteins are synthesized on cytoplasmic ribosomes with an N-terminal signal peptide that guides them across the inner membrane in a post-translational manner. A soluble cytoplasmic chaperone, which is called SecB, recognizes the precursor protein and brings it to the Sec translocon, which is located in the inner membrane [29, 30, 34]. In the next step, the SecA ATPase is responsible for their translocation through the translocon. Since the constituent beta strands are too short and not hydrophobic enough, they cannot become embedded in the inner membrane; instead, with the help of the YaeT complex, they are chaperoned through the periplasm and reach their final destination, the outer membrane. Finally, a signal peptidase cleaves off the signal peptide on the periplasmic side of the inner membrane [35-37].

Figure 1.9. Illustration of the biogenesis of an alpha-helical membrane protein (left) and a beta-barrel (right) in a Gram-negative bacterium (Escherichia coli) (figure taken from [29]).


1.6 The signal peptide

A signal peptide (SP) is a short stretch of amino acids at the N-terminus of a protein that guides protein sorting into cellular organelles or into the cell membrane. It is cleaved by a protease after the protein is inserted into the endoplasmic reticulum or the corresponding organelle [38].

SPs can be divided into three distinct regions [39]:

• The n-region, which contains on average 1-5 positively-charged residues.
• The h-region, which contains on average 7-15 hydrophobic residues.
• The c-region, which contains on average 3-7 polar, uncharged residues. The c-region also contains the cleavage site of the signal peptide, where the signal peptidase cuts it off the polypeptide chain. This region is the most conserved part of a signal peptide.

The average length of an SP is between 18 and 30 residues. There are not many clearly conserved residues, except for positions -1 and -3 relative to the cleavage site, which are dominated by small amino acids like alanine, glycine, serine, cysteine and threonine, and position -2, which usually contains an aromatic, charged or bulky polar residue [40].
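The preference for small residues at positions -3 and -1 can be turned into a simple scan for candidate cleavage sites, sketched below. The set of small residues and the 18-30 residue range are taken from the description above; the function names are hypothetical, and the check deliberately ignores the n-, h- and c-region composition, so it should be read as an illustration of the rule rather than as a signal peptide predictor.

```python
# Small residues named in the text for positions -3 and -1.
SMALL = set("AGSCT")


def satisfies_minus3_minus1(seq, sp_length):
    """True if positions -3 and -1 before the cleavage site are small.

    sp_length is the number of residues in the putative signal peptide,
    so position -1 is seq[sp_length - 1] and -3 is seq[sp_length - 3].
    """
    if sp_length < 3 or sp_length > len(seq):
        return False
    return seq[sp_length - 3] in SMALL and seq[sp_length - 1] in SMALL


def candidate_cleavage_sites(seq, min_len=18, max_len=30):
    """Return all putative SP lengths in [min_len, max_len] obeying the rule."""
    upper = min(max_len, len(seq))
    return [n for n in range(min_len, upper + 1)
            if satisfies_minus3_minus1(seq, n)]


# Toy sequence: charged n-region, hydrophobic h-region, c-region ending in A-V-A.
toy = "MKK" + "L" * 12 + "PSAVA" + "D" * 10
print(candidate_cleavage_sites(toy))
```

On real sequences such a scan typically returns several candidates, which is one reason why dedicated methods combine the rule with models of the three regions.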

Type II TM proteins contain another type of SP, the so-called signal anchor. Signal anchors have a similar composition to SPs, but they remain uncleaved because they lack a recognition site for the signal peptidase.

1.7 Transmembrane protein topology and topology prediction methods

The experimental methods that are used to determine the 3D-structure of a protein are expensive, time-consuming and require the use of monocrystals (X-ray crystallography), which are not easy to produce for several protein categories, particularly the TM ones. The most important obstacles in solving the structure of a TM protein are related to its hydrophobic nature. For example, denaturation of the TM protein with detergents results in inability to further solubilize it, which makes crystallization impossible. Recent studies have shown that the progress in the determination of the structure of TM proteins follows an exponential growth, similar to the growth observed for protein structures in general over the roughly 50 years since the first protein structure was presented [41] (Figure 1.10).

We anticipate an even greater increase in the number of TM proteins with known structure within the following years but, since the first structure of a TM protein was solved approximately 20 years later than that of the first globular one, the gap between the number of structures of TM proteins and globular ones may never be bridged.

Figure 1.10. The growth in TM protein structure determination, since the release of the first structure (figure created at Stephen White’s lab at UC Irvine and can be found at http://blanco.biomol.uci.edu/mpstruc/) [41].

Given the aforementioned difficulties, the need for automated computational tools that predict the potential structure of a TM protein with high accuracy becomes imperative. These theoretical prediction algorithms aim at creating a topology model of the protein at hand. Such models show the number and relative position of the TM segments, together with their orientation with regard to the membrane (Figure 1.11). It is noteworthy that a determined 3D-structure of an integral protein does not always inform us about the exact boundaries of the TM segments or its orientation in the membrane [42].

Over the years, many topology prediction algorithms for TM proteins have been created. The first ones used simple measurements like the hydrophobicity of the amino acids [43] in order to detect potential TM segments. Kyte and Doolittle applied a “sliding” window with which they scanned the amino acid sequence and calculated the average hydrophobicity of the amino acids included in it. Afterwards, based on the average hydrophobicity values, they set a cut-off in order to determine whether the central amino acid of the window is part of a TM segment or not. This method, along with other similar approaches, could only detect potential TM regions and did not provide any information regarding the topology of the TM segments with respect to the membrane.
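The sliding-window procedure can be sketched in a few lines. The scale values below are the published Kyte-Doolittle hydropathy indices; the window of 19 residues and the cut-off of 1.6 are commonly quoted settings for detecting membrane-spanning regions and are assumptions here, since they are not specified in the text above. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return (center_index, mean_hydropathy) for every full window."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    scores = [KD[aa] for aa in seq]
    profile = []
    for center in range(half, len(seq) - half):
        segment = scores[center - half:center + half + 1]
        profile.append((center, sum(segment) / window))
    return profile


def candidate_tm_positions(seq, window=19, cutoff=1.6):
    """Indices whose window average exceeds the cut-off (assumed value)."""
    return [c for c, mean in hydropathy_profile(seq, window) if mean > cutoff]


# Toy sequence: a 21-residue leucine stretch flanked by charged residues.
toy = "D" * 20 + "L" * 21 + "K" * 20
hits = candidate_tm_positions(toy)
print(hits[0], hits[-1])
```

As noted above, such a profile only flags potentially membrane-embedded positions; it carries no information about which side of the membrane the flanking loops are on.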

Figure 1.11. A graphical illustration of the topology model for a GPCR protein (GPER1_HUMAN) generated using the software Protter [44].

A significant improvement towards this direction was accomplished by the observation that positively-charged amino acids have a tendency to appear more frequently in the cytoplasmic loops rather than in the periplasmic ones, the so-called “positive-inside rule” [45]. This finding was implemented in the TopPred algorithm [46], in an effort to develop a more accurate method for topology prediction. Taking the “positive-inside rule” into account, it was possible to distinguish whether a region with intermediate hydrophobicity is (part of) a TM domain or a loop.
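A minimal sketch of how the “positive-inside rule” can orient a set of already predicted TM segments is given below. It simply counts lysine and arginine residues in the loops on either side of the membrane and places the more positively charged side in the cytoplasm. The half-open segment coordinates, the tie-breaking choice and the function names are assumptions of this example; it is not the TopPred procedure, which is more elaborate.

```python
def loops_between(seq, tm_segments):
    """Split seq into the non-membrane stretches flanking the TM segments.

    tm_segments is a list of half-open (start, end) index pairs, sorted
    by position (an assumed convention for this sketch).
    """
    loops, prev_end = [], 0
    for start, end in tm_segments:
        loops.append(seq[prev_end:start])
        prev_end = end
    loops.append(seq[prev_end:])
    return loops


def n_terminus_side(seq, tm_segments):
    """Return 'in' if the N-terminus is predicted cytoplasmic, else 'out'."""
    loops = loops_between(seq, tm_segments)
    positives = [sum(loop.count(aa) for aa in "KR") for loop in loops]
    # Loops alternate sides: even-indexed loops share the N-terminal side.
    same_side_as_nterm = sum(positives[0::2])
    opposite_side = sum(positives[1::2])
    # Ties are resolved arbitrarily in favour of 'in'.
    return "in" if same_side_as_nterm >= opposite_side else "out"


two_helix = "MKRK" + "L" * 20 + "DDSE" + "L" * 20 + "RRKK"
print(n_terminus_side(two_helix, [(4, 24), (28, 48)]))
```

Counting all loops equally is a simplification; long loops and the exact charge definition are choices that an actual predictor has to make explicitly.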

The MEMSAT algorithm [47] used the same information with methods of statistical optimization and, by combining dynamic programming and propensity scales, produced the best topology. In the following years, more methods were made available to the public, based on amino acid preferences and hydrophobicity [48, 49]. PHD [50] was the first method that incorporated Artificial Neural Networks (ANNs). It uses evolutionary information (in the form of Multiple Sequence Alignments, MSAs) for creating a consensus prediction for the target sequence and then finds the topology of the protein using the “positive-inside rule”. In the same context, methods that also use evolutionary information, like PRO-TMHMM, PRODIV-TMHMM and OCTOPUS, were created [51, 52].

Hidden Markov Models (HMMs), which will be discussed in more detail in chapter 2.3, were initially introduced in TMHMM [53] and HMMTOP [54], followed by other methods like HMM-TM [55] and HMMpTM [56]. Furthermore, given that signal peptides are often falsely predicted as TM segments because of their high hydrophobicity, methods that predict the topology of the protein and the presence of a signal peptide at the same time were developed (Phobius [57], Philius [58], MEMSAT-SVM [59] and SPOCTOPUS [60]). Phobius was further modified, giving birth to PolyPhobius [61], which uses MSAs. Methods like these, which contain sub-models for the prediction of both signal peptides and topology, are very important because signal peptides and N-terminal TM regions are quite similar and the hydrophobic core of a signal peptide can be wrongly assigned as a putative first TM segment or vice-versa [62].

Support Vector Machines (SVMs) and Dynamic Bayesian Networks (DBNs) have also been used, in predictors like MEMSAT-SVM [59] and Philius [58], respectively.

Finally, consensus-based approaches, like TOPCONS [63], MetaTM [64] and CCTOP [65], which combine the outputs from several predictors into a consensus output using dynamic programming, have been quite successful.
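To convey the idea of combining several predictors, the sketch below applies a per-residue majority vote to topology strings. This is a deliberately simplified stand-in and not the dynamic-programming procedure used by the consensus methods cited above; the label alphabet ('i', 'o', 'M') and the function name are assumptions of the example.

```python
from collections import Counter


def majority_vote_topology(predictions):
    """Combine equal-length topology strings by per-residue majority vote.

    Ties go to the label that appears first in the column, i.e. the label
    given by the earliest predictor in the input list.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        label, _count = Counter(column).most_common(1)[0]
        consensus.append(label)
    return "".join(consensus)


print(majority_vote_topology(["iiMMMoo", "iiiMMoo", "iMMMMoo"]))
```

A column-wise vote can produce segments that no individual predictor proposed (for example, implausibly short helices), which is one motivation for imposing a topology grammar through dynamic programming instead.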

Quite recently, a common approach that is followed by some topology prediction methods involves the incorporation of experimental information regarding the topology of several regions of the proteins (e.g. signal peptides or long extracellular loops) prior to the actual prediction (constrained predictions). Some of the most efficient algorithms that use this feature are HMM-TM, HMMTOP, Phobius, PRO/PRODIV-TMHMM, SCAMPI/SCAMPI-MSA, TMHMMfix [66] and TOPCONS.

Most benchmark studies report an average performance of about 70% correctly predicted topologies for older methods up to over 80% for newer ones, while the ability to distinguish between TM and non-TM proteins can go even higher, to almost 99% [13]. This is however not the case when it comes to proteome-wide predictions, which can be explained by the fact that the small benchmark datasets usually contain a certain amount of bias. In a recent study, Fagerberg et al. [67] tried to produce a draft of the predicted human membrane proteome using the outputs from several predictors and showed that there was an agreement in only 12% of the predicted membrane proteins among them. This indicates that the prediction accuracies obtained on small sets are not always transferable to complete proteomes.

As more and more structures of alpha-helical TM proteins became available, researchers observed some interesting features in their topology [29, 68]. More specifically, we now know that TM helices can be interrupted or bent because of proline residues (disrupted helices) [69, 70] or strongly tilted [71-73]. There are also amphipathic helices that run roughly parallel to the membrane surface, with parts of them located inside the interface region (interfacial helices) [74, 75]. Finally, membrane-penetrating regions that enter and exit the membrane on the same side (re-entrant loops) have been discovered [76, 77]. Such loops are found in TM proteins that act as channels or transporters and they usually have a functional role in the protein structure [78-80]. It is actually estimated that ~10% of all TM proteins encoded in a genome contain such loops [81].

Furthermore, there are TM proteins that do not have a single topology and do not follow the “positive-inside rule”. These proteins are termed dual-topology proteins and can adopt either of the two possible orientations [82]. Another interesting finding is that closely related proteins such as RnfA and RnfE [83] or YdgE and YdgF [84] from E. coli adopt opposite orientations in the membrane, although they have the same number of TM segments and follow the “positive-inside rule”.

Lastly, TM proteins often have dynamically changing topologies, meaning that, besides being TM, they can be found in the cytoplasm and can also be secreted into the extracellular space. A typical example is the prion protein, which can be found with one TM segment in both possible orientations, but also as a water-soluble protein in the cytoplasm [85].

Regarding beta-barrel proteins, there are numerous methods that aim at topology prediction. These include methods based on hydrophobicity analysis [86], statistical preferences of amino acids [87], remote homology detection [88], HMMs [89-93], feed-forward ANNs [94, 95] and radial basis function ANNs [96].

There are also methods that deal specifically with the problem of identifying beta-barrels in proteome-wide analyses. These include BetAware [92], BOMP [97], the Freeman–Wimley beta-Barrel Analyser [98], HHomp [88], PSORTb [99], SSEA-OMP [100], TMB-Hunt [101], SOSUIgramN [102] and TMBETADISC-RBF [96].


1.8 Multiple Sequence Alignments and their applications in topology prediction

One of the major challenges of computational sequence analysis is to predict the function and structure of proteins from their sequence alone. This is possible since organisms evolve by mutation, duplication and selection of their genes. Thus, sequence similarity often indicates functional and structural similarity. MSAs are often more useful than single sequences, since, in each position of the alignment, we can observe the (possible) conservation and identify important (conserved) sequence positions/regions. In most cases, we need to convert the raw MSAs into matrices that include a score for the occurrence of each residue at each position, the so-called Position-Specific Scoring Matrices (PSSMs). The evolutionary information included in a MSA has been shown to improve the accuracy of many bioinformatics methods tasked with the prediction of membrane protein topology [51, 103], secondary and tertiary structure [10, 104-106], protein-protein interactions [107-109], distant residue contacts [110], coiled-coils [111], protein disorder [112, 113], subcellular localization [114-117], protein fold classification [118], RNA-binding sites [119-121], residue-residue interactions [122] or even drug-target identification [123].

The most widely used tool that creates a PSSM profile is PSI-BLAST [124]. PSI-BLAST is an extension to the BLAST algorithm [125] and has proven extremely successful in retrieving distantly homologous sequences. The method works as follows: initially, a BLAST search is performed using the BLOSUM-62 substitution matrix [126] with a single query sequence. All hits from this search that have an E-value lower than a user-defined threshold are kept and are considered significant matches (real homologs). At the next step, a PSSM is constructed from the multiple alignment of these hits with the query sequence. This PSSM has the same length as the query sequence, since gaps in the alignment are not considered. In order to create the PSSM, new scores for each position in the alignment must be calculated. Highly conserved positions receive high scores while weakly conserved positions receive low scores.
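The construction of a PSSM from an alignment can be illustrated with a minimal sketch. It is not PSI-BLAST's actual scoring scheme: it assumes a uniform background distribution, a fixed pseudocount and no sequence weighting, and the function name `build_pssm` and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, background=None, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per ungapped query position.

    alignment[0] is the query. Columns where the query has a gap are skipped,
    so the PSSM has the same length as the query sequence.
    """
    if background is None:
        # Assumption: uniform background; real tools use database frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    query = alignment[0]
    pssm = []
    for col, query_residue in enumerate(query):
        if query_residue == "-":
            continue
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in alignment if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            row[aa] = math.log2(freq / background[aa])
        pssm.append(row)
    return pssm

# Toy alignment (made up); the first sequence is the query.
msa = ["MKL-V", "MKLAV", "MRL-I", "MKI-V"]
pssm = build_pssm(msa)
```

In this toy example the fully conserved first column gives methionine a positive score and every other residue a negative one, which is the behaviour described above for highly conserved positions.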

With PSI-BLAST, one can perform iterative searches, using the results from each search in order to update the PSSM that will be used in a subsequent search, until no new sequences can be found (convergence) or a specified number of rounds have been performed. The method is very efficient in picking up a large number of homologous proteins, which are too dissimilar to be detected with a standard BLAST search (distant homologs). More recently, other methods for this purpose have been presented. These include DELTA-BLAST [127], CS-BLAST [128] and CaBLAST [129]; still, generating PSSMs with PSI-BLAST remains the most popular way and the PSI-BLAST-derived profiles are incorporated in most of the bioinformatics methods that employ evolutionary information.


One of the main limitations of bioinformatics software for TM protein topology prediction is the time needed to construct the PSSM. If we also take into account that the most accurate predictions are obtained when evolutionary information (i.e. a PSSM profile) is exploited [51], then the problem is clear: performing large-scale analyses, that is, scanning whole proteomes, becomes a cumbersome task.


2. Materials and Methods

In this chapter I will briefly introduce some of the most important methodologies, tools and databases that have been used in the projects of the thesis.

2.1 Algorithms

Machine learning is the field that “gives computers the ability to learn without being explicitly programmed”, according to Arthur Samuel. Tom Mitchell gave a more formal definition in which he says that “a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks T, as measured by P, improves with experience E”. For this reason, we need to construct appropriate probabilistic models that are based on features learnt from the training data. In recent years, the need to solve more and more complicated biological problems has led to the exploitation of sophisticated mathematical and computational models. Among the most commonly used machine learning methods are HMMs, which were also used in this thesis.

2.2 Markov Models

The concept of the Markov chain is used in the development of models that describe DNA or protein sequences. A Markov chain takes a series of events and assumes that each event depends only on the previous one. If this notion is extended to the 2, 3, …, n previous events, we refer to Markov chains of 2nd, 3rd, …, nth order. In the case of biological sequences, the states are the symbols of the sequence, which belong to a finite alphabet (four nucleotides in the case of DNA or 20 amino acids in the case of proteins).


If we denote an amino acid sequence of length L by x, such that:

$$x = x_1, x_2, \ldots, x_{L-1}, x_L$$

and consider the amino acid distribution in each position $i$ along the sequence as a random variable, then we can define the Markov chain as a stochastic process with the “Markov property”. In this case, the process consists of the sequence $x$ of random variables, which take values in a “state space” that is defined only by the specific alphabet (the alphabet of amino acids, for example). Here, the value $x_i$ denotes the state in which the system is at time $i$.
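As a hedged illustration of the definition above, the following sketch estimates the transition probabilities of a 1st order Markov chain from example sequences and scores a new sequence with them. The function names, the pseudocount and the toy DNA sequences are assumptions made only for this example.

```python
import math

def train_markov_chain(sequences, alphabet, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in alphabet}
    return probs

def log_probability(seq, transitions, initial):
    """log P(x) = log P(x_1) + sum over i of log P(x_i | x_{i-1})."""
    logp = math.log(initial[seq[0]])
    for prev, curr in zip(seq, seq[1:]):
        logp += math.log(transitions[prev][curr])
    return logp

alphabet = "ACGT"
# Toy training sequences (made up for illustration).
transitions = train_markov_chain(["ACGTACGT", "AACCGGTT"], alphabet)
# Assumption: uniform distribution over the first symbol.
initial = {a: 1.0 / len(alphabet) for a in alphabet}

score_seen = log_probability("ACGT", transitions, initial)
score_unseen = log_probability("TGCA", transitions, initial)
```

A sequence made of transitions that are frequent in the training data ("ACGT") receives a higher log-probability than one made of rare transitions ("TGCA").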

2.3 Hidden Markov Models

A HMM is a stochastic linear model that can be viewed as an extension of Markov chains. It consists of a set of hidden states, a set of observed symbols and two sets of probabilities, the transition probabilities and the emission probabilities.

Consider a protein sequence x with L residues:

$$x = x_1, x_2, \ldots, x_{L-1}, x_L$$

where $x_i$ are the observations (one of the 20 amino acids). In a HMM, the observations are detached from the states; in other words, the states are now “hidden” from the observer. The states are modeled by the probability with which one state gives a transition to the next, denoted as:

$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$$

Here, $k$ and $l$ are two states that are connected via the transition probability $a_{kl}$, which denotes the probability of a transition from state $k$ to state $l$. The states form a 1st order Markov chain. The connection between the observed symbols and the states is established through the emission probabilities:

$$e_k(b) = P(x_i = b \mid \pi_i = k)$$

$e_k(b)$ describes the probability of emitting a particular symbol $b$ (an amino acid in the case of proteins) at position $i$ of the sequence, given that the system is in state $k$. The major difference between a HMM and a simple Markov Model is that, in a HMM, there is no one-to-one correspondence between symbols and states of the model; thus, by looking at a symbol, we cannot simply determine the state which the model visited in order to give this specific result.

Figure 2.1. A graphical illustration of the Independence Model, the Markov Chain Model (1st order Markov Model) and the Hidden Markov Model. The arrows denote the allowed transitions. In a HMM, the Markov property is valid for the states (πi) but not the observations (xi) (figure created by Pantelis Bagos).

Rabiner [130] described the three basic problems of HMMs:

1) Given a model θ, how can we calculate the total likelihood that a sequence x has originated from the HMM, in other words, how can we compute the probability P(x|θ)?

2) Given a model θ and a sequence x, how can we find the sequence of states that is most probable to have generated the observed sequence?

3) How can we train the model in order to find the parameters that maximize the likelihood?

For the first question, the solution is provided by the Forward algorithm [130, 131], a dynamic programming algorithm that sums over all possible paths in sequential steps. For a given model, we can thus calculate the total probability of a sequence even if the actual path is unknown, and we can compare two or more sequences based on their likelihood.
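A minimal sketch of the Forward recursion is given below. The two-state model, its parameters and the two-letter alphabet ("h" for hydrophobic, "p" for polar) are hypothetical and chosen only to keep the example small; the sketch has no explicit end state and works with plain probabilities, whereas long sequences would require scaling or log-space arithmetic.

```python
def forward(obs, states, start, trans, emit):
    """Return P(obs | model) by summing over all state paths."""
    # f[k] = probability of the prefix seen so far, with the path ending in k
    f = {k: start[k] * emit[k][obs[0]] for k in states}
    for symbol in obs[1:]:
        f = {
            l: emit[l][symbol] * sum(f[k] * trans[k][l] for k in states)
            for l in states
        }
    return sum(f.values())

# Hypothetical toy model: "M" = membrane-like state, "L" = loop-like state.
states = ("M", "L")
start = {"M": 0.5, "L": 0.5}
trans = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
emit = {"M": {"h": 0.8, "p": 0.2}, "L": {"h": 0.3, "p": 0.7}}

likelihood = forward("hhpp", states, start, trans, emit)
```

Because each step only needs the values of the previous position, the cost grows linearly with the sequence length instead of exponentially with the number of possible paths.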


For the second question, we use the Viterbi algorithm [130, 131], which finds the most probable sequence of states that has emitted a sequence of observations using successive maximizations.
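The successive maximizations can be sketched as follows, again on a hypothetical two-state toy model ("M" for a membrane-like state, "L" for a loop-like state) with a two-letter alphabet. The sketch works in log space to avoid numerical underflow and assumes that all probabilities are non-zero.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for obs (log-space Viterbi)."""
    v = {k: math.log(start[k]) + math.log(emit[k][obs[0]]) for k in states}
    backpointers = []
    for symbol in obs[1:]:
        new_v, ptr = {}, {}
        for l in states:
            # Best predecessor state for a path that ends in state l
            best = max(states, key=lambda k: v[k] + math.log(trans[k][l]))
            ptr[l] = best
            new_v[l] = v[best] + math.log(trans[best][l]) + math.log(emit[l][symbol])
        backpointers.append(ptr)
        v = new_v
    # Trace back from the best final state
    path = [max(states, key=lambda k: v[k])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

# Hypothetical toy model; state names are single characters.
states = ("M", "L")
start = {"M": 0.5, "L": 0.5}
trans = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
emit = {"M": {"h": 0.8, "p": 0.2}, "L": {"h": 0.3, "p": 0.7}}

best_path = viterbi("hhhhpppp", states, start, trans, emit)  # "MMMMLLLL"
```

The only difference from the Forward recursion is that the sum over predecessor states is replaced by a maximum, with backpointers stored so that the path can be recovered.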

The Baum-Welch (Forward-Backward) algorithm [130, 131] is used in order to answer the third problem and is a special case of the Expectation-Maximization (EM) algorithm [132]. This algorithm calculates, in each step, the expected values of the parameters, sets them as the current values and repeats the process until convergence.
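For reference, one iteration of this procedure can be written out for a single training sequence. The notation is introduced here only for illustration: $f_k(i)$ denotes the forward variable (the probability of $x_1, \ldots, x_i$ with the path ending in state $k$) and $\beta_k(i)$ the backward variable (the probability of $x_{i+1}, \ldots, x_L$ given state $k$ at position $i$).

```latex
% Expectation step: expected transition and emission counts
A_{kl} = \frac{1}{P(x \mid \theta)} \sum_{i=1}^{L-1}
         f_k(i)\, a_{kl}\, e_l(x_{i+1})\, \beta_l(i+1)
\qquad
E_k(b) = \frac{1}{P(x \mid \theta)} \sum_{i \,:\, x_i = b} f_k(i)\, \beta_k(i)

% Maximization step: normalise the expected counts
a_{kl} \leftarrow \frac{A_{kl}}{\sum_{l'} A_{kl'}}
\qquad
e_k(b) \leftarrow \frac{E_k(b)}{\sum_{b'} E_k(b')}
```

With several training sequences, the expected counts are summed over all sequences before the normalisation step.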

HMMs were initially introduced in speech recognition problems [130]. They quickly became very popular in computational biology and have been applied to problems like gene finding [133], construction of MSAs [134], remote homology detection [135], signal peptide prediction [136, 137], secondary structure prediction [138] and protein topology prediction [13].

A very interesting aspect of HMMs in bioinformatics is that, apart from their efficiency, many algorithmic ideas and improvements on them have come about in an effort to solve biological problems. A well-known example is the profile HMM, which I will discuss in section 2.4.

2.3.1 Hidden Markov Models in membrane protein topology prediction

HMMs have been employed in membrane protein topology prediction for many years. There are several reasons why HMMs are found to be superior to other machine learning techniques for this task. First of all, a HMM can capture the differences in composition between a hydrophobic TM helix and other regions of a protein. Moreover, unlike, for example, the traditional sliding-window approaches of ANNs, HMMs can model features like the length distribution of TM helices. Lastly, it is easy to separate the “inside” (e.g. cytoplasm) from the “outside” (e.g. extracellular space) using the membrane helix, and the model can be constructed in such a way that obvious mistakes (e.g. a path of inside → helix → inside) are avoided. In other words, HMMs can capture the “grammar” of the biological problem [139]. In Figure 2.2, the HMM of one of the best-known topology prediction methods for alpha-helical proteins, TMHMM [13], is shown. This HMM consists of three different main locations (core, cap, loop) and seven different states (cytoplasmic loop, cytoplasmic cap, helix core, non-cytoplasmic cap, short non-cytoplasmic loop, long non-cytoplasmic loop and globular domain).
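The “grammar” idea can be illustrated with a small check on a per-residue label string. In a HMM the same constraints are encoded in the model architecture (forbidden transitions have zero probability), so the sketch below is only an external illustration of the rules, not part of any predictor. The labels (`i`, `o`, `M`), the function name and the helix length bounds are assumptions made for this example.

```python
import re

def is_valid_topology(labels, min_helix=15, max_helix=35):
    """Check that a per-residue label string obeys the topology 'grammar'.

    Labels: 'i' = inside (cytoplasmic), 'o' = outside, 'M' = membrane helix.
    Every helix must connect an inside region to an outside region (or vice
    versa) and its length must lie within [min_helix, max_helix].
    """
    segments = [(m.group(0)[0], len(m.group(0)))
                for m in re.finditer(r"i+|o+|M+", labels)]
    if sum(length for _, length in segments) != len(labels):
        return False  # the string contains unknown label characters
    for idx, (label, length) in enumerate(segments):
        if label != "M":
            continue
        if not min_helix <= length <= max_helix:
            return False
        if 0 < idx < len(segments) - 1:
            # e.g. inside -> helix -> inside is not allowed
            if segments[idx - 1][0] == segments[idx + 1][0]:
                return False
    # The side cannot change without crossing the membrane
    for (a, _), (b, _) in zip(segments, segments[1:]):
        if a != "M" and b != "M":
            return False
    return True

ok = is_valid_topology("i" * 10 + "M" * 20 + "o" * 10)
bad = is_valid_topology("i" * 10 + "M" * 20 + "i" * 10)
```

Here `ok` is true because the helix connects the inside to the outside, whereas `bad` is false because the path returns to the inside after the helix.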

Another example, regarding beta-barrel proteins, can be found in one of the earliest methods presented, PRED-TMBB [89]. Here, the model consists of 61 states. The architecture was chosen so that it would incorporate the features that the known (at the time) structures shared.


Figure 2.2. A graphical illustration of the HMM of TMHMM (adapted from [13]).

The PRED-TMBB model has three "sub-models", which correspond to the TM region (M), the periplasmic region (I) and the extracellular region (O). The TM-model contains states that model the special architecture of the TM strands; these include states that correspond to the “aromatic belt” and also to the core of the strand. Other states correspond to the residues of the external side of the barrel and the residues of the barrel interior which create the hydrophilic pore. Further, the length of the TM strands can vary between 7 and 17 residues, since these were the observed minimum and maximum values at the time of creation. The external and internal regions are modelled by a “ladder” architecture, which also allows for variability in length (Fig. 2.3).

Figure 2.3. A graphical illustration of the HMM used in PRED-TMBB [89].


2.4 Profile Hidden Markov Models

A profile Hidden Markov Model (pHMM) is essentially a HMM which accurately describes a MSA [135]. The major difference between a profile and a typical HMM is that, in the pHMM, each state describes a particular position (column) of the alignment. As a consequence, this model has position-specific parameters and the transitions must always be unidirectional (Figure 2.4). For that reason, pHMMs are called left-to-right models, as opposed to the circular ones that allow the model to visit a state more than once.

Figure 2.4. Graphical illustration of a typical pHMM. Match states (Mk) are represented with squares, insertion states (Ik) with diamonds and deletion states (Dk) with circles. The states are connected by the corresponding transitions, which are represented with arrows (figure adapted from [131]).

Match and insertion states are normal states that are connected to the corresponding symbols via the emission probabilities (Figure 2.4). Match states correspond to the columns of the alignment that are aligned well, which means that they correspond to a region of similarity; analogously, insertion states correspond to regions where we observe insertions of characters (poorly aligned regions). In the latter case, the regions that are missing from the rest of the sequences in the alignment appear as gaps, which are modeled via the silent deletion states.
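How the columns of an alignment map onto match, insertion and deletion states can be sketched as follows. The rule used here (a column becomes a match state if fewer than half of the sequences have a gap in it) is a common heuristic that is assumed for the example, and the function names and the toy alignment are hypothetical; this is not a description of how HMMER builds its models.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of the alignment columns that are treated as match states."""
    n = len(alignment)
    return [
        col for col in range(len(alignment[0]))
        if sum(seq[col] == "-" for seq in alignment) / n < max_gap_fraction
    ]

def state_path(aligned_seq, match_cols):
    """Translate one aligned sequence into a string of M/I/D states."""
    match_set = set(match_cols)
    path = []
    for col, residue in enumerate(aligned_seq):
        if col in match_set:
            # A gap in a match column is a (silent) deletion state
            path.append("D" if residue == "-" else "M")
        elif residue != "-":
            # A residue in a non-match column is emitted by an insertion state
            path.append("I")
    return "".join(path)

# Toy alignment (made up); column 3 consists mostly of gaps.
msa = ["MKL-V", "MKLAV", "M-L-I", "MKI-V"]
cols = match_columns(msa)
paths = [state_path(seq, cols) for seq in msa]
```

For this alignment the second sequence passes through an insertion state (its extra "A") and the third through a deletion state (its missing second residue), while the others use match states only.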

The applications of pHMMs are numerous; the most important in bioinformatics are the creation of MSAs [135] and the construction of models for remote homology detection [140]. Such models are also employed in protein family classification (e.g. in the PFAM database [141]). The most popular software suite for the construction of pHMMs is HMMER [142].


2.5 Databases

2.5.1 UniProt

UniProtKB (UniProt Knowledgebase) [5] constitutes the primary database of protein sequences worldwide. It comprises two parts, namely UniProtKB/SwissProt, which contains the well-annotated proteins, and UniProtKB/TrEMBL, which contains the protein sequences that have originated from automated translation of genome sequences. Sequences in UniProtKB/SwissProt have undergone some kind of manual inspection and they are accompanied by literature references, secondary structure elements, cross-references to other biological databases as well as information about their function (if available). UniProtKB/TrEMBL contains proteins that have not been manually curated and, periodically, sequences from TrEMBL are “moved” to SwissProt, when the curators change their annotation based on literature findings and reliable automated computational tools.

2.5.2 PFAM

Proteins generally contain one or more distinct functional regions (domains), which are in many cases also structurally independent. These domains are considered to be able to function and to evolve independently of the rest of the protein. Different combinations of such regions lead to the large variety of proteins in nature. Consequently, being able to detect these domains is an important step towards the functional classification of a protein. And, since structures are more conserved than sequences, databases that contain conserved domains are also important in order to easily identify and classify novel protein sequences, as well as to identify a possible novel protein fold. A key representative of these databases is PFAM [143], which is extensively used in our studies.

PFAM is a large collection of protein families, each of which is characterized by a unique HMM. This approach is much more sensitive for remote homology detection, without a loss of speed or efficiency. The database contains more than 16,000 families (version 30), providing coverage of ~80% of all proteins in UniProt. The key feature that makes PFAM so useful is that, with the use of the HMMs (and the HMMER package that is incorporated in the database), the curators can select a cut-off in such a way that each protein will only belong to one family. However, lower levels of similarity can still be found between proteins that belong to different families, and, because of that, the database includes a higher level of organization, the clan level.


2.5.3 OMPdb

A more specialized protein database that is used in our beta-barrel study is OMPdb [144]. OMPdb is the largest, most complete and well characterized collection of OMPs from Gram-negative bacteria. The database consists of sequence data, as well as annotation for structural characteristics (such as the TM segments), literature references and links to other public databases, features that are unique worldwide. OMPdb contains two types of entries, namely protein and family entries. Protein entries include extensive information such as protein description and classification, sequence, organism name and taxonomy, as well as links to other databases and annotation for the TM segments and signal peptides. The annotation of TM segments is deduced using information from proteins with known 3D-structure, combined with predictions when experimental evidence is not available. All proteins are classified into families based on their function and their sequence similarity. Each family (family entry) is extensively described and the information provided includes the function of its protein members, literature references, a list of proteins with 3D-structure (if available), as well as the seed and full protein alignments.

2.5.4 PDBTM

For the annotation of TM regions in both alpha-helical and beta-barrel proteins, we relied on PDBTM [145]. This database is built upon the TMDET algorithm [146], which can find the most probable orientation of a TM protein with respect to the membrane, based on the geometric coordinates of the respective PDB file. PDBTM contains TM proteins from PDB; therefore all proteins found in PDBTM have a known 3D-structure. The database is updated regularly by checking the PDB database for new structures using the TMDET algorithm.

2.6 Reliability measures

Several ways have been proposed to assess the performance and reliability of the predictions that are derived by a topology prediction method. When evaluating a topology prediction at the protein level, we can count a prediction as successful if the number of TM segments is correct; alternatively, we can require that the overall topology is correct.

At the residue level, the performance of a method can be measured using the total fraction of residues that has been correctly predicted, where the prediction is regarded in a three-state mode (Q3 metric).
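As a small illustration of these criteria, the sketch below computes the per-residue Q3 value and counts TM segments for the looser protein-level criterion. The label alphabet (`i`, `M`, `o`), the function names and the toy label strings are assumptions made for this example.

```python
import re

def q3(predicted, observed):
    """Fraction of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

def count_tm_segments(labels):
    """Number of contiguous runs of the membrane label 'M'."""
    return len(re.findall(r"M+", labels))

# Toy example (made up): the predicted helix starts one residue too early.
observed = "iiiiMMMMMMoooo"
predicted = "iiiMMMMMMMoooo"

score = q3(predicted, observed)  # 13 of 14 residues agree
same_number_of_segments = count_tm_segments(predicted) == count_tm_segments(observed)
```

The example shows why the two levels are reported separately: a prediction can have the correct number of TM segments while still mislabelling individual residues at the helix ends.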


Another useful measure is the Segments Overlap (SOV), which is considered to be the most reliable measure for secondary structure prediction and takes continuous values between 0 and 1 [147].

For discrimination purposes (i.e. whether a protein is a beta-barrel or not), a widely used criterion is the Matthews correlation coefficient (MCC) [148], denoted as:

$$\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$

We also include measures like sensitivity (which measures the proportion of positives that are correctly identified as such) and specificity (which measures the proportion of negatives that are correctly identified as such):

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{FP + TN}$$

In all the afore-mentioned equations, we denote as:

1. TP (True Positives): the number of TM residues (or proteins in case of MCC) correctly identified as such

2. TN (True Negatives): the number of non-TM residues (or proteins) correctly identified as such

3. FN (False Negatives): the number of TM residues (or proteins) erroneously identified as non-TM

4. FP (False Positives): the number of non-TM residues (or proteins) erroneously identified as TM
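The three measures defined above can be computed directly from the four counts, as in the following sketch. Returning 0.0 when a denominator is zero is a convention assumed here, and the function name and the counts in the example are made up.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and MCC from the four confusion counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # Assumption: MCC is reported as 0.0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}

# Toy counts (made up for illustration).
metrics = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
```

Unlike sensitivity and specificity, the MCC combines all four counts into a single value between -1 and 1, which makes it convenient for comparing discrimination performance on unbalanced datasets.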

Additionally, we used the measure of reliability of the prediction (reliability score – S), which is calculated using the following formula [66]:

$$S = \frac{p_1(\mathrm{label}) + p_2(\mathrm{label}) + \cdots + p_N(\mathrm{label})}{N}$$

where $N$ denotes the length of the amino acid sequence and $p_i(\mathrm{label})$ the posterior probability of the label assigned to residue $i$, which is one of the three possible states, i.e. cytoplasmic, periplasmic (extracellular) or transmembrane. It has been shown that an increase in the reliability score (S) of the prediction is accompanied by an increase in the overall prediction accuracy [66].
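A minimal sketch of this calculation is shown below. It assumes that the posterior probabilities are available as one dictionary per residue and that the score averages the posterior of the label assigned to each residue; the function name, the data layout and the numbers are hypothetical.

```python
def reliability_score(posteriors, labels):
    """Average posterior probability of the assigned label over all residues.

    posteriors: one dict per residue, mapping each label to its posterior.
    labels: the predicted label string, of the same length as posteriors.
    """
    if len(posteriors) != len(labels) or not labels:
        raise ValueError("posteriors and labels must be non-empty and of equal length")
    return sum(p[label] for p, label in zip(posteriors, labels)) / len(labels)

# Toy posteriors for a three-residue sequence (made up for illustration).
posteriors = [
    {"i": 0.9, "M": 0.1, "o": 0.0},
    {"i": 0.6, "M": 0.4, "o": 0.0},
    {"i": 0.2, "M": 0.8, "o": 0.0},
]
s = reliability_score(posteriors, "iiM")  # (0.9 + 0.6 + 0.8) / 3
```

A value of S close to 1 means that the model is confident about the label at almost every position.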


3. Results and Discussion

In this section I will briefly summarize the most important findings of each project that is included in the thesis.

3.1 Developing effective strategies for the prediction of alpha-helical membrane proteins at genome scale (paper I)

The first project focuses on studying a number of widely used topology prediction algorithms on traditional benchmark sets as well as on novel, larger datasets in order to examine how they would perform if applied to whole proteomes.

Here, we test 18 topology prediction methods on alpha-helical membrane proteins on several datasets. The predictors differ in terms of the algorithms and technical implementation and they can be grouped into three classes, namely the ones that incorporate a separate module for predicting signal peptides, those that make use of MSAs and, finally, those with which we can perform constrained predictions (by introducing prior topological information in the prediction process). The benchmark datasets are of various origins: one contains information on the full topology, two are associated with the location of the C-termini of the proteins they include and the fourth provides experimental evidence on the location of a number of glycosylation sites. The fifth and final dataset contains GPCR proteins, which are known to have seven TM regions and their N-termini on the outside.

Out of the five benchmark datasets, the “glycosylation” set had not been used in such benchmark studies before. However, we showed that if we just rely on the number of glycosylation sites that are correctly predicted as “outside”, there is strong bias in the results and, consequently, methods that predict more positions of the sequence to be located in the extracellular space climb up in the ranking. To tackle this, we followed a better-than-random approach, which we use in the ranking of prediction methods. In this dataset, the results of the evaluation are not only worse than in the other datasets, but also show large variations. This is a strong indication that previous reports of topology prediction accuracies up to 80% do not hold for larger (proteome-wide) datasets.


Furthermore, using the performance on all datasets, a ranking scheme for the prediction methods was formulated, measured by Z-scores. TOPCONS was found to be the best method, while methods that utilize MSAs show up at the top of the ranking in all datasets.
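The general idea of a Z-score based ranking can be sketched as follows. This is not the exact scheme used in paper I: it simply standardises the scores within each dataset and averages the resulting Z-scores per method, and the method names, dataset names and accuracy values are made up.

```python
from statistics import mean, pstdev

def rank_by_z_score(accuracy):
    """accuracy: {dataset: {method: score}} -> [(method, mean Z-score), ...]."""
    z_scores = {}
    for scores in accuracy.values():
        values = list(scores.values())
        mu, sigma = mean(values), pstdev(values)
        for method, value in scores.items():
            # Assumption: Z = 0 for every method if all scores are identical.
            z = (value - mu) / sigma if sigma else 0.0
            z_scores.setdefault(method, []).append(z)
    averaged = {method: mean(zs) for method, zs in z_scores.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical accuracies of three methods on two datasets.
accuracy = {
    "dataset_1": {"method_A": 0.80, "method_B": 0.70, "method_C": 0.60},
    "dataset_2": {"method_A": 0.75, "method_B": 0.80, "method_C": 0.55},
}
ranking = rank_by_z_score(accuracy)
```

Standardising within each dataset prevents a dataset on which all methods score low (or show a large spread) from dominating the overall ranking.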

Constrained predictions were also tested in our study. Using the results from an independent signal peptide predictor (SignalP 4.0 [149]) as input for constrained topology predictions, we do not observe any significant changes in the performance. However, methods that implement their own signal peptide detection module perform better.

Another crucial point to address is the ability to distinguish between non-TM (globular) and TM proteins, especially when it comes to proteome-wide analyses. We tested all methods on cytoplasmic and extracellular proteins (non-TM datasets) and found that the “Phobius” group of predictors (Phobius, PolyPhobius and Philius) separates the two classes particularly well. Moreover, these methods outperform the other methods that allow for constrained predictions.

In conclusion, for whole-genome identification and topology prediction of membrane proteins, we suggest that PolyPhobius can be used for filtering out non-membrane proteins (alternatively TOPCONS in combination with constraints from SignalP 4.0) and TOPCONS, which shows the best overall performance on the benchmark datasets, can be used for topology prediction of the (remaining) predicted membrane proteins.

3.2 Improved topology prediction and discrimination of alpha-helical membrane proteins with TOPCONS2 (paper II)

This project involves a major update of the TOPCONS [63] webserver, one of the most accurate topology prediction methods for alpha-helical membrane proteins. Here, we addressed several drawbacks of the old implementation and we now achieve improved performance both on membrane protein topology predictions and on whole-proteome scanning benchmarks. Using four benchmark datasets (TM proteins, TM proteins with signal peptide, globular proteins and globular proteins with signal peptide), we observe that the updated version of TOPCONS [150] is on average 4% better than the next best-scoring method (Philius). Philius, however, is much worse in membrane protein topology predictions and thus not ideal for such use.

Overall, the performance of TOPCONS reaches 87%, while Philius and Phobius reach 83% and 82%, respectively. Regarding membrane protein topology prediction, TOPCONS again ranks first on the benchmark dataset of ~300 TM proteins, with 80% correct topologies, while methods like the old implementation of TOPCONS (79%), SCAMPI (79%) and MEMSAT3 (74%) are found in the next few places. However, none of these other methods can predict signal peptides, and they are thus not suitable for large-scale analyses.

3.3 Creating PSI-BLAST and jackhmmer profiles accurately and fast using PRODRES (paper III)

In this project, we present the stand-alone version of our method to create PSI-BLAST [124] or jackhmmer profiles in a fast way, without simultaneous loss of the information included in them. The idea for this project came about in an effort to overcome the problem that many bioinformatics applications face, which is the time that is spent on the creation of PSSMs with PSI-BLAST. This becomes a bottleneck especially for large-scale analyses.

PRODRES is essentially built around PFAM and its family classification. The fact that the PFAM coverage of UniProt sequences continues to grow with each PFAM update can prove useful for avoiding searches for homologs in large databases.

We compared the PSSMs we obtain by using PRODRES on a small database to the ones that are obtained by running PSI-BLAST on UniProt. We found that the two approaches give very similar results on a benchmark dataset of 1,000 randomly selected proteins (the Pearson correlation coefficient is above 0.9). These results are much better when compared to other ways of obtaining a PSSM. The PRODRES approach was implemented during the update of TOPCONS and has made it possible to scan even whole proteomes in a reasonable time, since the creation of the profile now takes much less time, even with a much larger database.
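One way to compare two PSSMs of the same protein is to compute the Pearson correlation over all of their cells, as in the sketch below. Whether the comparison in paper III was done per cell, per column or in another way is not stated here, so this layout is an assumption, as are the function names and the toy matrices.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

def pssm_correlation(pssm_a, pssm_b):
    """Pearson correlation over all cells of two equally shaped PSSMs."""
    flat_a = [value for row in pssm_a for value in row]
    flat_b = [value for row in pssm_b for value in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("the two PSSMs must have the same shape")
    return pearson(flat_a, flat_b)

# Toy matrices (made up): pssm_b is a rescaled copy of pssm_a.
pssm_a = [[1, -2, 3], [0, 4, -1]]
pssm_b = [[2 * value + 1 for value in row] for row in pssm_a]
r = pssm_correlation(pssm_a, pssm_b)
```

Because the Pearson coefficient is invariant to linear rescaling, two profiles that differ only in scale and offset still correlate perfectly.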

3.4 Improved topology prediction and discrimination of beta-barrel membrane proteins with PRED-TMBB2 (paper IV)

In this project, the focus was shifted towards the less studied group of beta-barrel proteins. We sought to update one of the first and most successful topology prediction algorithms for these proteins, PRED-TMBB [151]. PRED-TMBB was presented for the first time in 2004 and is one of the most cited methods regarding the topology prediction and detection of beta-barrel outer membrane proteins. PRED-TMBB2 [152] contains several new features that improve its performance significantly. The major difference, apart from a larger training dataset and new decoding algorithms, is the incorporation of evolutionary information in the form of MSAs, which drastically improves the topology prediction capability and makes it capable of achieving higher performance compared to all other available methods.


At the same time, the single-sequence version of PRED-TMBB2 manages to perform better than almost all other methods regarding detection of beta-barrel proteins in large (proteome-wide) datasets, outperforming even methods that use MSAs and are much slower. The combination of a single-sequence and a multiple-sequence version in PRED-TMBB2 is unique among the machine-learning methods of its kind.


Summary (translated from Swedish)

Membrane proteins are an important part of the cell because they are associated with a multitude of very important biological functions that are crucial for its survival. They are involved in cellular recognition and adhesion, they act as molecular receptors, they transport substrates across membranes and they carry out specialized enzymatic activity.

This thesis focuses on integral membrane proteins, the vast majority of which contain transmembrane segments that form an alpha helix and are composed mainly of hydrophobic amino acids that span the lipid bilayer. A more specialized and less well studied case is that of the integral membrane proteins found in the outer membrane of Gram-negative bacteria and (presumably) in the outer membrane of mitochondria and chloroplasts; the transmembrane segments of these proteins are formed by amphipathic beta strands that together form a closed cylinder (beta-barrels).

The importance of transmembrane proteins, together with the inherent difficulties in crystallizing them and obtaining their three-dimensional structure, dictates the need to develop computational algorithms and tools that enable reliable and fast prediction of their structural and functional properties. To shed further light on their function, we need knowledge of their structure and topology with respect to the membrane. Therefore, a large number of computational methods have been developed to predict transmembrane segments and the overall topology of transmembrane proteins.

In this thesis, we first perform a large-scale benchmark of many different topology prediction tools in order to devise a strategy that enables better detection of alpha-helical membrane proteins in a proteome. We then proceed to construct improved machine learning algorithms and software for accurate prediction of the topology of transmembrane proteins and for discriminating such proteins from non-transmembrane proteins. Finally, we present a fast way of obtaining a position-specific scoring matrix, which is important for modern topology prediction methods.


Acknowledgments

This chapter is the closing act of the past four years in Sweden. It has been an exceptional experience; this journey, however, would not have been the same without some very special people around me.

First and foremost, I would like to thank my supervisor, Arne, for his belief in me. Our paths crossed back in 2009 for the first time when I visited Sweden for about six months, and, already then, I knew I had to be in your lab. I did not regret a single day here and I surely hope you feel the same. Thank you so much for your guidance, persistence, patience, ideas, encouragement and criticism of my work! You have helped me move forward as a person and as a researcher and for that I will be indebted to you!

Many thanks go to my second supervisor, Gunnar, for always having an open door for me, for his advice and fruitful discussions. He is a role-model to me and it has been great to work in an environment with such great minds. I would also like to thank my mentor, Lena, for her time whenever this was required during the course of my PhD.

Pantelis, I have known you for more than a decade now – yes, it has been that long! We have been through a lot, ups and downs, but, in the end, I believe that we have managed not only to work very efficiently together (even from a distance) but also to maintain a close friendship. Thank you for teaching me during my baby steps in research and I sincerely hope that we can “keep up the good work” in the years to come!

I always supported the idea that good research papers originate from team effort. This was also the case in most of my PhD projects and I would like to take the opportunity to acknowledge my (numerous) co-authors for their contributions and excellent collaboration: Christoph, Nanjiang, Aron, Stefano, Lukas, Niki, Sikander, Per, Minttu, it was a pleasure to work with you and learn from you!

I was fortunate enough to overlap with two “generations” in the Elofsson group and to meet very nice people who created a friendly yet truly professional work environment: Walter, Per, Karolis, David, Oxana, Sudha, John, Marcin, Minttu, Wiktor, Patrik, Linnea, and Daniel, thank you for everything!

Many thanks to Jens Carlsson and the members of his group; Axel, Mariama, Pierre, it was very nice having you around and spending time discussing purely non-scientific matters!

Finally, I would like to thank Stefan at DBB for proof-reading the thesis and all the administrative personnel that support us, the students, during our PhD studies.

As for people “outside” strict University boundaries: the two “Stefans”, with whom I enjoy going out and having my non-alcoholic cocktails! Without you I would not have seen so much of the Stockholm night-life! Jan-Willem, so nice to see you (almost) every day at the gym – except for the days that you fail to show up! It has been a pleasure to discuss science and much more with you all this time! Aron and Susan, I met you separately before you came together, so I am proud to be able to say that I have “followed” this relationship from day one! And it has been so much fun and I am so happy to see you now with Sonja! Let’s see how life will evolve, maybe we will all be living in Switzerland!

On a more personal level now, there are some people that have really made a difference to me here in Sweden and I want to refer to them in a more special manner. I already knew that the people I would meet here during my PhD were bound to leave at some point, but I guess it is always up to us to try to keep in touch with people that are important to our lives.

It is true that, sometimes, what one will actually later on call a true friendship starts off purely by chance. This was the case with Anirudh. It never occurred to me that he would become my best friend throughout this period and yet he did. We talked a lot together, we ate (a lot) together, we went out a lot together, we went on vacations together, we even shared my place for some months – and it was all so natural! If I knew that you could read this, I would write it in Tamil, but I don’t want to make it difficult for you… Thank you from the bottom of my heart for being there, for making time for me whenever I needed you to, for offering me honest opinions and for not holding anything back! When you left I knew it was not the end, we shall be meeting again in the future (maybe I will be a regular in India)!

David and Tamara, I love you both very much! You opened your house to me so many times so that I could enjoy some proper and delicious home-made food and also feel at home (south European mentality of course). You were there for me during rough times and I think we helped each other to adjust to the cold winter nights. No way will you avoid seeing me again in the future, I know where you live in Denmark and I will be visiting!

Thank you Marco and Enrichetta for also adding to the south European environment all this time! Don’t think that I only hang out with you because you cook such delicious spaghetti carbonara – I could do it myself! You are very dear friends, I am very happy that you have each other and I wish you all the best in science and in your future life together!

Christoph and Mirco, thank you for being so nice to me although sometimes I did give you a hard time! I really enjoyed working with you Christoph and learning from you Mirco. Thank you for inviting me over and I wish all the best to you and your families!

Many people back home – but not only there – have “seen me through” these years and it is with great pleasure that I can call them true friends: Νίκη, Εβίτα, Μανώλη, Βασίλη, Νίκο and Δανάη, thank you all for getting excited (or trying to) when I talk about what I have been doing in Sweden over the past few years! Παναγιώτη and Νίκο, whom I met here in Sweden and who have now moved to other countries, I am so glad we have remained in touch! Παναγιώτη and Στέλλα, so nice that we found each other again in Sweden and now we are all bound together since I had the privilege of becoming the godfather to your lovely girl, Μαρία!

I would like to extend a warm “thank you” to the Heusser (and extended) family for taking me in and welcoming me with kindness. It has been an honour and pleasure to meet all of you and I promise to become a professional in skiing and Swiss-German (equally challenging tasks) sooner rather than later.

This course, not only until now, but also for the years to come, would not have been the same without the endless support of the Tsirigos family – my family! They have been there, next to me, in every moment, with everything they had! There is not a day that passes by that I do not miss you now that you are far away. Still, I want you to know that I have you always in my heart and mind and words are poor in trying to express my gratitude to you for what you have given me.

I love you very much and I thank you for being by my side all these years!

Lastly, I would like to write a few words for my partner in life, although I feel that I would need much more space than this entire thesis to describe my feelings for her. Stephanie, your love, compassion and care have transformed me from the day I met you. You are my lucky charm! You taught me to be stronger, to not get mad (or try to), to work harder (the “North-European” mentality), to not be afraid to ski (and ride the lift), to enjoy every day by doing simple things, to hike long distances without any obvious reason, to go to bed earlier than 3 am for when in the future I will get a “normal” (9-5) job, to reason more (and be less afraid of the airplanes) and to give my best in everything I do! You were there for me whenever I needed you and I will always be thankful to you for that! It feels just so right for me to be with you, and, for that (and many, many, many other things):

I love you a lot

References

1. Crick FH. On protein synthesis, Symp Soc Exp Biol 1958;12:138-163.

2. Crick FH, Barnett L, Brenner S et al. General nature of the genetic code for proteins, Nature 1961;192:1227-1232.

3. Sanger F. Some peptides from insulin, Nature 1948;162:491.

4. Sanger F. The arrangement of amino acids in proteins, Adv Protein Chem 1952;7:1-67.

5. UniProt: a hub for protein information, Nucleic Acids Res 2015;43:D204-212.

6. Alberts B. Essential cell biology: an introduction to the molecular biology of the cell. New York; London: Garland, 1998.

7. Berman HM, Westbrook J, Feng Z et al. The Protein Data Bank, Nucleic Acids Res 2000;28:235-242.

8. Anfinsen CB. Principles that govern the folding of protein chains, Science 1973;181:223-230.

9. Chou PY, Fasman GD. Prediction of the secondary structure of proteins from their amino acid sequence, Adv Enzymol Relat Areas Mol Biol 1978;47:45-148.

10. Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy, J Mol Biol 1993;232:584-599.

11. Singer SJ, Nicolson GL. The fluid mosaic model of the structure of cell membranes, Science 1972;175:720-731.

12. Marsh D, Horvath LI, Swamy MJ et al. Interaction of membrane-spanning proteins with peripheral and lipid-anchored membrane proteins: perspectives from protein-lipid interactions (Review), Mol Membr Biol 2002;19:247-255.

13. Krogh A, Larsson B, von Heijne G et al. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes, J Mol Biol 2001;305:567-580.

14. von Heijne G. The membrane protein universe: what's out there and why bother?, J Intern Med 2007;261:543-557.

15. Arinaminpathy Y, Khurana E, Engelman DM et al. Computational analysis of membrane proteins: the largest class of drug targets, Drug Discov Today 2009;14:1130-1135.

16. Gadsby DC, Vergani P, Csanady L. The ABC protein turned chloride channel whose failure causes cystic fibrosis, Nature 2006;440:477-483.

17. Davey J. G-protein-coupled receptors: new approaches to maximise the impact of GPCRS in drug discovery, Expert Opin Ther Targets 2004;8:165-170.

18. Bakheet TM, Doig AJ. Properties and identification of human protein drug targets, Bioinformatics 2009;25:451-457.

19. Yildirim MA, Goh KI, Cusick ME et al. Drug-target network, Nat Biotechnol 2007;25:1119-1126.

20. Ma P, Zemmel R. Value of novelty?, Nat Rev Drug Discov 2002;1:571-572.

21. Kristiansen K. Molecular mechanisms of ligand binding, signaling, and regulation within the superfamily of G-protein-coupled receptors: molecular modeling and mutagenesis approaches to receptor structure and function, Pharmacol Ther 2004;103:21-80.

22. Palczewski K, Kumasaka T, Hori T et al. Crystal structure of rhodopsin: A G protein-coupled receptor, Science 2000;289:739-745.

23. von Heijne G. Recent advances in the understanding of membrane protein assembly and structure, Q Rev Biophys 1999;32:285-307.

24. Schulz GE. Transmembrane beta-barrel proteins, Adv Protein Chem 2003;63:47-70.

25. Margulis L, Bermudes D. Symbiosis as a mechanism of evolution: status of cell symbiosis theory, Symbiosis 1985;1:101-124.

26. Atteia A, van Lis R, van Hellemond JJ et al. Identification of prokaryotic homologues indicates an endosymbiotic origin for the alternative oxidases of mitochondria (AOX) and chloroplasts (PTOX), Gene 2004;330:143-148.

27. Subbarao GV, van den Berg B. Crystal structure of the monomeric porin OmpG, J Mol Biol 2006;360:750-759.

28. Yildiz O, Vinothkumar KR, Goswami P et al. Structure of the monomeric outer-membrane porin OmpG in the open and closed conformation, EMBO J 2006;25:3702-3713.

29. Elofsson A, von Heijne G. Membrane protein structure: prediction versus reality, Annu Rev Biochem 2007;76:125-140.

30. White SH, von Heijne G. The machinery of membrane protein assembly, Curr Opin Struct Biol 2004;14:397-404.

31. Ulrich T, Rapaport D. Biogenesis of beta-barrel proteins in evolutionary context, Int J Med Microbiol 2015;305:259-264.

32. Walther DM, Rapaport D, Tommassen J. Biogenesis of beta-barrel membrane proteins in bacteria and eukaryotes: evolutionary conservation and divergence, Cell Mol Life Sci 2009;66:2789-2804.

33. Paschen SA, Waizenegger T, Stan T et al. Evolutionary conservation of biogenesis of beta-barrel membrane proteins, Nature 2003;426:862-866.

34. Hagan CL, Silhavy TJ, Kahne D. beta-Barrel membrane protein assembly by the Bam complex, Annu Rev Biochem 2011;80:189-210.

35. Luirink J, von Heijne G, Houben E et al. Biogenesis of inner membrane proteins in Escherichia coli, Annu Rev Microbiol 2005;59:329-355.

36. Ruiz N, Kahne D, Silhavy TJ. Advances in understanding bacterial outer-membrane biogenesis, Nat Rev Microbiol 2006;4:57-66.

37. Paetzel M. Structure and mechanism of Escherichia coli type I signal peptidase, Biochim Biophys Acta 2014;1843:1497-1508.

38. Blobel G. Protein targeting, Biosci Rep 2000;20:303-344.

39. von Heijne G. The signal peptide, J Membr Biol 1990;115:195-201.

40. Emanuelsson O, Brunak S, von Heijne G et al. Locating proteins in the cell using TargetP, SignalP and related tools, Nat Protoc 2007;2:953-971.

41. White SH. The progress of membrane protein structure determination, Protein Sci 2004;13:1948-1949.

42. Lee AG. Lipid-protein interactions in biological membranes: a structural perspective, Biochim Biophys Acta 2003;1612:1-40.

43. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein, J Mol Biol 1982;157:105-132.

44. Omasits U, Ahrens CH, Muller S et al. Protter: interactive protein feature visualization and integration with experimental proteomic data, Bioinformatics 2014;30:884-886.

45. von Heijne G. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule, J Mol Biol 1992;225:487-494.

46. Claros MG, von Heijne G. TopPred II: an improved software for membrane protein structure predictions, Comput Appl Biosci 1994;10:685-686.

47. Jones DT, Taylor WR, Thornton JM. A model recognition approach to the prediction of all-helical membrane protein structure and topology, Biochemistry 1994;33:3038-3049.

48. Bernsel A, Viklund H, Falk J et al. Prediction of membrane-protein topology from first principles, Proc Natl Acad Sci U S A 2008;105:7177-7181.

49. Peters C, Tsirigos KD, Shu N et al. Improved topology prediction using the terminal hydrophobic helices rule, Bioinformatics 2016;32:1158-1162.

50. Rost B. PHD: predicting one-dimensional protein structure by profile-based neural networks, Methods Enzymol 1996;266:525-539.

51. Viklund H, Elofsson A. Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information, Protein Sci 2004;13:1908-1917.

52. Viklund H, Elofsson A. OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar, Bioinformatics 2008;24:1662-1668.

53. Sonnhammer EL, von Heijne G, Krogh A. A hidden Markov model for predicting transmembrane helices in protein sequences, Proc Int Conf Intell Syst Mol Biol 1998;6:175-182.

54. Tusnady GE, Simon I. The HMMTOP transmembrane topology prediction server, Bioinformatics 2001;17:849-850.

55. Bagos PG, Liakopoulos TD, Hamodrakas SJ. Algorithms for incorporating prior topological information in HMMs: application to transmembrane proteins, BMC Bioinformatics 2006;7:189.

56. Tsaousis GN, Bagos PG, Hamodrakas SJ. HMMpTM: improving transmembrane protein topology prediction using phosphorylation and glycosylation site prediction, Biochim Biophys Acta 2014;1844:316-322.

57. Kall L, Krogh A, Sonnhammer EL. A combined transmembrane topology and signal peptide prediction method, J Mol Biol 2004;338:1027-1036.

58. Reynolds SM, Kall L, Riffle ME et al. Transmembrane topology and signal peptide prediction using dynamic bayesian networks, PLoS Comput Biol 2008;4:e1000213.

59. Nugent T, Jones DT. Transmembrane protein topology prediction using support vector machines, BMC Bioinformatics 2009;10:159.

60. Viklund H, Bernsel A, Skwark M et al. SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology, Bioinformatics 2008;24:2928-2929.

61. Kall L, Krogh A, Sonnhammer EL. An HMM posterior decoder for sequence feature prediction that includes homology information, Bioinformatics 2005;21 Suppl 1:i251-257.

62. Lao DM, Arai M, Ikeda M et al. The presence of signal peptide significantly affects transmembrane topology prediction, Bioinformatics 2002;18:1562-1566.

63. Bernsel A, Viklund H, Hennerdal A et al. TOPCONS: consensus prediction of membrane protein topology, Nucleic Acids Res 2009;37:W465-468.

64. Klammer M, Messina DN, Schmitt T et al. MetaTM - a consensus method for transmembrane protein topology prediction, BMC Bioinformatics 2009;10:314.

65. Dobson L, Remenyi I, Tusnady GE. CCTOP: a Consensus Constrained TOPology prediction web server, Nucleic Acids Res 2015;43:W408-412.

66. Melen K, Krogh A, von Heijne G. Reliability measures for membrane protein topology prediction algorithms, J Mol Biol 2003;327:735-744.

67. Fagerberg L, Jonasson K, von Heijne G et al. Prediction of the human membrane proteome, Proteomics 2010;10:1141-1149.

68. Tusnady GE, Simon I. Topology prediction of helical transmembrane proteins: how far have we reached?, Curr Protein Pept Sci 2010;11:550-561.

69. Sansom MS. Proline residues in transmembrane helices of channel and transport proteins: a molecular modelling study, Protein Eng 1992;5:53-60.

70. von Heijne G. Proline kinks in transmembrane alpha-helices, J Mol Biol 1991;218:499-503.

71. Park SH, Opella SJ. Tilt angle of a trans-membrane helix is determined by hydrophobic mismatch, J Mol Biol 2005;350:310-318.

72. Virkki M, Boekel C, Illergard K et al. Large tilts in transmembrane helices can be induced during tertiary structure formation, J Mol Biol 2014;426:2529-2538.

73. Yeagle PL, Bennett M, Lemaitre V et al. Transmembrane helices of membrane proteins may flex to satisfy hydrophobic mismatch, Biochim Biophys Acta 2007;1768:530-537.

74. Granseth E, von Heijne G, Elofsson A. A study of the membrane-water interface region of membrane proteins, J Mol Biol 2005;346:377-385.

75. Liang J, Adamian L, Jackups R, Jr. The membrane-water interface region of membrane proteins: structural bias and the anti-snorkeling effect, Trends Biochem Sci 2005;30:355-357.

76. Viklund H, Granseth E, Elofsson A. Structural classification and prediction of reentrant regions in alpha-helical transmembrane proteins: application to complete genomes, J Mol Biol 2006;361:591-603.

77. Yan C, Luo J. An analysis of reentrant loops, Protein J 2010;29:350-354.

78. Van den Berg B, Clemons WM, Jr., Collinson I et al. X-ray structure of a protein-conducting channel, Nature 2004;427:36-44.

79. Mitsuoka K, Murata K, Walz T et al. The structure of aquaporin-1 at 4.5-Å resolution reveals short alpha-helices in the center of the monomer, J Struct Biol 1999;128:34-43.

80. Dutzler R, Campbell EB, Cadene M et al. X-ray structure of a ClC chloride channel at 3.0 Å reveals the molecular basis of anion selectivity, Nature 2002;415:287-294.

81. Granseth E, Viklund H, Elofsson A. ZPRED: predicting the distance to the membrane center for residues in alpha-helical membrane proteins, Bioinformatics 2006;22:e191-196.

82. Rapp M, Granseth E, Seppala S et al. Identification and evolution of dual-topology membrane proteins, Nat Struct Mol Biol 2006;13:112-116.

83. Saaf A, Johansson M, Wallin E et al. Divergent evolution of membrane protein topology: the Escherichia coli RnfA and RnfE homologues, Proc Natl Acad Sci U S A 1999;96:8540-8544.

84. Daley DO, Rapp M, Granseth E et al. Global topology analysis of the Escherichia coli inner membrane proteome, Science 2005;308:1321-1323.

85. Kim SJ, Rahbar R, Hegde RS. Combinatorial control of prion protein biogenesis by the signal sequence and transmembrane domain, J Biol Chem 2001;276:26132-26140.

86. Zhai Y, Saier MH, Jr. The beta-barrel finder (BBF) program, allowing identification of outer membrane beta-barrel proteins encoded within prokaryotic genomes, Protein Sci 2002;11:2196-2207.

87. Wimley WC. Toward genomic identification of beta-barrel membrane proteins: composition and architecture of known structures, Protein Sci 2002;11:301-312.

88. Remmert M, Linke D, Lupas AN et al. HHomp--prediction and classification of outer membrane proteins, Nucleic Acids Res 2009;37:W446-451.

89. Bagos PG, Liakopoulos TD, Spyropoulos IC et al. A Hidden Markov Model method, capable of predicting and discriminating beta-barrel outer membrane proteins, BMC Bioinformatics 2004;5:29.

90. Bigelow HR, Petrey DS, Liu J et al. Predicting transmembrane beta-barrels in proteomes, Nucleic Acids Res 2004;32:2566-2577.

91. Hayat S, Peters C, Shu N et al. Inclusion of dyad-repeat pattern improves topology prediction of transmembrane beta-barrel proteins, Bioinformatics 2016;32:1571-1573.

92. Savojardo C, Fariselli P, Casadio R. BETAWARE: a machine-learning tool to detect and predict transmembrane beta-barrel proteins in prokaryotes, Bioinformatics 2013;29:504-505.

93. Martelli PL, Fariselli P, Krogh A et al. A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins, Bioinformatics 2002;18 Suppl 1:S46-53.

94. Jacoboni I, Martelli PL, Fariselli P et al. Prediction of the transmembrane regions of beta-barrel membrane proteins with a neural network-based predictor, Protein Sci 2001;10:779-787.

95. Gromiha MM, Ahmad S, Suwa M. Neural network-based prediction of transmembrane beta-strand segments in outer membrane proteins, J Comput Chem 2004;25:762-767.

96. Ou YY, Gromiha MM, Chen SA et al. TMBETADISC-RBF: Discrimination of beta-barrel membrane proteins using RBF networks and PSSM profiles, Comput Biol Chem 2008;32:227-231.

97. Berven FS, Flikka K, Jensen HB et al. BOMP: a program to predict integral beta-barrel outer membrane proteins encoded within genomes of Gram-negative bacteria, Nucleic Acids Res 2004;32:W394-399.

98. Freeman TC, Jr., Wimley WC. A highly accurate statistical approach for the prediction of transmembrane beta-barrels, Bioinformatics 2010;26:1965-1974.

99. Yu NY, Wagner JR, Laird MR et al. PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes, Bioinformatics 2010;26:1608-1615.

100. Yan RX, Chen Z, Zhang Z. Outer membrane proteins can be simply identified using secondary structure element alignment, BMC Bioinformatics 2011;12:76.

101. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: an amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins, BMC Bioinformatics 2005;6:56.

102. Imai K, Asakawa N, Tsuji T et al. SOSUI-GramN: high performance prediction for sub-cellular localization of proteins in gram-negative bacteria, Bioinformation 2008;2:417-421.

103. Jones DT. Improving the accuracy of transmembrane protein topology prediction using evolutionary information, Bioinformatics 2007;23:538-544.

104. Kloczkowski A, Ting KL, Jernigan RL et al. Combining the GOR V algorithm with evolutionary information for protein secondary structure prediction from amino acid sequence, Proteins 2002;49:154-166.

105. Tan CW, Jones DT. Using neural networks and evolutionary information in decoy discrimination for protein tertiary structure prediction, BMC Bioinformatics 2008;9:94.

106. Biasini M, Bienert S, Waterhouse A et al. SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information, Nucleic Acids Res 2014;42:W252-258.

107. Valencia A, Pazos F. Prediction of protein-protein interactions from evolutionary information, Methods Biochem Anal 2003;44:411-426.

108. Lopes A, Sacquin-Mora S, Dimitrova V et al. Protein-protein interactions in a crowded environment: an analysis via cross-docking simulations and evolutionary information, PLoS Comput Biol 2013;9:e1003369.

109. Zahiri J, Yaghoubi O, Mohammad-Noori M et al. PPIevo: protein-protein interaction prediction from PSSM based evolutionary information, Genomics 2013;102:237-242.

110. Vicatos S, Reddy BV, Kaznessis Y. Prediction of distant residue contacts with the use of evolutionary information, Proteins 2005;58:935-949.

111. Bartoli L, Fariselli P, Krogh A et al. CCHMM_PROF: a HMM-based coiled-coil predictor with evolutionary information, Bioinformatics 2009;25:2757-2763.

112. Peng K, Vucetic S, Radivojac P et al. Optimizing long intrinsic disorder predictors with protein evolutionary information, J Bioinform Comput Biol 2005;3:35-60.

113. Sethi D, Garg A, Raghava GP. DPROT: prediction of disordered proteins using evolutionary information, Amino Acids 2008;35:599-605.

114. Rashid M, Saha S, Raghava GP. Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs, BMC Bioinformatics 2007;8:337.

115. Wang M, Li A, Xie D et al. Improving prediction of protein subcellular localization using evolutionary information and sequence-order information, Conf Proc IEEE Eng Med Biol Soc 2005;4:4434-4436.

116. Sharma R, Dehzangi A, Lyons J et al. Predict Gram-Positive and Gram-Negative Subcellular Localization via Incorporating Evolutionary Information and Physicochemical Features Into Chou's General PseAAC, IEEE Trans Nanobioscience 2015;14:915-926.

117. Goldberg T, Hamp T, Rost B. LocTree2 predicts localization for all domains of life, Bioinformatics 2012;28:i458-i465.

118. Chen K, Kurgan L. PFRES: protein fold classification by using evolutionary information and predicted secondary structure, Bioinformatics 2007;23:2843-2850.

119. Cheng CW, Su EC, Hwang JK et al. Predicting RNA-binding sites of proteins using support vector machines and evolutionary information, BMC Bioinformatics 2008;9 Suppl 12:S6.

120. Huang YF, Chiu LY, Huang CC et al. Predicting RNA-binding residues from evolutionary information and sequence conservation, BMC Genomics 2010;11 Suppl 4:S2.

121. Kumar M, Gromiha MM, Raghava GP. SVM based prediction of RNA-binding proteins using binding residues and evolutionary information, J Mol Recognit 2011;24:303-313.

122. Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information, Elife 2014;3:e02030.

123. Mousavian Z, Khakabimamaghani S, Kavousi K et al. Drug-target interaction prediction from PSSM based evolutionary information, J Pharmacol Toxicol Methods 2016;78:42-51.

124. Altschul SF, Madden TL, Schaffer AA et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res 1997;25:3389-3402.

125. Altschul SF, Gish W, Miller W et al. Basic local alignment search tool, J Mol Biol 1990;215:403-410.

126. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks, Proc Natl Acad Sci U S A 1992;89:10915-10919.

127. Boratyn GM, Schaffer AA, Agarwala R et al. Domain enhanced lookup time accelerated BLAST, Biol Direct 2012;7:12.

128. Biegert A, Soding J. Sequence context-specific profiles for homology searching, Proc Natl Acad Sci U S A 2009;106:3770-3775.

129. Loh PR, Baym M, Berger B. Compressive genomics, Nat Biotechnol 2012;30:627-630.

130. Rabiner LR, Rader CM. Digital signal processing. New York: IEEE Press, 1972.

131. Durbin R, Eddy SR, Krogh A et al. Biological sequence analysis: probabilistic models of proteins and nucleic acids.

132. Dempster AP. Elements of Continuous Multivariate Analysis. 1969.

133. Krogh A, Mian IS, Haussler D. A hidden Markov model that finds genes in E. coli DNA, Nucleic Acids Res 1994;22:4768-4778.

134. Eddy SR. Multiple alignment using hidden Markov models, Proc Int Conf Intell Syst Mol Biol 1995;3:114-120.

135. Eddy SR. Profile hidden Markov models, Bioinformatics 1998;14:755-763.

136. Juncker AS, Willenbrock H, Von Heijne G et al. Prediction of lipoprotein signal peptides in Gram-negative bacteria, Protein Sci 2003;12:1652-1662.

137. Bendtsen JD, Nielsen H, von Heijne G et al. Improved prediction of signal peptides: SignalP 3.0, J Mol Biol 2004;340:783-795.

138. Asai K, Hayamizu S, Handa K. Prediction of protein secondary structure by the hidden Markov model, Comput Appl Biosci 1993;9:141-146.

139. Bystroff C, Krogh A. Hidden Markov Models for prediction of protein features, Methods Mol Biol 2008;413:173-198.

140. Krogh A, Brown M, Mian IS et al. Hidden Markov models in computational biology. Applications to protein modeling, J Mol Biol 1994;235:1501-1531.

141. Sonnhammer EL, Eddy SR, Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments, Proteins 1997;28:405-420.

142. Eddy SR. A new generation of homology search tools based on probabilistic inference, Genome Inform 2009;23:205-211.

143. Finn RD, Coggill P, Eberhardt RY et al. The Pfam protein families database: towards a more sustainable future, Nucleic Acids Res 2016;44:D279-285.

144. Tsirigos KD, Bagos PG, Hamodrakas SJ. OMPdb: a database of beta-barrel outer membrane proteins from Gram-negative bacteria, Nucleic Acids Res 2011;39:D324-331.

145. Kozma D, Simon I, Tusnady GE. PDBTM: Protein Data Bank of transmembrane proteins after 8 years, Nucleic Acids Res 2013;41:D524-529.

146. Tusnady GE, Dosztanyi Z, Simon I. TMDET: web server for detecting transmembrane regions of proteins by using their 3D coordinates, Bioinformatics 2005;21:1276-1277.

147. Zemla A, Venclovas C, Fidelis K et al. A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment, Proteins 1999;34:220-223.

148. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochim Biophys Acta 1975;405:442-451.

149. Petersen TN, Brunak S, von Heijne G et al. SignalP 4.0: discriminating signal peptides from transmembrane regions, Nat Methods 2011;8:785-786.

150. Tsirigos KD, Peters C, Shu N et al. The TOPCONS web server for consensus prediction of membrane protein topology and signal peptides, Nucleic Acids Res 2015;43:W401-407.

151. Bagos PG, Liakopoulos TD, Spyropoulos IC et al. PRED-TMBB: a web server for predicting the topology of beta-barrel outer membrane proteins, Nucleic Acids Res 2004;32:W400-404.

152. Tsirigos KD, Elofsson A, Bagos PG. PRED-TMBB2: improved topology prediction and detection of beta-barrel outer membrane proteins, Bioinformatics 2016;32:i665-i671.
