
The combination prediction of transmembrane regions based on Dempster-Shafer theory of evidence


Vol.29 No.1/2 JOURNAL OF ELECTRONICS (CHINA) March 2012

THE COMBINATION PREDICTION OF TRANSMEMBRANE REGIONS BASED ON DEMPSTER-SHAFER THEORY OF EVIDENCE¹

Deng Xinyang* Xu Peida** Deng Yong* **

*(School of Computer and Information Science, Southwest University, Chongqing 400715, China)

**(School of Electronics and Information Technology, Shanghai Jiaotong University, Shanghai 200240, China)

Abstract Transmembrane proteins are a special and important class of proteins in cells. Because of their importance and specificity, the prediction of transmembrane regions has considerable theoretical and practical significance. At present, prediction methods are mainly based on the physicochemical properties and statistical analysis of amino acids. However, each of these methods is suitable for some environments but inapplicable to others. In this paper, multi-source information fusion theory is introduced to predict the transmembrane regions. The proposed method is tested on a data set of transmembrane proteins. The results show that the proposed method predicts transmembrane regions with good performance.

Key words Transmembrane regions; Prediction; Dempster-Shafer theory of evidence; Proteins

CLC index TP391

DOI 10.1007/s11767-012-0797-8

I. Introduction

Proteins are the most abundant and functionally diverse biomacromolecules in cells. Transmembrane proteins are a special and important class of proteins and are the principal executors of biomembrane functions. Important cellular functions such as energy conversion, substance transport, and information transfer are carried out by transmembrane proteins. Owing to the importance and particularity of transmembrane proteins, the recognition of transmembrane regions and the prediction of transmembrane protein topology have become an active field of protein structure research.

The prediction of the topology of transmembrane proteins is the study of locating and counting the transmembrane regions, as well as finding their transmembrane orientation, by using the known information of the amino acid sequence. Note that the transmembrane orientation is represented by the location of the amino terminus of the amino acid sequence. At present, the experimental techniques used to study the molecular structure of proteins, including transmembrane proteins, are X-ray crystal diffraction and nuclear magnetic resonance[1]. These techniques achieve high precision. However, they cannot be applied on a large scale because of the harsh experimental conditions they require. Given the special chemical and physical properties of proteins, computational methods offer an effective alternative for studying protein structure and function, and several have been developed to predict the transmembrane regions of transmembrane proteins and their orientation. The earliest, by Kyte, et al.[2], predicts the transmembrane regions based on the hydrophobicity of amino acids. Later, the TopPred method[3] was developed on the basis of the newly found “positive-inside” rule. More recently, many other algorithms have been designed to predict the topology of transmembrane proteins[4–11].

¹ Manuscript received date: October 19, 2011; revised date: December 20, 2011. Supported by the National Natural Science Foundation of China (No. 60874105, 61174022), the Program for New Century Excellent Talents in University (No. NCET-08-0345), and the Chongqing Natural Science Foundation (No. CSCT, 2010BA2003). Corresponding author: Deng Yong, born in 1975, male, Professor. School of Computer and Information Science, Southwest University, Chongqing 400715, China. Email: [email protected].

Information fusion theory is a multi-sensor data fusion technology that fuses information coming from different sources to generate a comprehensive evaluation. With the development of this theory, information fusion has been successfully applied to various fields[12–15]. As a theory for representing and handling uncertain information in information fusion, the Dempster-Shafer theory of evidence[16,17] uses the Basic Probability Assignment (BPA) to express uncertainty. Dempster’s rule of combination is used to combine experiential and conditional information from different sources. The resulting information represents the real world more effectively because it combines many independent pieces of information.

In the research on the topology of transmembrane proteins, many methods exist, but each rests on a different principle and may therefore be suitable for some environments yet inapplicable to others. If the prediction results of different methods are regarded as pieces of evidence coming from different sources, they can be combined using Dempster’s rule to determine the transmembrane regions of unknown transmembrane proteins. As a result, the accuracy of prediction can be effectively improved. In this paper, multi-source information fusion theory is introduced, and the prediction results of multiple prediction methods are combined by Dempster’s rule of combination. With this method, the advantages of the individual prediction methods are fused, the bias of any single method is avoided, and the prediction results can be verified against each other.

The remainder of this paper is organized as follows. A brief introduction to the Dempster-Shafer theory of evidence and the Pignistic Probability Transformation (PPT) is presented in Section II. In Section III, the proposed combination prediction method for transmembrane regions is described. Section IV presents the experimental verification. Finally, conclusions are given in Section V.

II. Preliminaries

1. Dempster-Shafer theory of evidence

The Dempster-Shafer theory of evidence[16,17] is used to handle uncertain information. It was first proposed by Dempster and then developed by Shafer. The theory requires weaker conditions than the Bayesian theory of probability, so it is often regarded as an extension of the Bayesian theory. As a theory of reasoning under uncertainty, Dempster-Shafer theory has the advantage of directly expressing “uncertainty” by assigning probability to the subsets of a set composed of N objects, rather than to each of the individual objects. In addition, Dempster-Shafer theory can combine pairs of bodies of evidence or belief functions to derive a new body of evidence or belief function. In Dempster-Shafer theory, a problem domain is denoted by a finite nonempty set $U$ of mutually exclusive and exhaustive hypotheses, called the frame of discernment. Let $2^U$ denote the power set of $U$. For completeness of the explanation, a few basic concepts are introduced as follows.

Definition 1 For a frame of discernment $U$, a mass function is a mapping $m: 2^U \to [0,1]$, also called a Basic Probability Assignment (BPA), satisfying

$$\sum_{A \subseteq U} m(A) = 1 \tag{1}$$

$$m(\phi) = 0 \tag{2}$$

where $\phi$ is the empty set and $A$ is an element of $2^U$. If $m(A) > 0$, $A$ is called a focal element, and the union of all focal elements is the core of the mass function. All the related focal elements are collectively called the body of evidence.
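The conditions of Definition 1 can be checked mechanically. The following is a minimal sketch, assuming a BPA is stored as a Python dictionary that maps `frozenset` subsets of the frame to masses; this representation and the function names `is_valid_bpa`, `focal_elements`, and `core` are illustrative choices, not part of the paper.

```python
from math import isclose


def is_valid_bpa(m, frame):
    """Check the conditions of Definition 1 for a BPA given as {frozenset: mass}."""
    if any(not a <= frame for a in m):            # every key must be a subset of U
        return False
    if any(v < 0.0 or v > 1.0 for v in m.values()):
        return False
    if m.get(frozenset(), 0.0) != 0.0:            # Eq. (2): no mass on the empty set
        return False
    return isclose(sum(m.values()), 1.0)          # Eq. (1): masses sum to one


def focal_elements(m):
    """Subsets that carry strictly positive mass."""
    return [a for a, v in m.items() if v > 0.0]


def core(m):
    """Union of all focal elements."""
    return frozenset().union(*focal_elements(m))


# Hypothetical two-hypothesis frame and BPA used only for illustration.
U = frozenset({"Y", "N"})
m = {frozenset({"Y"}): 0.7, U: 0.3}
print(is_valid_bpa(m, U), sorted(core(m)))
```

Using `frozenset` keys keeps subsets hashable, so set operations such as inclusion and intersection map directly onto the notation of the definitions.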

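For any subset $A \subseteq U$, the belief function $\mathrm{Bel}: 2^U \to [0,1]$ is defined as

$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B) \tag{3}$$

The plausibility function $\mathrm{Pl}: 2^U \to [0,1]$ is defined as

$$\mathrm{Pl}(A) = 1 - \mathrm{Bel}(\bar{A}) = \sum_{B \cap A \neq \phi} m(B) \tag{4}$$

where $\bar{A}$ is the complement of $A$ in $U$.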
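Eqs. (3) and (4) translate into two short sums. The sketch below assumes a BPA is held as a dictionary from `frozenset` subsets to masses; the representation, the function names, and the example masses are assumptions made for illustration.

```python
def bel(m, a):
    """Belief of a, Eq. (3): total mass of the nonempty focal elements contained in a."""
    return sum(v for b, v in m.items() if b and b <= a)


def pl(m, a):
    """Plausibility of a, Eq. (4): total mass of the focal elements that intersect a."""
    return sum(v for b, v in m.items() if b & a)


# Hypothetical BPA on the frame {Y, N}.
U = frozenset({"Y", "N"})
Y, N = frozenset({"Y"}), frozenset({"N"})
m = {Y: 0.6, U: 0.4}

for name, a in (("{Y}", Y), ("{N}", N), ("{Y,N}", U)):
    print(name, bel(m, a), pl(m, a))
```

The interval between `bel` and `pl` for a subset reflects the mass left on larger sets: here the mass on the whole frame supports neither singleton directly but keeps both plausible.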

Consider two pieces of evidence from different sources, represented by two BPAs $m_1$ and $m_2$. A method is needed to combine them. Dempster’s rule of combination generates a new BPA from two or more BPAs, under the assumption that the sources are independent.

Definition 2 Dempster’s rule of combination, also called the orthogonal sum and denoted by $m = m_1 \oplus m_2 \oplus \cdots \oplus m_n$, is defined for $A \neq \phi$ as follows

$$m(A) = \frac{1}{1-k} \sum_{\cap A_i = A} \; \prod_{1 \le i \le n} m_i(A_i) \tag{5}$$

$$k = \sum_{\cap A_i = \phi} \; \prod_{1 \le i \le n} m_i(A_i) \tag{6}$$

where $k$ is the conflict coefficient of the BPAs and $1/(1-k)$ normalizes the combined masses. Note that Dempster’s rule of combination is applicable only to BPAs that satisfy the condition $k < 1$.
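Definition 2 can be sketched in a few lines of code. The version below combines BPAs two at a time and folds over the list; because Dempster's rule is associative and commutative, this gives the same result as the n-ary form of Eq. (5). The dictionary-of-`frozenset` representation, the function names, and the example masses are assumptions for illustration.

```python
from functools import reduce
from itertools import product


def combine_two(m1, m2):
    """Dempster's rule for two BPAs given as {frozenset: mass}."""
    joint = {}
    k = 0.0                                   # conflict coefficient, Eq. (6)
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            joint[inter] = joint.get(inter, 0.0) + va * vb
        else:
            k += va * vb                      # mass that would fall on the empty set
    if k >= 1.0 - 1e-12:
        raise ValueError("total conflict (k = 1): the rule is not applicable")
    return {a: v / (1.0 - k) for a, v in joint.items()}   # Eq. (5)


def combine(*bpas):
    """Orthogonal sum m1 (+) m2 (+) ... (+) mn by repeated pairwise combination."""
    return reduce(combine_two, bpas)


# Hypothetical example: one source supports Y, the other supports N.
U = frozenset({"Y", "N"})
Y, N = frozenset({"Y"}), frozenset({"N"})
m1 = {Y: 0.9, U: 0.1}
m2 = {N: 0.8, U: 0.2}
m = combine(m1, m2)
print({"".join(sorted(a)): round(v, 4) for a, v in m.items()})
```

In this example the conflicting product 0.9 × 0.8 = 0.72 is discarded and the remaining masses are rescaled by 1/(1 − 0.72), which is why the stronger source still dominates the combined BPA.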

2. Pignistic Probability Transformation (PPT)

However, once all BPAs have been combined into a final BPA, the problem remains of how to make a decision in real applications, for example deciding the future state of an event or judging the class of a target. Since a BPA is a mapping from the subsets of the frame of discernment to the interval [0,1], the focal element with the largest mass may not be a singleton, that is, a set containing only one individual object. Based on the generalized insufficient reason principle proposed in the Transferable Belief Model (TBM)[18], introduced by Smets and Kennes, a probability function derived from a BPA is defined for the purpose of decision making via the so-called PPT[18].

Definition 3 Let $m$ be a BPA on a frame of discernment $U$. The PPT function $\mathrm{BetP}_m: U \to [0,1]$ associated with $m$ is defined by

$$\mathrm{BetP}_m(x) = \sum_{x \in A,\, A \subseteq U} \frac{1}{|A|} \, \frac{m(A)}{1 - m(\phi)}, \quad m(\phi) \neq 1 \tag{7}$$

where $|A|$ is the cardinality of proposition $A$.
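Eq. (7) splits the mass of each focal element equally among its members. A minimal sketch, assuming the same dictionary-of-`frozenset` BPA representation as an illustrative convention (the function name `pignistic` and the example masses are not from the paper):

```python
def pignistic(m):
    """Pignistic probability transformation, Eq. (7), of a BPA {frozenset: mass}."""
    empty_mass = m.get(frozenset(), 0.0)
    if empty_mass >= 1.0:
        raise ValueError("m(empty set) must differ from 1")
    betp = {}
    for a, v in m.items():
        if not a:
            continue                                   # the empty set has no members
        share = v / (len(a) * (1.0 - empty_mass))      # equal split over |A| elements
        for x in a:
            betp[x] = betp.get(x, 0.0) + share
    return betp


# Hypothetical BPA: 0.6 committed to Y, 0.4 left on the whole frame {Y, N}.
m = {frozenset({"Y"}): 0.6, frozenset({"Y", "N"}): 0.4}
p = pignistic(m)
decision = max(p, key=p.get)
print(p, decision)
```

Here the uncommitted mass 0.4 is shared as 0.2 for each hypothesis, so Y receives 0.6 + 0.2 and N receives 0.2, and the decision falls on Y.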

III. The Combination Prediction of Transmembrane Regions

In general, the prediction of the topology of transmembrane proteins has two aspects: the transmembrane regions and their orientation. According to the “positive-inside” rule, the transmembrane orientation can be derived well. This paper therefore focuses primarily on the accuracy of the predicted transmembrane regions. In detail, several prediction algorithms are first used to predict the transmembrane regions on a constructed training set of transmembrane proteins. An individual prediction module is then constructed for each algorithm, with the accuracy of the algorithm on the training set taken as the weight of the module. Once the prediction modules have been constructed, they generate multiple prediction results for an unpredicted transmembrane protein. Each result is represented as a BPA according to the weight of the corresponding prediction module. Hence, every residue in the transmembrane protein has several BPAs expressing whether or not it belongs to a transmembrane region. Finally, these BPAs are fused according to Dempster’s rule of combination to derive a resulting BPA, which is transformed into a probability distribution that serves as the basis of judgment.

1. The construction of individual prediction modules

In this paper, three prediction algorithms based on different principles are selected to construct the individual modules for the prediction of transmembrane regions: TMpred, HMMTOP, and TopPred. TMpred is based on the statistical analysis of TMbase[19], a database extracted from SWISS-PROT. The prediction is made using a combination of several weight-matrices for scoring, such as the number of transmembrane segments, regions, and flanking information. HMMTOP[20] is based on the hypothesis that the localization of the transmembrane segments and the topology are determined by the difference in the amino acid distributions in various structural parts of these proteins rather than by specific amino acid compositions of these parts. In this algorithm, for a given protein, a hidden Markov model with a special architecture is developed to search for the transmembrane topology with the maximum likelihood. TopPred[3] is a program that determines the topology of a transmembrane protein based on hydrophobicity analysis, automatic generation of a set of possible topologies, and ranking of these according to the positive-inside rule.

Suppose the weights of these individual prediction modules have been calculated: the weight of TMpred is $w_1$, the weight of HMMTOP is $w_2$, and the weight of TopPred is $w_3$.

2. The generation of BPAs to amino acid residue

In this subsection, the process of generating BPAs for an amino acid residue is described, taking the TMpred algorithm as an example. The frame of discernment is $\{Y, N\}$, where $Y$ means that the residue belongs to a transmembrane region and $N$ means that it does not. Once the prediction module of the TMpred algorithm has been constructed, if its prediction for an unpredicted amino acid residue is that the residue belongs to a transmembrane region, the following BPA is generated

$$\begin{cases} m_{\mathrm{tm}}(\{Y\}) = w_1 \\ m_{\mathrm{tm}}(\{Y, N\}) = 1 - w_1 \end{cases} \tag{8}$$

Conversely, if the TMpred prediction module indicates that this residue does not belong to a transmembrane region, a different BPA is derived

$$\begin{cases} m_{\mathrm{tm}}(\{N\}) = w_1 \\ m_{\mathrm{tm}}(\{Y, N\}) = 1 - w_1 \end{cases} \tag{9}$$

In the same way, the other two BPAs, coming from the HMMTOP and TopPred prediction modules with weights $w_2$ and $w_3$, are obtained and denoted by $m_{\mathrm{hm}}$ and $m_{\mathrm{tp}}$, respectively.
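The construction in Eqs. (8) and (9) amounts to placing the module weight on the predicted label and the remainder on the whole frame. A minimal sketch follows; the function name `prediction_to_bpa`, the dictionary representation, and the use of the training-set $Q_p$ value of TMpred from Tab. 1 (95.09%, taken as 0.9509) as the weight are assumptions for illustration.

```python
def prediction_to_bpa(in_membrane, weight):
    """Build the BPA of Eq. (8) (vote Y) or Eq. (9) (vote N) for one residue.

    in_membrane: True if the module predicts the residue lies in a
                 transmembrane region, False otherwise.
    weight:      module weight in [0, 1].
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    label = "Y" if in_membrane else "N"
    return {
        frozenset({label}): weight,             # mass committed to the predicted label
        frozenset({"Y", "N"}): 1.0 - weight,    # remaining mass stays uncommitted
    }


# Assumed weight: TMpred's training-set Qp from Tab. 1, expressed as a fraction.
w1 = 0.9509
m_tm_yes = prediction_to_bpa(True, w1)    # Eq. (8)
m_tm_no = prediction_to_bpa(False, w1)    # Eq. (9)
print(m_tm_yes)
print(m_tm_no)
```

Leaving the mass $1 - w_1$ on $\{Y, N\}$ rather than on the opposite label means that a module's error rate is treated as ignorance, not as evidence for the other class.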

3. The combination of BPAs

Once all BPAs of the amino acid residue have been generated by these prediction modules, a resulting BPA is derived by combining them with Dempster’s rule of combination

$$m = m_{\mathrm{tm}} \oplus m_{\mathrm{hm}} \oplus m_{\mathrm{tp}} \tag{10}$$

where $m$ represents the resulting BPA, which is used to determine whether or not this amino acid residue belongs to a transmembrane region.
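As a worked illustration of Eq. (10), consider a residue for which TMpred and HMMTOP vote $Y$ while TopPred votes $N$. This voting pattern is a hypothetical case chosen for illustration, not one reported in the paper. Write $q = (1 - w_1)(1 - w_2)$. Combining $m_{\mathrm{tm}}$ and $m_{\mathrm{hm}}$ first (they do not conflict) and then $m_{\mathrm{tp}}$ with Dempster's rule gives

$$
\begin{aligned}
(m_{\mathrm{tm}} \oplus m_{\mathrm{hm}})(\{Y\}) &= 1 - q, \qquad (m_{\mathrm{tm}} \oplus m_{\mathrm{hm}})(\{Y, N\}) = q, \\
k &= (1 - q)\, w_3, \\
m(\{Y\}) &= \frac{(1 - q)(1 - w_3)}{1 - k}, \qquad
m(\{N\}) = \frac{q\, w_3}{1 - k}, \qquad
m(\{Y, N\}) = \frac{q\,(1 - w_3)}{1 - k}.
\end{aligned}
$$

The three numerators sum to $1 - (1 - q)\,w_3 = 1 - k$, so the combined masses sum to one. If, as an assumption, the weights are taken to be the training-set $Q_p$ values of Tab. 1 expressed as fractions ($w_1 = 0.9509$, $w_2 = 0.9443$, $w_3 = 0.9318$), these expressions give approximately $m(\{Y\}) \approx 0.961$, $m(\{N\}) \approx 0.036$, and $m(\{Y, N\}) \approx 0.003$, so the two agreeing modules outweigh the single dissenting one.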

4. The transformation of BPA into a probability distribution

However, a BPA is inconvenient for making a decision. An alternative is to transform the BPA into a probability distribution, and the PPT approach proposed by Smets is usually used for this transformation. Hence, in this paper, the resulting BPA is transformed into a probability distribution, denoted by $p$, using PPT. The proposition with the highest probability determines whether or not the residue belongs to a transmembrane region.

Finally, adjacent amino acid residues that belong to transmembrane regions are joined to form complete transmembrane regions.
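The final joining step can be sketched as grouping consecutive residues that were judged to lie in the membrane. The paper does not specify details such as a minimum region length or gap filling, so the version below simply merges maximal runs; the function name and the 1-based inclusive coordinates are assumptions for illustration.

```python
from itertools import groupby


def residues_to_regions(labels):
    """Merge per-residue decisions into (start, end) regions.

    labels: sequence of booleans, True where the residue was judged to
            belong to a transmembrane region.
    Returns 1-based, inclusive (start, end) pairs, one per maximal run of True.
    """
    regions = []
    pos = 1
    for is_tm, run in groupby(labels):
        length = len(list(run))
        if is_tm:
            regions.append((pos, pos + length - 1))
        pos += length
    return regions


# Hypothetical per-residue decisions for a short sequence.
labels = [False, True, True, True, False, False, True, True]
print(residues_to_regions(labels))   # [(2, 4), (7, 8)]
```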

IV. Experimental Verification

At present, several datasets of transmembrane proteins have been collected by researchers to test and verify prediction methods for transmembrane proteins. Moller, et al.[21] compiled a database of transmembrane proteins extracted from SWISS-PROT. It is freely available at ftp://ftp.ebi.ac.uk/pub/databases/testsets/transmembrane/. The database is classified into four parts based on the experimental data available. Among these, Part A, which consists of 37 proteins and 119 transmembrane helices, has structures available and is validated by many programs, such as TMHMM and DAS.

In this paper, Part A is selected as the experimental dataset. Two-thirds of the experimental dataset is used to train the individual prediction modules. The remaining one-third of the dataset is used to verify the proposed method.

To measure the prediction performance, the evaluation method developed by Tusnady and Simon[20] is adopted in this paper. Because the precision of the prediction of transmembrane segments is limited, a prediction is considered successful when the overlapping region of a predicted and an observed transmembrane segment contains at least 9 amino acids. The total numbers of predicted and observed transmembrane regions are denoted by $N_{\mathrm{prd}}$ and $N_{\mathrm{obs}}$, respectively. The number of predicted regions that overlap observed ones in this sense is denoted by $N_{\mathrm{cor}}$. The efficiency of the transmembrane segment prediction is measured by $M = N_{\mathrm{cor}}/N_{\mathrm{obs}}$ and $C = N_{\mathrm{cor}}/N_{\mathrm{prd}}$. The overall prediction power is defined as

$$Q_p = \sqrt{M \cdot C} \times 100\% \tag{11}$$
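The evaluation measures are simple to compute once the counts are known. The sketch below implements the 9-residue overlap criterion for a single pair of segments and the measures $M$, $C$, and $Q_p$; the function names and the inclusive (start, end) segment representation are assumptions for illustration. The counts in the usage example are those of the HMMTOP row of Tab. 1.

```python
from math import sqrt


def overlap_length(pred, obs):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(pred[1], obs[1]) - max(pred[0], obs[0]) + 1)


def is_successful(pred, obs, min_overlap=9):
    """A prediction counts as correct if it shares at least 9 residues with the observation."""
    return overlap_length(pred, obs) >= min_overlap


def prediction_power(n_obs, n_prd, n_cor):
    """Return (M, C, Qp) in percent, following Eq. (11)."""
    m = n_cor / n_obs
    c = n_cor / n_prd
    return 100.0 * m, 100.0 * c, 100.0 * sqrt(m * c)


# Counts from the HMMTOP row of Tab. 1: Nobs = 79, Nprd = 82, Ncor = 76.
m, c, qp = prediction_power(79, 82, 76)
print(f"M = {m:.2f}%  C = {c:.2f}%  Qp = {qp:.2f}%")
```

Because $Q_p$ is the geometric mean of $M$ and $C$, a method cannot score well by over-predicting regions (high $M$, low $C$) or by predicting very few (high $C$, low $M$).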

Tab. 1 shows the prediction results of the individual algorithms on the training set. The overall prediction accuracy of each algorithm is taken as the associated weight to construct the individual prediction module.

After that, the combination prediction method is applied to the test set. Tab. 2 displays the results of the various algorithms on the test set for the prediction of transmembrane regions. It shows that these algorithms perform well in the prediction of transmembrane regions. Moreover, the experiment supports the effectiveness of the proposed combination method: its prediction accuracy is 97.5%, which is higher than that of HMMTOP and TopPred and equal to that of TMpred.


Tab. 1 The prediction results of individual algorithms on the training set

| Method | Nobs | Nprd | Ncor | M (%) | C (%) | Qp (%) |
|---|---|---|---|---|---|---|
| HMMTOP | 79 | 82 | 76 | 96.20 | 92.68 | 94.43 |
| TMpred | 79 | 83 | 77 | 97.47 | 92.77 | 95.09 |
| TopPred | 79 | 82 | 75 | 94.94 | 91.46 | 93.18 |

Tab. 2 The results on the test set for the prediction of transmembrane regions

| Method | Nobs | Nprd | Ncor | M (%) | C (%) | Qp (%) |
|---|---|---|---|---|---|---|
| HMMTOP | 40 | 41 | 39 | 97.50 | 95.12 | 96.30 |
| TMpred | 40 | 40 | 39 | 97.50 | 97.50 | 97.50 |
| TopPred | 40 | 41 | 39 | 97.50 | 95.12 | 96.30 |
| The proposed method | 40 | 40 | 39 | 97.50 | 97.50 | 97.50 |

Among the three individual prediction methods, HMMTOP, TMpred, and TopPred, the prediction performance of TMpred is the best. According to the experimental results, the proposed method appears to give the same result as TMpred. From the data in Tab. 2, the measures $M$, $C$, and $Q_p$ are the same for the proposed method and TMpred because the derived confusion matrices are the same. However, this only means that the numbers of correctly predicted transmembrane regions and observed transmembrane regions are the same. Because the precision of the prediction of transmembrane segments is limited, a prediction is considered successful when the overlapping region of a predicted and an observed transmembrane segment contains at least 9 amino acids. Therefore, a transmembrane region in a membrane protein sequence may be correctly predicted by both the proposed method and the TMpred algorithm while the predicted boundaries of the region differ between the two methods. From this point of view, the results of the proposed method and the TMpred algorithm are different. The experimental results thus show that the proposed method and TMpred perform equally well in the prediction of transmembrane regions, and that the prediction performance of the proposed method has reached a high level. The proposed method is another good choice for the prediction of transmembrane regions.

Moreover, from the viewpoint of methodology, applying multi-source information fusion theory allows the experiential and conditional information coming from different sources to be fused into a comprehensive evaluation, and the resulting information represents the real world more effectively because it combines many independent pieces of information. In addition, because the proposed method is partially based on statistical regularity and uses much more information, its performance is expected to be better on large datasets. Hence, the proposed method is effective for the prediction of transmembrane regions of transmembrane proteins.

V. Conclusions

In this paper, multi-source information fusion theory has been introduced to the prediction of the topology of transmembrane proteins. Several independent prediction methods are first used to make predictions. The resulting predictions are represented as BPAs in the Dempster-Shafer theory of evidence and are then fused using Dempster’s rule of combination. Finally, the transmembrane regions of a transmembrane protein are predicted according to the probability distribution transformed from the combined BPA. An experiment supports the effectiveness of the proposed method. The study provides a new idea for research on the prediction of transmembrane regions.

References

[1] S. Topiol and M. Sabio. X-ray structure breakthroughs in the GPCR transmembrane region. Biochemical Pharmacology, 78(2009)1, 11–20.

[2] J. Kyte and R. F. Doolittle. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157(1982)1, 105–132.

[3] G. von Heijne. Membrane protein structure prediction: hydrophobicity analysis and the positive-inside rule. Journal of Molecular Biology, 225(1992)2, 487–494.

[4] Y. Deng, Q. Liu, and Y. X. Li. Prediction of transmembrane segments based on fuzzy cluster analysis of amino acids. Acta Chimica Sinica, 62(2004)19, 1968–1972.

[5] Y. Deng. TSFSOM: transmembrane segments prediction by fuzzy self-organizing map. Lecture Notes in Computer Science, 3973(2006), 728–733.


[6] Q. Liu, Y. S. Zhu, B. H. Wang, and Y. X. Li. A HMM-based method to predict the transmembrane regions of beta-barrel membrane proteins. Computational Biology and Chemistry, 27(2003)1, 69–76.

[7] Y. Deng, Q. Liu, and Y. X. Li. Scoring hidden Markov models to discriminate beta-barrel membrane proteins. Computational Biology and Chemistry, 28(2004)3, 189–194.

[8] A. Bernsel, H. Viklund, A. Hennerdal, and A. Elofsson. TOPCONS: consensus prediction of membrane protein topology. Nucleic Acids Research, 37(2009), 465–468.

[9] I. K. Kitsas, L. J. Hadjileontiadis, and S. M. Panas. Transmembrane helix prediction in proteins using hydrophobicity properties and higher-order statistics. Computers in Biology and Medicine, 38(2008)8, 867–880.

[10] A. Hennerdal and A. Elofsson. Rapid membrane protein topology prediction. Bioinformatics, 27(2011)9, 1322–1323.

[11] Y. Wei and C. A. Floudas. Enhanced inter-helical residue contact prediction in transmembrane proteins. Chemical Engineering Science, 66(2011)19, 4356–4369.

[12] Y. Deng, X. Su, D. Wang, and Q. Li. Target recognition based on fuzzy dempster data fusion method. Defence Science Journal, 60(2010)5, 525–530.

[13] Y. Deng, W. Jiang, and R. Sadiq. Modelling contaminant intrusion in water distribution networks: a new similarity-based DST method. Expert Systems with Applications, 38(2011)1, 571–578.

[14] Y. Deng and F. T. S. Chan. A new fuzzy dempster MCDM method and its application in supplier selection. Expert Systems with Applications, 38(2011)8, 9854–9861.

[15] Y. Deng, F. T. S. Chan, Y. Wu, and D. Wang. A new linguistic MCDM method based on multiple-criterion data fusion. Expert Systems with Applications, 38(2011)6, 6985–6993.

[16] A. Dempster. Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics, 38(1967)2, 325–339.

[17] G. Shafer. A Mathematical Theory of Evidence. Princeton, USA, Princeton University Press, 1976.

[18] P. Smets and R. Kennes. The transferable belief model. Artificial Intelligence, 66(1994)3, 191–243.

[19] K. Hofmann and W. Stoffel. TMbase-A database of membrane spanning proteins segments. Biological Chemistry Hoppe-Seyler, 374(1993), 166.

[20] G. E. Tusnady and I. Simon. Principles governing amino acid composition of integral membrane proteins: applications to topology prediction. Journal of Molecular Biology, 283(1998)2, 489–506.

[21] S. Moller, E. V. Kriventseva, and R. Apweiler. A collection of well characterized integral membrane proteins. Bioinformatics, 16(2000)12, 1159–1160.