

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March – April (2013), © IAEME

GENERIC APPROACH FOR PREDICTING UNANNOTATED PROTEIN PAIR FUNCTION USING PROTEIN

Anjan Kumar Payra¹, Sovan Saha¹

¹Dept. of Computer Science & Engg., Dr. Sudhir Chandra Sur Degree Engineering College, DumDum, Kolkata, India

ABSTRACT

Proteins are the most versatile macromolecules in living systems and serve crucial functions in essentially all biological processes. With the successful sequencing of several genomes, the challenging problem in the post-genomic era is to determine the functions of proteins. Determining protein functions experimentally is a laborious, time-consuming and resource-intensive task. Research is therefore under way to predict protein functions using computational methods. This matters because the drugs for many diseases are still unknown or yet to be discovered, and the drug discovery process starts with protein identification, since proteins are responsible for many functions required for the maintenance of life. Protein identification in turn requires the determination of protein function. The computational methods are based on sequence and structure, gene neighborhood, gene fusions, cellular localization, protein-protein interactions, etc. In this work, we present an approach to predict the functions of an unannotated protein pair in an intelligent way based on their protein interaction network. The success rate obtained in our work is 94.4%.

Keywords: Protein interaction network, unannotated protein pair function prediction, functional groups, success rate.

I. INTRODUCTION

Proteins are the building blocks of life. The human body needs protein to repair and maintain itself, so proteins have versatile functions to perform. However, the concept of protein function is highly context-sensitive and not very well defined. In fact, this concept typically acts as an umbrella term for all types of activities that a protein is involved in, be they cellular, molecular or physiological. One such categorization of the types of functions a protein can perform has been suggested by Bork et al. [1998]:

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING

& TECHNOLOGY (IJCET)

ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 4, Issue 2, March – April (2013), pp. 142-157 © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com


o Molecular function: The biochemical functions performed by a protein, such as ligand binding, catalysis of biochemical reactions and conformational changes.

o Cellular function: Many proteins come together to perform complex physiological functions, such as the operation of metabolic pathways and signal transduction, to keep the various components of the organism working well.

o Phenotypic function: The integration of the physiological subsystems, consisting of various proteins performing their cellular functions, and the interaction of this integrated system with environmental stimuli determine the phenotypic properties and behavior of the organism.

In order to predict protein function, we have to study the existing data types, which can be broadly classified into eight categories:

• Amino acid sequences

• Protein structure

• Genome sequences

• Phylogenetic data

• Micro array expression data

• Protein interaction networks and protein complexes

• Biomedical literature

• Combination of multiple data types

• Amino acid sequences: An amino acid sequence is the order in which amino acids join together to form peptide chains, or polypeptides. If the peptide chain is a protein, this sequence is often called the primary structure of the protein. Because of the structure of amino acids and the way they bond together, the order of the amino acids is read in only one direction and is specific to the peptide being formed. The sequence can be used to identify a protein or homologous proteins through database searches and also to obtain information about post-translational cleavage points. In addition, sequencing results provide information about the purity of a preparation; the limits of detectable contamination depend on the sequences of the analyzed proteins. The central dogma of molecular biology describes the conversion of a gene to a protein via the transcription and translation phases, as shown in Fig. 1. The result of this process is a sequence constructed from twenty amino acids, known as the protein's primary structure. This sequence is the most fundamental form of information available about the protein, since it determines different characteristics of the protein such as its sub-cellular localization, structure and function.

Fig. 1 Central dogma of molecular biology
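The two phases of the central dogma can be illustrated with a minimal sketch. It assumes the input is the coding strand of the gene, so transcription reduces to replacing T with U, and it uses a deliberately partial codon table chosen only for illustration; the function names are hypothetical and are not part of the method proposed in this paper.

```python
# Illustrative sketch only: a partial codon table (assumption), not the full genetic code.
CODON_TABLE = {
    "AUG": "M",               # methionine (start codon)
    "UUU": "F", "UUC": "F",   # phenylalanine
    "GGU": "G", "GGC": "G",   # glycine
    "GCU": "A", "GCC": "A",   # alanine
    "AAA": "K", "AAG": "K",   # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_strand):
    """Transcription phase: coding-strand DNA to mRNA (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translation phase: read codons left to right until a stop codon."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X marks a codon missing from the partial table
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)


primary_structure = translate(transcribe("ATGTTTGGCAAATAA"))
```

The resulting string is the primary structure in one-letter amino acid codes; a real pipeline would use the complete genetic code.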


The most popular experimental method for the identification of protein sequences is mass spectrometry [Sickmann et al. 2003], which, in combination with algorithms such as ProFound [Zhang and Chait 2000], comes in various flavors, such as peptide mass fingerprinting, peptide fragmentation and other comparative methods. However, these methods are low-throughput, and thus, with the exponential generation of genome sequences, the focus has shifted to computational approaches that can identify genes from these genomes. Specifically, techniques that predict protein function from sequence can be categorized into three classes, namely, sequence homology-based approaches, subsequence-based approaches and feature-based approaches, which are explained below:

Homology-based approaches: Homologous traits of organisms are those that are due to descent from a common ancestor. The homology-based search process can be made more sensitive by multiple means, such as making the search probabilistic and adding evidence from other sources of data, to obtain more accurate and confident annotations for the query proteins.

Subsequence-based approaches: It has been reflected in several studies that often not the

whole sequence, but only some segments of it are important for determining the function of a

given protein. Consequently, the approaches in this category treat these segments or

subsequences as features of a protein sequence and construct models for the mapping of these

features to protein function. These models are then used to predict the function of a query

protein.

Feature-based approaches: The final category of approaches attempts to exploit the

perspective that the amino acid sequence is a unique characterization of a protein, and

determines several of its physical and functional features. These features are used to construct

a predictive model which can map the feature-value vector of a query protein to its function.
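As one possible instance of a feature-value vector, the sketch below computes the amino acid composition of a sequence, i.e. the fraction of each of the twenty standard residues. The choice of composition as the feature set and the name `composition_features` are assumptions made for illustration; the approaches surveyed here may use quite different features.

```python
# The twenty standard amino acids in one-letter code, in a fixed order
# so that every protein maps to a vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return a 20-dimensional vector of residue frequencies for a protein sequence."""
    seq = sequence.upper()
    # Count only standard residues so that unknown symbols do not distort the fractions.
    total = sum(1 for residue in seq if residue in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(residue) / total for residue in AMINO_ACIDS]


features = composition_features("MKKLAG")
```

A vector of this kind could then be passed to any standard classifier to map a query protein to a functional class.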

• Protein structure: A protein is an organic biopolymer composed of a set of amino acids, and it assumes a configuration in three-dimensional space due to interactions between these constituents, as shown in Fig. 2. Protein structures may be specified at multiple levels. Usually, a structure is specified at three levels, with a fourth level being specified in some cases [Schulz and Schirmer 1996]. Following is a brief description of these levels:

Primary structure: The primary structure of a protein is simply its sequence of amino acids.

Secondary structure: The sequence of a protein influences its conformation in three-dimensional space via the formation of bonds between spatially close amino acids in the sequence. This process is popularly known as protein folding, and leads to the creation of substructures such as α-helices, β-sheets, turns and random coils, of which the first two are the most common, while the last two are less frequent. The collection of these substructures forms the secondary structure of a protein.

Tertiary structure: The attractive and repulsive forces among the substructures caused by the folding balance each other and provide the protein with a relatively stable, though complicated, three-dimensional structure. This structure is known as the tertiary structure of the protein.

Quaternary structure: Some proteins, such as the spectrin protein [Fuller et al. 1974], consist of multiple amino acid sequences, also known as protein subunits. Each of these sequences folds to form its own tertiary structure, and these come together to produce the quaternary structure of the protein.


The existing approaches in predicting protein functions from protein structure are:

Similarity-based approaches: Given the structure of a protein, these approaches identify the

protein with the most similar structure using structural alignment techniques, and transfer its

functional annotations to the query protein.

Fig. 2 Structure of protein

Motif-based approaches: The approaches in this category attempt to identify three

dimensional motifs, that are substructures conserved in a set of functionally related proteins,

and estimate a mapping between the function of a protein and the structural motifs it contains.

This mapping is then used to predict the functions of unannotated proteins.

Surface-based approaches: It is sometimes necessary to analyze the structure of a protein at

a higher resolution than that of distances between consecutive amino acids. This corresponds

to the modeling of a continuous surface for the structure and identifying features such as

voids or holes in these surfaces. The approaches in this category utilize these features to infer

a protein’s function.

Learning-based approaches: This category of recent approaches employs effective classification methods, such as SVM and k-nearest neighbor, to identify the most appropriate functional class for a protein from its most relevant structural features.

• Genomic sequences: Genome sequencing is a laboratory process that determines the complete DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as the DNA contained in the mitochondria and, for plants, in the chloroplast. Almost any biological sample containing a full copy of the DNA, even a very small amount of DNA or ancient DNA, can provide the genetic material necessary for full genome sequencing. DNA itself is typically a double-stranded molecule, where one of the strands is composed of four characters, namely A, T, C and G, which denote the four nucleotides adenine, thymine, cytosine and guanine, and the other strand is complementary to the first, owing to the complementarity of the A−T and C−G nucleotide pairs, as shown in Fig. 3. Several approaches have been proposed to derive functional associations from genomic data and, subsequently, to predict function.
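Strand complementarity can be shown with a short sketch. It assumes standard Watson-Crick pairing (A with T, C with G) and an input containing only these four letters; the function names are illustrative.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the opposite strand read in its own 5' to 3' direction."""
    return complement(strand)[::-1]


other_strand = reverse_complement("ATCG")
```

The reversal reflects the fact that the two strands of the double helix run in opposite directions.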

Fig. 3 DNA molecules

These approaches largely fall into one of the following three categories [Marcotte 2000]:

Genome-wide homology-based annotation transfer: This category consists simply of the use of larger databases for searching proteins homologous to the query proteins, and the transfer of functional annotation from the closest results.

Gene neighborhood- or gene order-based approaches: These approaches are based on the hypothesis that proteins whose corresponding genes are located "close" to each other in multiple genomes are expected to interact functionally. This hypothesis is supported by the concept of an operon, and its relevance to protein function [Salgado et al. 2000].

Gene fusion-based approaches: These approaches attempt to discover pairs or sets of genes in one genome that are merged to form a single gene in another genome. The underlying hypothesis here is that these sets of genes are functionally related, and it is supported by biochemical and structural evidence [Marcotte et al. 1999].

• Phylogenetic data: A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the inferred evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical and/or genetic characteristics. The organisms joined together in the tree are implied to have descended from a common ancestor. In a rooted phylogenetic tree, each node with descendants represents the inferred most recent common ancestor of the descendants, and the edge lengths in some trees may be interpreted as time estimates. Each node is called a taxonomic unit. Internal nodes are generally called hypothetical taxonomic units, as they cannot be directly observed. Phylogenetic profiling is a bioinformatics technique in which the joint presence or joint absence of two traits across large numbers of species is used to infer a meaningful biological connection, such as the involvement of two different proteins in the same biological pathway. It is essential to include the evolutionary perspective in any complete understanding of protein function. As a result, several approaches for predicting protein function using evolution-based data have recently been proposed. The field of biology that deals with the evolutionary relationships among living organisms is also known as phylogenetics [Bittar and Sonderegger 2004]. The phylogenetic profile of a protein is (generally) a binary vector whose length is the number of available genomes. The vector contains a 1 in the ith position if the ith genome contains a homologue of the corresponding gene, else a 0. In several other studies, a more extensive representation of evolutionary knowledge is used [Bittar and Sonderegger 2004]. This representation is known as a phylogenetic tree [Baldauf 2003], which is a standard tree with respect to the graph theoretical definition, but whose nodes and branches carry special meaning, as shown in Fig. 4.
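The binary phylogenetic profile can be sketched as follows. The genome contents and gene names are made-up placeholders, and the agreement score used to compare two profiles is only one simple choice among many; neither is taken from the cited studies.

```python
def phylogenetic_profile(gene, genomes):
    """Binary vector: 1 at position i if genome i contains a homologue of the gene, else 0."""
    return [1 if gene in genome else 0 for genome in genomes]


def profile_agreement(profile_a, profile_b):
    """Fraction of genomes in which two genes are jointly present or jointly absent."""
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)


# Hypothetical data: each genome is the set of gene families found in it.
genomes = [{"geneA", "geneB"}, {"geneC"}, {"geneA", "geneB", "geneC"}, set()]

profile_a = phylogenetic_profile("geneA", genomes)
profile_b = phylogenetic_profile("geneB", genomes)
profile_c = phylogenetic_profile("geneC", genomes)
```

Genes with identical or near-identical profiles, such as `geneA` and `geneB` here, would be candidates for a functional link under the profiling hypothesis.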

• Micro array expression data: Protein synthesis from genes occurs in prokaryotic organisms in two phases [Weaver 2002]. In the transcription phase, an mRNA is created from the original gene by converting the latter to the corresponding RNA code. The protein is then synthesized from the mRNA by translating the RNA code to the corresponding amino acid sequence according to the codon translation rules. Gene expression experiments are a method to quantitatively measure the transcription phase of protein synthesis [Nguyen et al. 2002]. The most common category of these experiments uses square-shaped glass chips measuring as little as 1 inch on either side, also known as cDNA micro arrays. An experiment using a micro array is shown in Fig. 5. The experiment is carried out in the following stages.

Fig. 4 Constructing a simple phylogenetic tree

In the first stage, the chip is laid out with a matrix of dots of cDNAs, usually several thousand in number, one corresponding to each of the genes being measured. In parallel, mRNA is extracted from both the normal cells and the cells of the organism that have been exposed to the condition being studied. These mRNAs are reverse transcribed to cDNA and colored green and red respectively. The colored cDNAs are then spread on the micro array chip, leading to hybridization of the cDNA already on the chip with those produced by the genes in the two types of cells. This generates a spot of a certain color on the chip for each gene, which denotes its expression level. In the final stage of the experiment, the intensity of this spot is measured by a laser scanner connected to a computer, which generates a real-valued measurement of the expression of each gene as the logarithm of the ratio of the red and green intensities in the region. The result of the experiment is thus a measurement of the transcription activity of the genes under the specified condition.
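The final measurement step can be sketched under a common convention: the expression value of a spot is the base-2 logarithm of the ratio of the red (condition) intensity to the green (normal) intensity. The base of the logarithm and the pseudocount that guards against division by zero are assumptions of this sketch, not details stated in the text.

```python
import math


def log_ratio(red, green, pseudocount=1.0):
    """Log2 ratio of red (condition) to green (normal) spot intensity.

    Positive values suggest higher expression under the studied condition,
    negative values suggest lower expression, and 0 means no change.
    """
    return math.log2((red + pseudocount) / (green + pseudocount))


# Hypothetical scanner readings for three spots: (red, green).
spots = {"gene1": (7.0, 1.0), "gene2": (1.0, 7.0), "gene3": (5.0, 5.0)}
expression = {gene: log_ratio(r, g) for gene, (r, g) in spots.items()}
```

Using a logarithm makes a doubling and a halving of expression symmetric around zero, which is convenient for downstream clustering and classification.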


Fig. 5 Micro array procedure

Existing approaches in gene expression data are:

Clustering-based approaches: An underlying hypothesis of gene expression analysis is that

functionally similar genes have similar expression profiles, since they are expected to be

activated and repressed under the same conditions. Because clustering is a natural approach

for grouping similar data points, approaches in this category cluster genes on the basis of

their gene expression profiles, and assign functions to the unannotated proteins using the

most dominant function for the respective clusters containing them.
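The annotation step of a clustering-based approach can be sketched as below. The clusters are assumed to have been produced already by any clustering algorithm applied to the expression profiles, each annotated gene is assumed to carry a single function label, and the gene names and labels are placeholders.

```python
from collections import Counter


def annotate_by_cluster(clusters, annotations):
    """Assign each unannotated gene the most frequent function in its cluster.

    clusters: list of lists of gene identifiers (output of any clustering step).
    annotations: dict mapping annotated genes to a single function label.
    Clusters without any annotated member yield no predictions.
    """
    predictions = {}
    for members in clusters:
        counts = Counter(annotations[g] for g in members if g in annotations)
        if not counts:
            continue
        dominant_function = counts.most_common(1)[0][0]
        for gene in members:
            if gene not in annotations:
                predictions[gene] = dominant_function
    return predictions


clusters = [["g1", "g2", "g3", "g4"], ["g5", "g6"]]
annotations = {"g1": "transport", "g2": "transport", "g3": "binding", "g5": "catalysis"}
predicted = annotate_by_cluster(clusters, annotations)
```

Ties between equally frequent functions are broken arbitrarily in this sketch; a practical method would need an explicit tie-breaking or confidence rule.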

Classification-based approaches: A more direct solution to the problem of predicting protein function from gene expression profiles is the data mining approach of classification. Thus, approaches in this category build various types of models for the expression-to-function mapping using classifiers, such as neural networks, SVMs and the naive Bayes classifier, and use these models to annotate novel proteins.

Temporal analysis-based approaches: Temporal gene expression experiments measure the activity of genes at different instants of time, for instance, during a disease. This behavior can also be used to predict protein function. Thus, approaches in this category derive features from this temporal data and apply classification to them.

• Protein interaction networks and protein complexes: A protein almost never performs its function in isolation. Rather, it usually interacts with other proteins in order to accomplish a certain function. However, in keeping with the complexity of the biological machinery, these interactions are of various kinds. At the highest level, they can be categorized into genetic and physical interactions. Genetic interactions occur when mutations in one gene cause modifications in the behavior of another gene, which implies that these interactions are only conceptual and do not occur physically in a genome. In our work we consider the physical interactions between proteins, since they are more directly related to the process through which a protein accomplishes its functions. Since a protein generally interacts with more than one other protein, these interactions can be structured to form a network, hence the name protein interaction network, as shown in Fig. 6 and Fig. 7.


Fig. 6 Organic View (Cytoscape) of our data set

Existing Approaches that attempt to predict function of proteins from a protein interaction

network can be broadly categorized into the following four categories:

Neighborhood-based approaches: These approaches utilize the neighborhood of the query

protein in the interaction network and the most “dominant” annotations among these

neighbors to predict its function.
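A neighborhood-based prediction can be sketched as a simple vote: count the functional labels of the direct interaction partners of the query protein and return the k most frequent ones. The adjacency-dictionary representation, the alphabetical tie-breaking rule and all protein and function names below are assumptions made for illustration.

```python
from collections import Counter


def predict_functions(protein, network, annotations, k=1):
    """Return the k most frequent function labels among the direct neighbors of a protein.

    network: dict mapping a protein to the set of its interaction partners.
    annotations: dict mapping annotated proteins to a set of function labels.
    """
    counts = Counter()
    for neighbor in network.get(protein, ()):
        counts.update(annotations.get(neighbor, ()))
    # Sort by descending frequency, then alphabetically so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [function for function, _ in ranked[:k]]


# Hypothetical interaction network and annotations.
network = {"P1": {"P2", "P3", "P4"}}
annotations = {
    "P2": {"transport"},
    "P3": {"transport", "binding"},
    "P4": {"catalysis"},
}
top_function = predict_functions("P1", network, annotations, k=1)
```

This vote uses only direct neighbors and attaches no confidence score to its output, which is the main limitation that the more global approaches try to address.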

Fig. 7 Circle View (Cytoscape) of our data set

Global optimization-based approaches: In many cases, the neighborhood of the query

protein may not contain enough information, such as annotated proteins, for determining the

function of the query protein robustly. Under these conditions, it may be advantageous to

consider the structure of the entire network and use the annotations of the proteins indirectly

connected to the query protein also. The approaches in this category are based on this idea,

and in most cases, are based on the optimization of an objective function based on the

annotations of the proteins in the network.


Clustering-based approaches: The approaches in this category are based on the hypothesis that dense regions in the interaction network represent functional modules, which are natural units in which proteins perform their function. Thus, these approaches apply graph clustering algorithms to these networks and then determine the functions of unannotated proteins in the extracted modules using measures such as majority.

Association-based approaches: Recently, several computationally efficient algorithms have been proposed for finding frequently occurring patterns in data, in the field of association analysis in data mining [Tan et al. 2005]. The approaches in this category use these algorithms to detect frequently occurring sets of interactions in interaction networks of protein complexes, and hypothesize that these subgraphs denote functional modules. Function prediction from these modules is performed as in the clustering-based approaches.

• Biomedical literature: As in all other research communities, researchers in the fields of biology and medicine publish the results of their research in various journals and conferences. As a result, over the years, a huge repository of knowledge has been created in the form of papers, books, reports, theses and other such texts. Clearly, these repositories contain a huge amount of information about important biological concepts such as protein structure and function, cancer-causing genes and several others. Thus, there is great utility in mining these repositories and retrieving useful information, as shown in Fig. 8.

• Multiple data types: With a plethora of data being generated by a wide spectrum of proteomics experiments, it may be hypothesized that what cannot be discovered from one source of information may sometimes become obvious when multiple sources are analyzed simultaneously. This intuition has been made concrete by Kemmeren and Holstege [2003], who have suggested the following distinct advantages achieved by integrating functional genomics data:

Fig. 8 Biomedical literature


o Usually, individual biological data sets provide information about complementary biological processes, such as gene expression and protein interaction networks. Thus, combining them provides a global picture of the biological phenomena a set of genes is involved in.

o Often, data quality varies between different types of data, as well as between different sources of data of the same type. For instance, studies have shown significant variations between the qualities of different protein interaction data sets [Deng et al. 2003]. Thus, the combination of several data sources or types improves the quality of the overall data set, since the errors in one data set may be corrected by another.

o The most important advantage of the integrative approach is that, since only conclusions valid over a set of data types are accepted, the predictions made by this approach are usually more confident than those made on the basis of individual data sets.

Hence, now we have a clear idea regarding the different existing data types. So now let

us highlight about our work. Our objective is to assign un-annotated “protein pair” to

different functional groups. So we now focus on discussing the existing computational

techniques that use protein-protein interaction data to predict protein function. Protein

functionality can be predicted by neighborhood property which suggests that the PPI network,

neighbors of a particular protein have similar function. In the work of Schwikowski [1] a

neighborhood-counting method is proposed to assign k functions to a protein by identifying

the k most frequent functional labels among its interacting partners. It is simple and effective,

but the full topology is not considered and no confidence scores are assigned for the

annotations. In the chi-square method, Hishigaki et al. [2] assign to a protein the k functions with the k largest chi-square scores. For a protein P, each function f is assigned a score $(n_f - e_f)^2 / e_f$, where $n_f$ is the number of proteins in the n-neighborhood of P that have the function f, and $e_f$ is the expectation of this number based on the frequency of f among all proteins in the network. Chen et al. [3] extend this neighborhood property to higher levels in

the network. They infer the functional similarity between a protein and its level-1 and level-2 neighbors, and their algorithm assigns a weight to each of these neighbors according to its estimated functional similarity. Many graph algorithms have also been applied to functional analysis. Vazquez et al. [4] assign functions to proteins so as to maximize the connectivity among proteins assigned the same function. They map this task to an optimization problem, solved by simulated annealing, that maximizes the number of edges connecting proteins (un-annotated or previously annotated) assigned the same function. Karaoz et al. [5] apply a similar approach to a collection of PPI data and gene expression data. They construct a distinct network for each function in GO. For a particular function f, the state of each annotated protein v equals +1 if v has function f and -1 if v has a different function. Nabieva et al. [6] propose a flow-based approach to predict protein function from the protein interaction network. Considering both the local and global properties of the graph, this approach assigns a function to an un-annotated protein based on the amount of flow it receives during simulation, where each annotated protein is a source of functional flow. Deng et al. [7] propose an approach based on the theory of Markov random fields, in which they estimate the posterior probability that a protein of interest has a given function. Letovsky and Kasif [8] use loopy belief propagation with the assumption of a binomial model for the local neighbors of a protein annotated with a given term. Similarly, Wu et al. [9] propose a related probabilistic model to annotate the functions of unknown proteins based on the structure of the PPI network. Joshi et al. [10] develop a new integrated probabilistic method for


cellular function by combining information from protein-protein interactions, protein complexes, microarray gene expression profiles and annotations of known proteins through an integrative statistical model. In the work of Samanta et al. [11], a network-based statistical algorithm is proposed, which assumes that if two proteins share a significantly larger number of common interacting partners than expected, they share a common functionality. Another application is UVCLUSTER, proposed by Arnau et al. [12], which is based on bi-clustering and iteratively explores distance datasets. Apart from graph clustering, at an early stage Bader and Hogue [13] propose Molecular Complex Detection (MCODE), in which dense regions are detected according to some parameters. Altaf-Ul-Amin et al. [14] also use a clustering approach. It starts from a single node in a graph, and clusters are gradually grown until the similarity of every added node within a cluster and the density of the clusters reach a certain limit. Spirin and Mirny [15] use a graph clustering approach, based on super-paramagnetic clustering and a Monte Carlo algorithm, to detect modules that are densely connected within themselves but sparsely connected with the rest of the network. Przulj et al. [16] use a graph-theoretic approach in which clusters are identified using LEDA's routine components and those clusters are analyzed by the Highly Connected Subgraphs (HCS) algorithm. Later, King et al. [17] partition networks into clusters using a cost function, applying the Restricted Neighborhood Search Clustering algorithm (RNSC). Clusters are filtered according to their size, density and functional homogeneity. Krogan et al. [18] use the Markov clustering algorithm to predict protein function.
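
As a rough illustration of the two simplest neighborhood-based schemes reviewed above (neighborhood counting [1] and chi-square scoring [2]), the following minimal Python sketch may help. It is not taken from the cited papers: the function names, the toy protein identifiers P1 to P6 and the labels A and B are hypothetical, and the expectation e_f is assumed here to be the neighborhood size multiplied by the network-wide fraction of proteins annotated with f.

```python
from collections import Counter

def neighborhood_counting(neighbors, annotations, k):
    """Return the k most frequent functional labels among the neighbors."""
    counts = Counter(f for n in neighbors for f in annotations.get(n, ()))
    return [f for f, _ in counts.most_common(k)]

def chi_square_scores(neighbors, annotations, all_proteins):
    """Score each function f as (n_f - e_f)^2 / e_f.

    n_f: number of neighbors annotated with f.
    e_f: expected number, assumed to be
         len(neighbors) * (fraction of all proteins annotated with f).
    """
    if not neighbors:
        return {}
    total = len(all_proteins)
    freq = Counter(f for p in all_proteins for f in annotations.get(p, ()))
    observed = Counter(f for n in neighbors for f in annotations.get(n, ()))
    scores = {}
    for f, count in freq.items():
        e_f = len(neighbors) * count / total
        n_f = observed.get(f, 0)
        scores[f] = (n_f - e_f) ** 2 / e_f
    return scores

# Hypothetical toy network: P1 is un-annotated, its neighbors are P2, P3, P4.
annotations = {"P2": {"A"}, "P3": {"A"}, "P4": {"B"}, "P5": {"B"}, "P6": {"B"}}
all_proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
neighbors_of_p1 = ["P2", "P3", "P4"]

top_label = neighborhood_counting(neighbors_of_p1, annotations, k=1)
scores = chi_square_scores(neighbors_of_p1, annotations, all_proteins)
```

In this toy case both schemes favor label A: it is the most frequent label among the neighbors, and it also deviates most from its expected count.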

II. PRESENT WORK

o Motivation: Many approaches based on the protein-protein interaction (PPI) network have been discussed in the previous section. A study of the literature shows that very few assessments have been carried out on PPI networks that consider protein pairs and the interconnections within their PPI networks. This observation has encouraged us to work on the PPI network and to predict the function of unannotated protein pairs using a generic approach, which is discussed in the following sections.

o Dataset: In this work, the protein-protein interaction data of yeast (Saccharomyces cerevisiae), which contains 15613 genetic and physical interactions, is collected from ftp://ftpmips.gsf.de/yeast/PPI/. Self-interactions are discarded. A set of 12487 unique binary interactions involving 4648 proteins is taken as data. In our proposed method 15 functional groups are considered: cell cycle control (O1), cell polarity (O2), cell wall organization and biogenesis (O3), chromatin chromosome structure (O4), co-immuno-precipitation (O5), co-purification (O6), DNA Repair (O7), lipid metabolism (O8), nuclear-cytoplasmic transport (O9), pol II transcription (O10), protein folding (O11), protein modification (O12), protein synthesis (O13), small molecule transport (O14) and vesicular transport (O15). For each functional group, 90% of the protein pairs are taken as training samples and the rest (2-8%) are considered as test samples.
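
The preprocessing described in the Dataset paragraph (discarding self-interactions and keeping unique binary interactions) could be sketched as follows. This is only an illustrative sketch: the function name and the toy identifiers are hypothetical, the input is assumed to be already parsed into pairs of protein identifiers, and an interaction A-B is assumed to be the same as B-A.

```python
def unique_binary_interactions(raw_pairs):
    """Drop self-interactions and duplicate pairs (treating A-B and B-A as equal).

    The first occurrence of each pair is kept, in input order.
    """
    seen = set()
    unique = []
    for a, b in raw_pairs:
        if a == b:
            continue  # self-interaction: discarded
        key = frozenset((a, b))  # order-independent key for the pair
        if key not in seen:
            seen.add(key)
            unique.append((a, b))
    return unique

# Hypothetical toy input with one duplicate and one self-interaction.
raw = [("A", "B"), ("B", "A"), ("A", "A"), ("B", "C")]
cleaned = unique_binary_interactions(raw)
proteins = {p for pair in cleaned for p in pair}
```

Here `cleaned` holds two interactions involving three proteins; applied to the real data, the same counts would correspond to the 12487 interactions and 4648 proteins reported above.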

o Basic terminologies:

Protein interaction network: Protein–protein interactions occur when two or

more proteins bind together, often to carry out their biological function. Many of the most

important molecular processes in the cell such as DNA replication are carried out by large


molecular machines that are built from a large number of protein components organized by their protein–protein interactions. These protein interactions form a network-like structure known as a protein interaction network. Here the protein interaction network is represented as a graph $G_P$ which consists of a set of vertices (nodes) V connected by edges (links) E. Thus $G_P = (V, E)$. Each protein is represented as a node and the interactions between proteins are represented by edges.

Sub graph: A graph $G'_P$ is a sub graph of a graph $G_P$ if the vertex set of $G'_P$ is a subset of the vertex set of $G_P$ and the edge set of $G'_P$ is a subset of the edge set of $G_P$. That is, if $G'_P = (V', E')$ and $G_P = (V, E)$, then $G'_P$ is called a sub graph of $G_P$ if $V' \subseteq V$ and $E' \subseteq E$. $G'_P$ may be defined as the set $K \cup U$, where K represents the set of annotated protein pairs while U represents the set of un-annotated protein pairs.

Level-1 neighbors: In G´P, the directly connected neighbors of a particular vertex are called

level-1 neighbors.

o Proposed Work: The proposed work deduces the PPI network of each individual protein belonging to an unannotated protein pair chosen from the data set mentioned earlier. The common interactions between the deduced PPI networks are then identified, and the success rate is estimated using a Generic Approach for predicting the function of unannotated protein pairs.

o Method: Given $G'_P$, a sub graph of the protein interaction network consisting of protein pairs, each associated with an element of the set O = {O1, O2, O3, …, O15}, where Oi represents a particular functional group, this method maps each element of the set of un-annotated protein pairs U to an element of O. The steps of the method are as follows:

Step 1: Take any protein pair as an element from set U.

Step 2: Deduce the PPI network for each protein belonging to the protein pair selected in Step 1.

Step 3: Find the common interacting pairs between the PPI networks deduced in Step 2.

Step 4: Count the number of occurrences Si (i = 1, …, 15) of each functional group Oi of set O = {O1, O2, O3, …, O15} among the common interacting pairs found in Step 3.

Step 5: Assign the Oi of set O = {O1, O2, O3, …, O15} corresponding to Max(Si), i = 1, …, 15, to the unannotated protein pair considered in Step 1.
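
The five steps can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the "PPI network" of a protein is taken to be its level-1 neighbors, and a "common interacting pair" is interpreted as an annotated interaction that links a level-1 neighbor of one protein to a level-1 neighbor of the other (as in the YDL080c-YPL078c example of Fig 9). Ties between functional groups are broken arbitrarily.

```python
from collections import Counter, defaultdict

def build_adjacency(interactions):
    """Undirected adjacency sets built from binary interactions."""
    adj = defaultdict(set)
    for a, b in interactions:
        if a != b:  # ignore self-interactions
            adj[a].add(b)
            adj[b].add(a)
    return adj

def predict_pair_function(pair, interactions, pair_function):
    """Assign a functional group to an un-annotated protein pair (Steps 1-5).

    pair_function maps frozenset({a, b}) -> functional group label for
    annotated pairs. Returns None if no annotated common pair is found.
    """
    p, q = pair                                   # Step 1
    adj = build_adjacency(interactions)
    neigh_p = adj[p] - {q}                        # Step 2: level-1 neighbors
    neigh_q = adj[q] - {p}
    counts = Counter()
    for a in neigh_p:                             # Step 3: common interacting pairs
        for b in neigh_q:
            if b in adj[a]:
                label = pair_function.get(frozenset((a, b)))
                if label is not None:
                    counts[label] += 1            # Step 4: occurrences S_i
    if not counts:
        return None
    return counts.most_common(1)[0][0]            # Step 5: group with Max(S_i)

# Level-1 neighbors of the pair YAL011w-YDL181w as listed in the illustration.
interactions = [
    ("YAL011w", "YDL181w"),
    ("YAL011w", "YDR146c"), ("YAL011w", "YCR033w"), ("YAL011w", "YDR181c"),
    ("YAL011w", "YDL080c"), ("YAL011w", "YDR269w"),
    ("YDL181w", "YPL078c"), ("YDL181w", "YPL240c"),
    ("YDL181w", "YBR118w"), ("YDL181w", "YER148w"),
    ("YDL080c", "YPL078c"),
]
pair_function = {frozenset(("YDL080c", "YPL078c")): "O7"}  # DNA Repair

predicted = predict_pair_function(("YAL011w", "YDL181w"), interactions, pair_function)
```

With these inputs the only annotated common interacting pair is YDL080c-YPL078c, so `predicted` is O7.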

o Illustration of the method with an example:

An un-annotated protein pair YAL011w-YDL181w is taken from our test dataset U and is shown in yellow in Fig 9. From $G_P$, the sub graph $G'_{\text{YAL011w}}$ is taken, whose level-1 neighbors are YDR146c, YCR033w, YDR181c, YDL080c and YDR269w. Similarly, the level-1 neighbors are taken for $G'_{\text{YDL181w}}$, which are YPL078c, YPL240c, YBR118w and YER148w. Two functional groups (DNA repair and cell polarity) are involved at level 1, as shown in Fig 9.


Fig. 9 Sub-graph G´P of Protein pair YAL011w-YDL181w and its level-1 neighbor

Then the common interacting pairs between $G'_{\text{YAL011w}}$ and $G'_{\text{YDL181w}}$ are considered. In Fig 9 there is only one common interacting pair, YDL080c-YPL078c, which is marked in green. From our dataset, the protein pair YDL080c-YPL078c belongs to the functional group DNA Repair (O7). The number of occurrences of each functional group among the common interacting pairs is then listed, and the functional group with the highest number of occurrences is assigned to the unannotated protein pair. Since Fig 9 contains one interacting pair of O7, we assign O7 to the unannotated protein pair YAL011w-YDL181w.

Fig. 10 Sub-graph G´P of Protein pair YMR236w-YHR099w and its level-1 neighbor


Another example of a sub graph obtained in our work is shown in Fig. 10; the function of YMR236w-YHR099w is predicted in the same way as described earlier. In our work, we select unannotated protein pairs and predict their functional groups using the Generic approach, as shown in TABLE-I. By counting the matched and unmatched predictions, we obtain the success rate, or probability of success, as shown in TABLE-II.

TABLE - I

No.  Unannotated protein pair  Original function              Predicted function                     Result
1    YNL250w|YKL101w           Cell cycle control             Cell cycle control                     ✓
2    YBR023c|YER111c           Cell cycle control             Cell cycle control                     ✓
3    YPL174c|YLR210w           Mitosis                        Mitosis                                ✓
4    YLR229c|YPL161c           Two hybrid                     Two hybrid                             ✓
5    YBR023c|YLR370c           Cell polarity                  Cell polarity                          ✓
6    YNL233w|YCR009c           Cell polarity                  Cell wall organization and biogenesis  ✗
7    YBL061c|YLR342w           Cell polarity                  Cell polarity                          ✓
8    YFR036w|YLR127c           Coimmunoprecipitation          Coimmunoprecipitation                  ✓
9    YDR108w|YML077w           Coimmunoprecipitation          Coimmunoprecipitation                  ✓
10   YFR002w|YGR119c           Two hybrid                     Two hybrid                             ✓
11   YBL014c|YML043c           Coimmunoprecipitation          Affinity purification                  ✗
12   YBR193c|YOL135c           Coimmunoprecipitation          Coimmunoprecipitation                  ✓
13   YBL084c|YDR118w           Coimmunoprecipitation          Coimmunoprecipitation                  ✓
14   YDR145w|YGR252w           Copurification                 Copurification                         ✓
15   YHR099w|YOL148c           Copurification                 Copurification                         ✓
16   YHR099w|YMR236w           Copurification                 Copurification                         ✓
17   YGL112c|YHR099w           Copurification                 Copurification                         ✓
18   YBR081c|YDR392w           Copurification                 Copurification                         ✓
19   YGL097w|YIL063c           Copurification                 Copurification                         ✓
20   YGL097w|YIL063c           Synthetic lethal               Synthetic lethal                       ✓
21   YDR145w|YDR176w           Copurification                 Copurification                         ✓
22   YDR145w|YLR055c           Copurification                 Copurification                         ✓
23   YNL273w|YGL163c           DNA repair                     DNA repair                             ✓
24   YCL061c|YMR190c           DNA repair                     DNA repair                             ✓
25   YKL113c|YDR369c           DNA repair                     DNA repair                             ✓
26   YGR078c|YFR019w           Lipid metabolism               Lipid metabolism                       ✓
27   YBR023c|YFR019w           Lipid metabolism               Lipid metabolism                       ✓
28   YCL061c|YAR002w           Nuclear-cytoplasmic transport  Nuclear-cytoplasmic transport          ✓
29   YLR418c|YLR384c           Pol II transcription           Pol II transcription                   ✓
30   YLR418c|YJR140c           Pol II transcription           Pol II transcription                   ✓
31   YPR135w|YGL244w           Pol II transcription           Pol II transcription                   ✓
32   YPR135w|YHR200w           Pol II transcription           Pol II transcription                   ✓
33   YOR070c|YJR032w           Protein folding                Protein folding                        ✓
34   YDR420w|YDR245w           Protein modification           Protein modification                   ✓
35   YLR418c|YDR363w-a         Vesicular transport            Vesicular transport                    ✓
36   YLR039c|YLR360w           Vesicular transport            Vesicular transport                    ✓

TABLE - II

Total no. of unannotated protein pairs  Matched  Unmatched  Success rate (%)
36                                      34       2          94.4


III. RESULTS & DISCUSSION

The above method is evaluated by its success rate, which is defined as

$$\text{Success rate} = \frac{\text{number of protein pair functions predicted correctly}}{\text{total number of unannotated protein pairs}}$$
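
As a quick check of this definition, the overall figure of TABLE-II and the per-group figures of TABLE-III can be recomputed from the reported counts. The short Python sketch below uses a hypothetical helper name; the counts are those reported in the tables.

```python
def success_rate(matched, total):
    """Fraction of un-annotated protein pairs whose function was predicted correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matched / total

# Overall counts from TABLE-II: 34 matched out of 36 pairs.
overall = success_rate(34, 36)
overall_percent = round(overall * 100, 1)

# Per-group counts from TABLE-III: group -> (pairs tested, pairs matched).
per_group = {
    "O6": (8, 8), "O5": (5, 4), "O10": (4, 4),
    "O15": (2, 2), "O2": (3, 2), "O7": (3, 3),
}
rates = {g: success_rate(m, t) for g, (t, m) in per_group.items()}
```

The overall value rounds to 94.4 %. For O2 the exact ratio is 2/3, which the paper reports truncated as 0.66.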

In our work, we predict the functions of protein pairs using the Generic Approach algorithm and estimate the success rate for the 15 functional groups considered. The probability of success for six of these functional groups (co-purification (O6), co-immuno-precipitation (O5), pol II transcription (O10), vesicular transport (O15), DNA Repair (O7) and cell polarity (O2)) is shown in tabular and pictorial form in TABLE-III and Fig. 12 respectively.

TABLE - III

Functional group  No. of unannotated protein pairs  No. of matched protein pairs  Probability of success
O6                8                                 8                             1
O5                5                                 4                             0.8
O10               4                                 4                             1
O15               2                                 2                             1
O2                3                                 2                             0.66
O7                3                                 3                             1

Fig. 12 Pictorial representation of success rate for six functional groups (series shown: number of unannotated protein pairs, number of matched protein pairs, probability of success).

Our proposed work adds an extra dimension to existing graph-theoretic methods, as it computes the functions of unannotated protein pairs, instead of single proteins, by considering level-1 neighbors. We expect the performance of the generic approach to improve if a larger interaction network and level-2 neighbors are considered. In future, our aim is to work with more functional groups and with different organisms.


REFERENCES

[1] B. Schwikowski, P. Uetz and S. Fields, A network of protein-protein interactions in yeast. Nature Biotech. 18, 1257-1261, 2000.

[2] H. Hishigaki, K. Nakai, T. Ono, A. Tanigami and T. Takagi, Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 18, 523-531, 2001.

[3] J. Chen, W. Hsu, M. L. Lee and S. K. Ng, Labeling network motifs in protein interactomes for protein function prediction. Proc. 23rd International Conference on Data Engineering (ICDE), 546-555, 2007.

[4] Vazquez, Global protein function prediction from protein-protein interaction networks. Nature Biotechnology 21, 697-700, June 2003.

[5] U. Karaoz, T. M. Murali, S. Letovsky, Y. Zheng, C. Ding, C. R. Cantor and S. Kasif, Whole-genome annotation by using evidence integration in functional-linkage.

[6] E. Nabieva, K. Jim, A. Agarwal, B. Chazelle and M. Singh, Whole proteome prediction of protein functions via graph-theoretic analysis of interaction maps. Bioinformatics 21 (Suppl 1), i302-i310, 2005.

[7] M. Deng, Inferring domain-domain interactions from protein-protein interactions. Genome Res. 12(10), 1540-1548, 2002.

[8] S. Letovsky and S. Kasif, Predicting protein function from protein-protein interaction data: a probabilistic approach. Bioinformatics 19 (Suppl 1), i197-i204, 2003.

[9] D. D. Wu and X. Hu, An efficient approach to detect a protein community from a seed. 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2005), La Jolla, CA, USA: IEEE, pp. 135-141, 2005.

[10] Vazquez, Global protein function prediction from protein-protein interaction networks. Nature Biotechnology 21, 697-700, June 2003.

[11] M. P. Samanta and S. Liang, Predicting protein functions from redundancies in large scale protein interaction networks. Proc Natl Acad Sci USA 100, 12579-12583, 2003.

[12] V. Arnau, S. Mars and I. Marin, Iterative cluster analysis of protein interaction data. Bioinformatics 21, 364-378, 2005.

[13] G. D. Bader and C. W. Hogue, An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 2, 2003.

[14] M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa and S. Kanaya, Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics 7, 207, 2006.

[15] V. Spirin and L. A. Mirny, Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA 100, 12123-12128, 2003.

[16] A. D. King, N. Przulj and I. Jurisica, Protein complex prediction via cost-based clustering. Bioinformatics 20, 3013-3020, 2004.

[17] S. Asthana, O. D. King, F. D. Gibbons and F. P. Roth, Predicting protein complex membership using probabilistic network reliability. Genome Res. 14, 1170-1175, 2004.

[18] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo and A. Ignatchenko, Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440, 637-643, 2006.

[19] R. Deepalakshmi and C. Jothi Venkateswaran, A survey on mining methods for protein sequence analysis: an aerial view. International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, pp. 28-34, 2012, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.