Abstract Background: In this work, a candidate gene prioritization method is described, and based on protein-protein interaction network (PPIN) analysis

Abstract

• Background: This work describes a candidate gene prioritization method based on protein-protein interaction network (PPIN) analysis and compares it with ToppGene (a functional prioritization method).

• Results: For the first time, the PageRank and HITS algorithms and the K-Step Markov method, all used in Web and social network analysis, are applied to a PPIN to prioritize disease candidate genes.

• Conclusion: PPIN-based candidate gene prioritization performs better than all other individual gene features or annotations. It can be successfully used for disease candidate gene prioritization.

Background-1

• Most current disease candidate gene identification and prioritization methods rely on functional annotations from different data sources: GO, pathways, domains, expression.

• In their recent work, the authors used a functional prioritization method named ToppGene, which integrates functional data with Mouse Phenotype data. ToppGene outperformed the other published functional prioritization methods.

• These methods are limited by the coverage of gene functional annotation:

- only a fraction of the human genome is annotated with pathways and phenotypes

- 2/3 of all genes have at least one functional annotation

- 1/3 is yet to be annotated

Background-2

Different approach

• In this study, the authors applied social and Web network analysis-based algorithms to a PPIN for the first time to prioritize disease candidate genes.

• The PPIN is represented as an unweighted, undirected, simple graph G(V, E): genes are nodes and interactions are edges, where V is the set of all genes and E is the set of all interactions.

• The set of known disease genes (seeds) is denoted as R.
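The graph definition above can be sketched in a few lines of Python. This is only an illustration: the function name `build_ppin`, the adjacency-map layout, and the toy identifiers `g1` to `g4` are assumptions made here, not data or code from the study.

```python
from collections import defaultdict


def build_ppin(interactions):
    """Build an unweighted, undirected, simple graph as an adjacency map."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # simple graph: self-loops are dropped
        adjacency[a].add(b)  # undirected: store both directions
        adjacency[b].add(a)
    return dict(adjacency)


# Toy interaction list (hypothetical identifiers, not real interactions).
# It contains a duplicate pair and a self-loop, both of which collapse away.
interactions = [("g1", "g2"), ("g2", "g1"), ("g1", "g3"), ("g3", "g4"), ("g4", "g4")]

G = build_ppin(interactions)
V = set(G)                                           # all genes (nodes)
E = {frozenset((u, v)) for u in G for v in G[u]}     # all interactions (edges)
R = {"g3"}                                           # seed set: known disease genes
```

Storing neighbors in sets makes duplicate interactions harmless, which matches the "simple graph" requirement.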

• The prioritization approaches are based on the methods of White and Smyth, whose framework of four successive problem formulations defines how to rank nodes in the unweighted graph G(V, E).

Methods-1

White and Smyth problem formulations:

1. Given G, where t and r are both nodes in G, compute the importance I(t|r) of node t with respect to the root r.

2. Given G and a root node r in G, rank all vertices in T, a subset of the vertices in G: for each node t in T, compute I(t|r).

3. Given G and a set of root nodes R in G, rank all vertices in T. I(t|R) is the average of the importance of t with respect to each root node r in R:

I(t|R) = (1/|R|) Σ_{r ∈ R} I(t|r)

4. Given G, rank all nodes where R = T = V.

• Formulation 3 is the one needed in this study: the problem is to prioritize a set of genes in the network based on their importance to a set of root genes (genes known to be associated with a disease).

• The importance of a gene to the set of root genes is simply the average of its importance with respect to each individual root gene.
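The averaging step of formulation 3 can be written out as a minimal sketch. The pairwise importance I(t|r) is passed in as a function, since the slides leave its computation to the three algorithms; the lookup table below holds made-up values purely for illustration, and all names are hypothetical.

```python
def importance_to_root_set(t, roots, importance):
    """I(t|R): average of I(t|r) over every root r in R."""
    if not roots:
        raise ValueError("root set R must not be empty")
    return sum(importance(t, r) for r in roots) / len(roots)


def rank_candidates(candidates, roots, importance):
    """Order candidate genes by decreasing I(t|R)."""
    scores = {t: importance_to_root_set(t, roots, importance) for t in candidates}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up pairwise importances I(t|r) for two candidates and two roots.
pairwise = {
    ("c1", "r1"): 0.5, ("c1", "r2"): 0.25,
    ("c2", "r1"): 0.25, ("c2", "r2"): 1.0,
}


def lookup(t, r):
    return pairwise[(t, r)]


roots = ["r1", "r2"]
score_c1 = importance_to_root_set("c1", roots, lookup)   # (0.5 + 0.25) / 2
score_c2 = importance_to_root_set("c2", roots, lookup)   # (0.25 + 1.0) / 2
ranking = rank_candidates(["c1", "c2"], roots, lookup)
```

Here `c2` ranks above `c1` because its average importance across both roots is higher, even though `c1` is more important to `r1` alone.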

Methods-2

• The solution is to find I(t|r), the importance of the node t with respect to a root node r.

• The authors used three algorithms from White and Smyth:

1. PageRank

2. HITS

3. K-Step Markov
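As a rough illustration of the first algorithm, the sketch below implements a PageRank-with-priors style iteration on an undirected graph. The slides give no formula or parameter values, so the update rule (a random walk that jumps back to the root set with probability `beta`), the default `beta=0.3`, the iteration count, and the toy path graph are all assumptions made here; this is not the authors' implementation.

```python
def pagerank_with_priors(adjacency, roots, beta=0.3, iterations=100):
    """Score nodes by a random walk that restarts at the root set.

    adjacency: dict mapping node -> set of neighbors (undirected graph,
               assumed to have no isolated nodes)
    roots:     non-empty set of root nodes, all present in adjacency
    beta:      assumed probability of jumping back to a root at each step
    """
    nodes = list(adjacency)
    # Prior: uniform over the roots, zero elsewhere.
    prior = {v: (1.0 / len(roots) if v in roots else 0.0) for v in nodes}
    score = dict(prior)
    for _ in range(iterations):
        new_score = {}
        for v in nodes:
            # Each neighbor u passes an equal share of its score to v.
            walk = sum(score[u] / len(adjacency[u]) for u in adjacency[v])
            new_score[v] = (1.0 - beta) * walk + beta * prior[v]
        score = new_score
    return score


# Toy path graph a - b - c - d with a single root "a" (hypothetical data).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
scores = pagerank_with_priors(path, roots={"a"})
order = sorted(scores, key=scores.get, reverse=True)
```

On this toy graph the scores decay with distance from the root, so nodes closer to the seed rank higher. With several roots, the uniform prior plays the role of the 1/|R| averaging in formulation 3.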

Methods-3

Human protein interactions network

• Human protein-protein interactions were extracted from the NCBI Entrez Gene FTP site (sources: BIND, BioGRID, HPRD), giving a network with 8340 nodes and 27250 edges.

Evaluation of PPIN for gene prioritization

• The authors used the same training data as in their previous study, comprising 19 diseases from the OMIM (Online Mendelian Inheritance in Man) and GAD (Genetic Association Database) databases.

A total of 693 associated genes. 589 genes were used in the cross validation.

Cardiac septal defect candidate gene prioritization

• From NCBI’s OMIM database, 166 OMIM records labeled “atrial septal defect” were extracted. 81 genes were mapped to these records and used as the training set.

431 genes (from interactions) were used for ranking (test set).

Results-1

Cross validation

13 conditions (3 algorithms with different parameter settings), each repeated 5 times.

Rank-based ROC curves were plotted, and AUC values were used to quantitatively measure the performance.
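To make the AUC measure concrete, here is a minimal sketch that computes it directly from prioritization scores as the fraction of (disease gene, non-disease gene) pairs in which the disease gene scores higher, with ties counted as half. This pairwise formulation is a standard equivalent of the area under the ROC curve; the exact cross-validation protocol of the study is not given in the slides, and the scores and labels below are invented for illustration.

```python
def auc_from_scores(scores, positives):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for gene, s in scores.items() if gene in positives]
    neg = [s for gene, s in scores.items() if gene not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Invented scores: g1 and g3 play the role of held-out disease genes.
toy_scores = {"g1": 0.9, "g2": 0.8, "g3": 0.4, "g4": 0.1}
toy_auc = auc_from_scores(toy_scores, positives={"g1", "g3"})
```

An AUC of 0.5 corresponds to random ranking and 1.0 to every disease gene ranked above every other gene.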

Results-2

Results-3

Top 20 ranked genes

* Genes associated with cardiac development or malformation: 15 ToppGene, 14 PPIN-based method

# (hash) Genes associated with septal defects: 6 ToppGene, 3 PPIN-based method

Combining functional annotation- and PPIN-based methods is more effective for identifying and ranking disease candidate genes.

Mouse embryos lacking p300 protein (EP300 gene) show ventricular septal defects.

In mice, truncated CBP protein (CREBBP gene) leads to atrial and ventricular septal defects.

Mice with a deletion of Erbb2 show ventricular septal defects (VSD), suggesting that the human ortholog ERBB2 could be a potential candidate gene for VSD.

Results-4

Prioritized candidate genes of cardiac septal defects using both functional annotation- and PPIN- based methods.

Results-5

AUC of different feature sets. Red bars indicate the AUC scores based on each feature set, and blue bars are the corresponding random controls.

Conclusions-1

• The PageRank, HITS, and K-Step Markov algorithms were applied to a literature-based, manually curated protein interaction network.

• Goal: to prioritize disease candidate genes.

Known disease-related genes were used as a training set ("seeds"), and the candidate genes were ranked.

• Network-based methods are generally not as effective as the integrated functional annotation-based methods.

• Compared with the individual functional annotation features, network-based methods perform better than every single annotation.

• Therefore, PPINs can be a good feature for disease candidate gene prioritization, especially when the genes lack all other functional annotations or are sparsely annotated.

Conclusions-2

• Limitations: Just like functional annotation-based methods, the performance depends on the quality of interaction data (missing interactions and false positives).

Solutions:

• better fit with biological networks (e.g., using weighted nodes (genes or proteins) or weighted edges (interactions)).

• integrate the method with other methods (e.g., combining results from functional annotation-based methods and expression profiles with network-based approaches).

• It is expected that using both functional annotations and PPIN-based topological parameters may better facilitate the discovery and prioritization of disease genes.
