
Interdisciplinary Seminar on Biological Networks - CWI

Page 1: Interdisciplinary Seminar on Biological Networks - CWI

Interdisciplinary Seminar on Biological Networks

Gunnar Klau
7 Feb 2012 - 12 June 2012
http://www.cwi.nl/biological-networks-uu-2012

INTRODUCTION

■ Round of introduction

■ Rest of today
• Rules
• Short introduction of possible topics

■ Website: http://www.cwi.nl/biological-networks-uu-2012 (will be kept up to date)

■ I am always available for questions, discussion etc. at CWI or in Utrecht (Tuesdays)

RULES I: REGISTRATION

■ Formal registration: Utrecht students. There is a vakcode (Osiris)

■ External students: check with study advisor

■ Bureaucracy (from my side) will be kept to a minimum

Wednesday, February 8, 12

RULES II: CREDITS AND GRADE

■ 7.5 credits

■ Grade
• 40% presentation
• 30% participation
• 30% project

RULES III: PROCEDURE

■ Everybody reads the material before the meeting (it will be on the website; login/password will be sent by e-mail)

■ Presentation of topic (45-60 min)

■ Last slide(s): interesting questions to stimulate discussion

■ Discussion (~30 min)

■ At least 1 week before your meeting, discuss your slides with me, preferably face-to-face (e.g., before the preceding seminar)

RULES IV: MORE RULES

■ Interdisciplinarity
• different cultures
• respect
• learn from each other

POSSIBLE SEMINAR DATES

No.  Date   Topic                                  Speaker
1    7-2    Introduction, organizational issues    Gunnar Klau
     14-2   no seminar (GWK conference)
2    21-2   cell biology, biological networks      Oleksandr Ivanov
3    28-2   network properties and models          Martin van Meerkerk
4    6-3
5    13-3
6    20-3
7    27-3
8    3-4
9    10-4
10   17-4
     24-4   no seminar (GWK conference)
11   1-5
12   8-5
13   15-5
14   22-5
     29-5   no seminar (pentecost)
15   5-6    project presentations
16   12-6   project presentations

BIOLOGICAL NETWORKS

■ Coordinated interactions between gene products comprise and control many fundamental cellular processes
• protein complexes
• metabolism

■ signaling pathways

■ synthesis and activity of other gene products

Image credits: [Nature Publishing Group, 2007], [http://encognitive.com], [Christopher Bickel, AAAS], [http://www.stemcellschool.org/ig-transcription-factor.html]

BIOLOGICAL QUESTIONS

■ understand network organization/structural design principles/evolution

■ functional annotation

■ discover (perturbed) cellular mechanisms

■ find disease-causing genes

■ detect orthologous genes

■ classify phenotypes

■ ...

COMPUTATIONAL CHALLENGES

■ characterize network properties (connectivity, robustness, degree distribution, ...), network model

■ find overrepresented motifs

■ find modules/complexes

■ find active/deregulated subnetworks

■ predict node/edge attributes

■ find commonalities (alignment)

■ integrate network/data

■ infer, dynamics, stochasticity

■ ...

POSSIBLE TOPICS

■ Own suggestion (if appropriate)

■ Or choose from the following suggestions
• Personal taste. Mix of “important” and “interesting” papers
• Chosen because of method ➞ more generally applicable
• Can give more information upon request
• Many topics covered in Roded Sharan’s lecture “Analysis of biological networks” (Tel Aviv University)

■ Finish schedule next week (14 February!)

SUGGESTION 1

■ Network motifs

■ = significantly over-represented interaction patterns ➞ functional importance

■ Milo et al., Science, 2002 (analyzed different networks)

■ Zhang et al., J. Biol., 2005 (analyzed integrated network)

Figure 1: Source [20]. (A) The different types of networks tested by Milo et al. [20]. (B) All the possible directed motifs for n = 3.
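To make "significantly over-represented" concrete, here is a minimal sketch, not the procedure of Milo et al. or Zhang et al. It counts one 3-node motif (the feed-forward loop) in a directed network and compares the count with degree-preserving randomizations obtained by edge swapping. The function names, the number of swaps and the use of a z-score are illustrative assumptions.

```python
import random
from statistics import mean, pstdev

def count_ffl(edges):
    """Count feed-forward loops a->b, b->c, a->c with a, b, c distinct."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    count = 0
    for a, outs in succ.items():
        for b in outs:
            if b == a:
                continue
            for c in succ.get(b, ()):
                if c != a and c != b and c in outs:
                    count += 1
    return count

def randomize(edges, n_swaps, rng):
    """Degree-preserving randomization: swap the targets of two random edges."""
    edges = list(edges)
    edge_set = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        # Reject swaps that would create self-loops or duplicate edges.
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges

def motif_zscore(edges, n_random=100, seed=0):
    """Return (observed count, mean random count, z-score or None)."""
    rng = random.Random(seed)
    observed = count_ffl(edges)
    counts = [count_ffl(randomize(edges, 10 * len(edges), rng))
              for _ in range(n_random)]
    mu, sigma = mean(counts), pstdev(counts)
    z = (observed - mu) / sigma if sigma > 0 else None
    return observed, mu, z

# Hypothetical toy network: a directed 6-cycle plus two shortcut edges.
toy_edges = [(i, (i + 1) % 6) for i in range(6)] + [(0, 2), (1, 3)]
print(motif_zscore(toy_edges, n_random=50))
```

A large positive z-score would suggest the motif occurs more often than expected given the degree sequence. Real analyses enumerate all subgraphs of a given size and need a much larger network than this toy example.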

1.1.1 Simulated Annealing

Simulated Annealing is based on an analogy to the cooling of heated metals. In any heated metal sample, the probability of some cluster of atoms at a position, $r_i$, exhibiting a specific energy state, $E(r_i)$, at some temperature, $T$, is defined by the Boltzmann probability factor:

$$P(E(r_i)) = e^{-\frac{E(r_i)}{k_B T}}$$

where $k_B$ is Boltzmann’s constant. A metal is slowly cooled until it approaches a ground state, a highly ordered form in which there is very little probability for the existence of a high energy state throughout the material. If the energy function of this physical system is replaced by an objective function, $f(x)$, then the slow progression towards an ordered ground state is representative of a progression to a global optimum.

The randomization procedure used to ensure the properties that were previously mentioned is based on the Simulated Annealing algorithm ([11], sections 10.5.1 and 10.5.3):

Start with the system in a known configuration, at a known energy E
T = hot; frozen = false
while (!frozen) {
    repeat {
        Perturb system slightly (e.g. edge-swapping)
        Compute ΔE - the change in energy due to the perturbation
        if (ΔE < 0)
            accept this perturbation, this is the new system configuration
        else
            accept with probability = exp(-ΔE/KT)
    } until the system is at thermal equilibrium at this T
    if (ΔE still decreasing over the last few temperatures)
        T = 0.9T   // cool the temperature; do more perturbations
    else
        frozen = true
}
return final configuration as low energy solution
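The pseudocode above can be turned into a small runnable sketch. Two simplifications are assumptions of this sketch and not part of the source: "thermal equilibrium" is approximated by a fixed number of perturbations per temperature, and "frozen" is approximated by a temperature floor `t_min`. The constant K is folded into the temperature. All names and default parameter values are illustrative.

```python
import math
import random

def simulated_annealing(state, energy, perturb, t_hot=10.0, t_min=1e-3,
                        cooling=0.9, steps_per_temp=100, seed=0):
    """Minimize energy(state) by simulated annealing; returns (state, energy)."""
    rng = random.Random(seed)
    e = energy(state)
    t = t_hot
    while t > t_min:                      # assumption: frozen once T is tiny
        for _ in range(steps_per_temp):   # assumption: stand-in for equilibrium
            candidate = perturb(state, rng)
            delta = energy(candidate) - e
            # Always accept improvements; accept worsening moves with
            # probability exp(-delta / T), as in the pseudocode.
            if delta < 0 or rng.random() < math.exp(-delta / t):
                state, e = candidate, e + delta
        t *= cooling                      # T = 0.9 T
    return state, e

# Toy objective: f(x) = (x - 3)^2 over the integers, perturbed by +/- 1 steps.
final_x, final_e = simulated_annealing(
    50,
    energy=lambda x: (x - 3) ** 2,
    perturb=lambda x, rng: x + rng.choice((-1, 1)),
)
print(final_x, final_e)
```

For network randomization, `state` would be an edge list, `perturb` an edge swap, and `energy` a measure of how far the randomized network is from the properties that should be preserved.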

SUGGESTION 2

■ PPI networks

■ What are they and how good are they?
• experimental techniques and their quality
• prediction techniques
• quality assessment
• Sharan Lec. 3, Sect. 1, 2

Figure 10: Source [30]: Quantitative comparison of interaction data sets. The various data sets are benchmarked against a reference set of 10,907 trusted interactions, which are derived from protein complexes annotated manually at MIPS and YPD. Coverage and accuracy are lower limits owing to incompleteness of the reference set. Each dot in the graph represents an entire interaction data set, and its position specifies coverage and accuracy (on a log-log scale). For the combined evidence, only interactions supported by an agreement of two (or three) of any of the methods shown were considered. For most data sets, raw and filtered data are shown, demonstrating the trade-off between coverage and accuracy achieved by filtering.

Figure 11: Source [30]: Accuracy and coverage rates for various high scale methods (corresponds to the data in Figure 10).

In large scale experiments the results for identical spatial localization for the two interacting proteins vary from 40% to 80%. When considering overlapping datasets from high throughput experiments, the localization level is a little higher. These results indicate that a large percentage of the interactions obtained are false positives.

2.3.3 Correlation with Functional Similarity

An independent measure to assess the 'quality of experiment' is the functional correlation between the interacting proteins. This measure of assessment can be visualized using a functional correlation matrix. Each axis of the matrix represents 13 functional categories. Each of the yeast genes has been categorized into a single functional category using MIPS. The color in each box represents the number of interactions that occur between proteins from the appropriate functional categories. Proteins with related functions

SUGGESTION 3

■ PPI networks for functional protein annotation

■ proteins involved in same function are more likely to interact

• local methods (Schwikowski, Hishigaki)

REVIEW

Network-based prediction of protein function

Roded Sharan¹, Igor Ulitsky¹ and Ron Shamir*

School of Computer Science, Tel Aviv University, Tel Aviv, Israel
* Corresponding author. School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel. Tel.: +972 3 6405383; Fax: +972 3 6405384; E-mail: [email protected]

¹ These authors contributed equally to this work.

Received 20.9.06; accepted 9.1.07

Functional annotation of proteins is a fundamental problem in the post-genomic era. The recent availability of protein interaction networks for many model species has spurred on the development of computational methods for interpreting such data in order to elucidate protein function. In this review, we describe the current computational approaches for the task, including direct methods, which propagate functional information through the network, and module-assisted methods, which infer functional modules within the network and use those for the annotation task. Although a broad variety of interesting approaches has been developed, further progress in the field will depend on systematic evaluation of the methods and their dissemination in the biological community.

Molecular Systems Biology 13 March 2007; doi:10.1038/msb4100129
Subject Categories: computational methods; proteins
Keywords: data integration; function prediction; protein interaction networks; protein modules

Introduction

The past decade has seen a revolution in sequencing technologies, resulting in hundreds of sequenced genomes. A fundamental challenge of the post-genomic era is the interpretation of this wealth of data to elucidate protein function. To date, even for the most well-studied organisms such as yeast, about one-fourth of the proteins remain uncharacterized (Figure 1).

Classical computational approaches to gene annotation collect for each protein a set of features characterizing it, and apply machine-learning algorithms to infer annotation rules based on those features (Pavlidis et al, 2001). The newly available large-scale networks of molecular interactions within the cell have made it possible to go beyond these one-dimensional approaches, and study protein function in the context of a network. In particular, novel high-throughput technologies for protein–protein interaction (PPI) measurements (Aebersold and Mann, 2003; Fields, 2005) have created large-scale data on protein interaction across human and most model species. These data are commonly represented as networks, with nodes representing proteins and edges representing the detected PPIs.

In this review, we survey the growing body of works on functional annotation of proteins via their network of interactions (summarized in Table I). We distinguish two types of approaches (Figure 2): direct annotation schemes, which infer the function of a protein based on its connections in the network, and module-assisted schemes, which first identify modules of related proteins and then annotate each module based on the known functions of its members. Naturally, the presented methods and the emphasis on particular ones reflect the opinions of the authors.

Direct methods

The common principle underlying all direct methods for functional annotation is that proteins that lie closer to one another in the PPI network are more likely to have similar function. As can be seen in Figure 3, there is an evident correlation between network distance and functional distance, that is, the closer the two proteins are in the network the more similar are their functional annotations. The methods described below differ in the way they capture and exploit this correlation. In the following, we denote the PPI network as a graph $G = (V, E)$ (see Box 1 for graph-theoretic definitions).

Neighborhood counting

The simplest and most direct method for function prediction determines the function of a protein based on the known function of proteins lying in its immediate neighborhood. Schwikowski et al (2000) predict for a given protein up to three functions that are most common among its neighbors. Although simple and effective, the obvious caveats of this approach are that associations are not assigned any significance values and the full topology of the network is not taken into account in the annotation process.

Hishigaki et al (2001) try to tackle the first problem by computing $\chi^2$-like scores for function assignment. In detail, they examine the n-neighborhood of a protein (Box 1). For a protein $p$, each function $f$ is assigned a score $(n_f - e_f)^2 / e_f$, where $n_f$ is the number of proteins in the n-neighborhood of $p$ that have the function $f$ and $e_f$ is the expectation of this number based on the frequency of $f$ among the network’s proteins. A shortcoming of this approach is that within the n-neighborhood, proteins at different distances from $p$ are treated in the same way. Chua et al (2006) try to tackle the second problem by investigating the relation between network distance and functional similarity. They focus on the 1- and 2-neighborhoods of a protein, and devise a functional similarity score that gives different weights to proteins according to their distances from the target protein.
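The two neighborhood-counting ideas described above can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementations: the n-neighborhood is taken to be all proteins at distance 1 to n from p, and $e_f$ is computed as (neighborhood size) × (fraction of network proteins annotated with f). The toy network, the annotations and all function names are hypothetical.

```python
from collections import Counter, deque

def n_neighborhood(adj, p, n):
    """Proteins at distance 1..n from p, found by breadth-first search."""
    dist = {p: 0}
    queue = deque([p])
    while queue:
        u = queue.popleft()
        if dist[u] == n:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[p]
    return set(dist)

def majority_functions(adj, annotations, p, top=3):
    """Up to `top` functions most common among the direct neighbors of p."""
    counts = Counter(f for q in adj[p] for f in annotations.get(q, ()))
    return [f for f, _ in counts.most_common(top)]

def chi_square_scores(adj, annotations, p, n=1):
    """Score (n_f - e_f)^2 / e_f for every function f seen in the network."""
    hood = n_neighborhood(adj, p, n)
    if not hood:
        return {}
    freq = Counter(f for fs in annotations.values() for f in fs)
    scores = {}
    for f, k in freq.items():
        n_f = sum(1 for q in hood if f in annotations.get(q, ()))
        e_f = len(hood) * k / len(adj)   # assumed form of the expectation
        scores[f] = (n_f - e_f) ** 2 / e_f
    return scores

# Hypothetical toy PPI network; protein "P" is unannotated.
adj = {
    "P": {"A", "B", "C"},
    "A": {"P", "B"},
    "B": {"P", "A"},
    "C": {"P", "D"},
    "D": {"C"},
}
annotations = {"A": {"transport"}, "B": {"transport"},
               "C": {"signaling"}, "D": {"signaling"}}

print(majority_functions(adj, annotations, "P"))
print(chi_square_scores(adj, annotations, "P", n=1))
```

Note that the squared score is large for depletion as well as for enrichment, so in practice one would also check the sign of $n_f - e_f$.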

© 2007 EMBO and Nature Publishing Group. All rights reserved 1744-4292/07. Molecular Systems Biology 3; Article number 88; doi:10.1038/msb4100129. Citation: Molecular Systems Biology 3:88. www.molecularsystemsbiology.com

SUGGESTION 4

■ PPI networks for functional protein annotation

■ proteins involved in same function are more likely to interact

• global methods (Vazquez, ILP, functional flow method, Markov random field)

REVIEW

Network-based prediction of protein function

Roded Sharan, Igor Ulitsky and Ron Shamir

Molecular Systems Biology 3:88 (2007); doi:10.1038/msb4100129

SUGGESTION 5

■ Modules in PPI networks
• module = set of genes/proteins
  • performing a distinct biological function in an autonomous manner
  • showing similar interaction/expression pattern
• MCODE (Bader, Hogue, Bioinformatics, 2003)
  • finds dense subnetworks
• Modularity/community structure (Newman, PNAS, 2006)
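As a small illustration of the modularity idea, the sketch below evaluates Newman's modularity Q for a given partition of an undirected network, using the standard form Q = Σ_c [ L_c/m − (d_c/2m)² ], where m is the number of edges, L_c the number of edges inside community c and d_c the sum of degrees in c. It only scores a partition; it does not search for the best one, and it is not MCODE. The toy graph and names are hypothetical.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once
    community: dict mapping node -> community label
    """
    m = len(edges)
    internal = {}     # L_c: number of edges with both ends in community c
    degree_sum = {}   # d_c: sum of node degrees in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Two triangles joined by a single edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_modules = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_module = {v: 1 for v in "abcdef"}

print(modularity(edges, two_modules))  # 5/14, about 0.357
print(modularity(edges, one_module))   # 0.0
```

The split into the two triangles scores higher than putting everything into one community, which matches the intuition that modules have more internal edges than expected by chance.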

SUGGESTION 6

■ Modules (integrative approaches)

SUGGESTION 7

■ Modules (integrative approaches)

BIOINFORMATICS Vol. 24 ISMB 2008, pages i223–i231
doi:10.1093/bioinformatics/btn161

Identifying functional modules in protein–protein interaction networks: an integrated exact approach

Marcus T. Dittrich1,2,∗,†, Gunnar W. Klau3,4,∗,†, Andreas Rosenwald5, Thomas Dandekar1 and Tobias Müller1,∗

1 Department of Bioinformatics, Biocenter, University of Würzburg, Am Hubland, 97074 Würzburg, 2 Institute of Clinical Biochemistry, University of Würzburg, Josef-Schneider-Str. 2, 97080 Würzburg, 3 Mathematics in Life Sciences Group, Department of Mathematics and Computer Science, Freie Universität Berlin, Arnimallee 3, 14195 Berlin, 4 DFG Research Center MATHEON, Berlin and 5 Institute of Pathology, University of Würzburg, Josef-Schneider-Str. 2, 97080 Würzburg, Germany

ABSTRACT

Motivation: With the exponential growth of expression and protein–protein interaction (PPI) data, the frontier of research in systems biology shifts more and more to the integrated analysis of these large datasets. Of particular interest is the identification of functional modules in PPI networks, sharing common cellular function beyond the scope of classical pathways, by means of detecting differentially expressed regions in PPI networks. This requires on the one hand an adequate scoring of the nodes in the network to be identified and on the other hand the availability of an effective algorithm to find the maximally scoring network regions. Various heuristic approaches have been proposed in the literature.

Results: Here we present the first exact solution for this problem, which is based on integer-linear programming and its connection to the well-known prize-collecting Steiner tree problem from Operations Research. Despite the NP-hardness of the underlying combinatorial problem, our method typically computes provably optimal subnetworks in large PPI networks in a few minutes. An essential ingredient of our approach is a scoring function defined on network nodes. We propose a new additive score with two desirable properties: (i) it is scalable by a statistically interpretable parameter and (ii) it allows a smooth integration of data from various sources.

We apply our method to a well-established lymphoma microarray dataset in combination with associated survival data and the large interaction network of HPRD to identify functional modules by computing optimal-scoring subnetworks. In particular, we find a functional interaction module associated with proliferation over-expressed in the aggressive ABC subtype as well as modules derived from non-malignant by-stander cells.

Availability: Our software is available freely for non-commercial purposes at http://www.planet-lisa.net.

Contact: [email protected]

1 INTRODUCTION

Construction and analysis of large biological networks have become major research topics in systems biology (Aittokallio and Schwikowski, 2006). Various aspects have been analyzed including the inference of cellular networks from gene expression (Friedman, 2004), network alignments (Flannick et al., 2006; Kelley et al., 2003; Sharan and Ideker, 2006) and other related strategies as reviewed by Srinivasan et al. (2007). At the same time, well-established microarray technologies provide a wealth of information on gene expression in various tissues and under diverse experimental conditions. Integrating protein–protein interaction (PPI) and gene-expression data generates a meaningful biological context in terms of functional association for differentially expressed genes.

∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Frequently, large scale expression profiling studies investigate many experimental conditions simultaneously, thereby generating multiple P-values. Especially in tumor biology expression profiling has become a well-established tool for the classification of different tumors and tumor subtypes. Furthermore, in the clinical context, various patient-associated data are available that, in conjunction with expression data, provide valuable information of the influence of specific genes on disease-specific pathophysiology. In particular the analysis of survival data allows to establish gene expression signatures to make predictions about the prognosis and to assess the disease relevance of certain genes. However, the cellular function of an individual gene cannot be understood on the level of isolated components alone, but needs to be studied in the context of its interplay with other gene products. The combined analysis of expression profiles and PPI data thus allows the detection of previously unknown dysregulated modules in interaction networks not recognizable by the analysis of a priori defined pathways.

Ideker et al. (2002) have proposed to identify interaction modules in this setting by devising firstly an adequate scoring function on networks and secondly an algorithm to find the high-scoring subnetworks. The underlying combinatorial problem has been proven to be NP-hard for additive score functions defined on the nodes of the network. The authors proposed a heuristic strategy based on simulated annealing and developed a score to measure the significance of a subnetwork that includes the integration of multivariate P-values. This score has been extended by Rajagopalan and Agarwal (2005) to incorporate an adjustment parameter in order to obtain smaller subgraphs in conjunction with a greedy search algorithm. This approach however, excludes the possibility to combine multiple P-values. Variants of greedy search strategies have also been used by Nacu et al. (2007) and Sohler et al. (2004). Subsequently Cabusora et al. (2005) proposed an edge score by adapting the scoring concept of Ideker et al. (2002).

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

SUGGESTION 8

■ Modules (integrative approaches)

DEGAS: De Novo Discovery of Dysregulated Pathways in Human Diseases

Igor Ulitsky1¤a*, Akshay Krishnamurthy2¤b, Richard M. Karp3, Ron Shamir1

1 Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel, 2 University of California, Berkeley, California, United States of America, 3 International Computer Science Institute, Berkeley, California, United States of America

Abstract

Background: Molecular studies of the human disease transcriptome typically involve a search for genes whose expression is significantly dysregulated in sick individuals compared to healthy controls. Recent studies have found that only a small number of the genes in human disease-related pathways show consistent dysregulation in sick individuals. However, those studies found that some pathway genes are affected in most sick individuals, but genes can differ among individuals. While a pathway is usually defined as a set of genes known to share a specific function, pathway boundaries are frequently difficult to assign, and methods that rely on such definition cannot discover novel pathways. Protein interaction networks can potentially be used to overcome these problems.

Methodology/Principal Findings: We present DEGAS (DysrEgulated Gene set Analysis via Subnetworks), a method for identifying connected gene subnetworks significantly enriched for genes that are dysregulated in specimens of a disease. We applied DEGAS to seven human diseases and obtained statistically significant results that appear to home in on compact pathways enriched with hallmarks of the diseases. In Parkinson’s disease, we provide novel evidence for involvement of mRNA splicing, cell proliferation, and the 14-3-3 complex in the disease progression. DEGAS is available as part of the MATISSE software package (http://acgt.cs.tau.ac.il/matisse).

Conclusions/Significance: The subnetworks identified by DEGAS can provide a signature of the disease potentially useful for diagnosis, pinpoint possible pathways affected by the disease, and suggest targets for drug intervention.

Citation: Ulitsky I, Krishnamurthy A, Karp RM, Shamir R (2010) DEGAS: De Novo Discovery of Dysregulated Pathways in Human Diseases. PLoS ONE 5(10): e13367. doi:10.1371/journal.pone.0013367

Editor: Timothy Ravasi, King Abdullah University of Science and Technology, Saudi Arabia

Received May 9, 2010; Accepted September 8, 2010; Published October 19, 2010

Copyright: © 2010 Ulitsky et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Part of this work was performed when IU was a fellow of the Edmond J. Safra Bioinformatics Program at Tel-Aviv University. This research was supported in part by the “GENEPARK: GENomic Biomarkers for PARKinson’s disease” project that is funded by the European Commission within its FP6 Programme (contract number EU-LSHB-CT-2006-037544) and by the Israeli Science Foundation (grant no. 802/08). RS was supported in part by the European Community’s FP6 Programme (contract EU-LSHB-CT-2006- 0375 for the Genepark project) and FP7 Programme (grant HEALTH-F4-2009-223575 for the TRIREME project) and by the Israel Science Foundation (grant 802/08). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

¤a Current address: Whitehead Institute for Biomedical Research, Cambridge, Massachusetts, United States of America
¤b Current address: Computer Science Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America

Introduction

Systems biology has the potential to revolutionize the diagnosis and treatment of complex disease by offering a comprehensive view of the molecular mechanisms underlying their pathology. To achieve these goals, biologists need computational methods that extract mechanistic understanding from the masses of available data. To date, the main sources of such data are microarray measurements of genome-wide expression profiles, with over 400,000 profiles stored in GEO [1] alone as of April 2010. A wide variety of approaches for elucidating molecular mechanisms from expression data have been suggested [2,3]. However, most of these methods are effective only when using expression profiles obtained under diverse conditions and perturbations, while the bulk of data currently available from clinical studies are expression profiles of groups of diseased individuals and matched controls. These data are useful for characterizing the molecular signature of a disease for diagnostic and prognostic purposes [4,5]. However, using these expression profiles to obtain a better understanding for the pathogenesis is significantly more difficult. The standard methods applied to these data identify the genes that best predict the pathological status of the samples. While these methods are successful in identifying potent signatures for classification purposes, the mechanistic insights that can be obtained from examining the gene lists they produce are frequently limited [6].

Standard statistical tests, as well as the vast majority of more sophisticated methods utilizing diverse genomic data, look for genes whose expression is significantly and robustly different in the case and in the control cohorts. Several recent comprehensive studies, mostly in the context of cancer, have found that few genes meet these criteria. Yet, many of the affected individuals were found to carry dysregulated genes that belong to specific disease-related pathways [7,8,9,10]. In order to identify such pathways, these studies utilized a fixed collection of gene lists based on current biological knowledge. While several computational methods have been developed for quantifying the changes in the

PLoS ONE | www.plosone.org 1 October 2010 | Volume 5 | Issue 10 | e13367

SUGGESTION 9

■ Networks in prediction

■ predict treatment response

■ color-coding to find discriminative subnetworks (randomized algorithm)


BIOINFORMATICS Vol. 27 ISMB 2011, pages i205–i213
doi:10.1093/bioinformatics/btr245

Optimally discriminative subnetwork markers predict response to chemotherapy
Phuong Dao1,†, Kendric Wang2,3,†, Colin Collins3,4,∗, Martin Ester1, Anna Lapuk3,∗,‡ and S. Cenk Sahinalp1,∗,‡

1School of Computing Science, Simon Fraser University, 2Bioinformatics Training Program, University of British Columbia, 3Vancouver Prostate Centre and 4Department of Urology, University of British Columbia

ABSTRACT
Motivation: Molecular profiles of tumour samples have been widely and successfully used for classification problems. A number of algorithms have been proposed to predict classes of tumor samples based on expression profiles with relatively high performance. However, prediction of response to cancer treatment has proved to be more challenging and novel approaches with improved generalizability are still highly needed. Recent studies have clearly demonstrated the advantages of integrating protein–protein interaction (PPI) data with gene expression profiles for the development of subnetwork markers in classification problems.
Results: We describe a novel network-based classification algorithm (OptDis) using color coding technique to identify optimally discriminative subnetwork markers. Focusing on PPI networks, we apply our algorithm to drug response studies: we evaluate our algorithm using published cohorts of breast cancer patients treated with combination chemotherapy. We show that our OptDis method improves over previously published subnetwork methods and provides better and more stable performance compared with other subnetwork and single gene methods. We also show that our subnetwork method produces predictive markers that are more reproducible across independent cohorts and offer valuable insight into biological processes underlying response to therapy.
Availability: The implementation is available at: http://www.cs.sfu.ca/~pdao/personal/OptDis.html
Contact: [email protected]; [email protected]; [email protected]
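The color-coding technique mentioned in the abstract can be illustrated on a simpler problem than the one OptDis solves. The sketch below is a minimal, assumed example and not the OptDis implementation: it looks for the heaviest simple path on k vertices by randomly colouring the vertices with k colours and running a dynamic program over "colourful" paths (paths whose vertices all have different colours). The function name, the node weights and the number of trials are hypothetical choices made for illustration.

```python
import random

def best_colorful_path(adj, weight, k, trials=200, seed=0):
    """Largest total weight of a simple path on k vertices found by color-coding.

    adj    : dict node -> iterable of neighbours (undirected graph)
    weight : dict node -> numeric node weight (hypothetical per-gene score)
    Returns None if no path on k vertices was found in any trial.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    best = None
    for _ in range(trials):
        # Random colouring with k colours; a colourful path is automatically simple.
        color = {v: rng.randrange(k) for v in nodes}
        # dp[(v, mask)] = best weight of a colourful path ending at v using colours in mask
        dp = {(v, 1 << color[v]): weight[v] for v in nodes}
        for _ in range(k - 1):
            nxt = {}
            for (v, mask), w in dp.items():
                for u in adj[v]:
                    bit = 1 << color[u]
                    if mask & bit:
                        continue  # colour already used on this path
                    key = (u, mask | bit)
                    cand = w + weight[u]
                    if cand > nxt.get(key, float("-inf")):
                        nxt[key] = cand
            dp = nxt
        if dp:
            trial_best = max(dp.values())
            if best is None or trial_best > best:
                best = trial_best
    return best
```

A single random colouring makes a fixed k-vertex path colourful with probability k!/k^k, so the colouring is repeated many times; the answer is never an overestimate, but a small number of trials can miss the optimum.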

1 INTRODUCTION
In the treatment of cancers, patients presenting tumors with similar clinical characteristics will often respond differentially to the same chemotherapy (van’t Veer and Bernards, 2008). In fact, for many types of cancer, only a minority of treated patients will observe regression of tumor growth. This is the case for both conventional chemotherapeutic agents and newer targeted therapies that affect specific molecules. To achieve an effective cancer treatment, it is critical to identify the underlying mechanisms that confer chemoresistance in some tumors but not others.

The advent of genome-wide expression profiling technologies has allowed the discovery of novel biomarkers for cancer diagnosis, prognosis and treatment (van’t Veer and Bernards, 2008). While some progress has been made toward identifying reliable prognostic markers for breast and other cancers, development of molecular markers predictive of response to chemotherapy has proved to be far more difficult (van’t Veer and Bernards, 2008).

∗To whom correspondence should be addressed.
†The authors wish it to be known that in their opinion, the first two authors should be regarded as joint First Authors.
‡The authors wish it to be known that in their opinion, the last two authors should be regarded as joint Last Authors.

In recent years, a number of studies have used genome-wide expression profiling to identify genes that could be used as predictors of drug response in breast cancer (Cleator et al., 2006; Hess, 2006). In these studies, single gene marker methods were used, where each gene is individually ranked for differential expression and the top genes were selected as predictors known as single gene markers. Additional studies (Lee et al., 2007; Liedtke et al., 2010) required single gene markers not only to be differentially expressed but also to have similar coexpression between the training and test cohorts. While some of these predictive markers have shown promising results in a limited number of patient cohorts, many of these signatures have failed to achieve similar performance in additional validation studies (Bonnefoi et al., 2009). In addition, single gene markers developed from different cohorts have been shown to have very little overlap (Ein-Dor et al., 2006). A further limitation of single gene markers is that they provide relatively limited insight into the biological mechanisms underlying drug response. Thus, predictive markers with robust performance, greater reproducibility and improved insights into drug action, which are critical for clinical application, still remain elusive.

Previous studies have observed that gene products associated with cancer tend to be highly clustered in coexpression networks and have more ‘interactions’. Inspired by this observation, Chuang et al. (2007) introduced the use of all members of a protein–protein interaction (PPI) subnetwork as a metagene marker for predicting metastasis in breast cancer. Chuang et al. (2007) demonstrated that subnetwork markers are more robust, i.e. they tend to provide more reproducible results across different cohorts of patients. Motivated by the limitations in predicting drug response using single gene markers and the better performance promised by subnetwork markers, this article aims to identify subnetwork markers to predict chemotherapeutic response, as detailed below.

1.1 Subnetwork markers in other applications
Chuang et al. (2007) defined subnetwork activity as the aggregate expression of genes in a given subnetwork. The discriminative score of a subnetwork, which reflects how well the subnetwork discriminates samples of different phenotypes (or classes), was derived from mutual information between subnetwork activity and the phenotype. The study presented greedy algorithms for identifying subnetworks with the highest discriminative scores and
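The two notions in this paragraph, subnetwork activity and a mutual-information-based discriminative score, can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure of Chuang et al.: activity is taken as the plain mean of (assumed already normalised) expression values, and it is discretised with a single hypothetical threshold before the mutual information with the class labels is computed. All function and parameter names are made up for the example.

```python
import math
from collections import Counter

def subnetwork_activity(expr, genes):
    """Average the expression of the member genes, sample by sample.

    expr : dict gene -> list of expression values (one per sample)
    """
    n_samples = len(next(iter(expr.values())))
    return [sum(expr[g][s] for g in genes) / len(genes) for s in range(n_samples)]

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def discriminative_score(expr, genes, labels, threshold=0.0):
    """Score a gene set by how much its binarised activity tells about the labels."""
    activity = subnetwork_activity(expr, genes)
    binary = [a > threshold for a in activity]  # assumed discretisation
    return mutual_information(binary, labels)
```

With two balanced classes the score ranges from 0 bits (activity carries no information about the phenotype) to 1 bit (activity separates the classes perfectly).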

© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


SUGGESTION 10

■ Local network alignment • PathBLAST• NetworkBLAST

Conserved pathways within bacteria and yeast as revealed by global protein network alignment
Brian P. Kelley*, Roded Sharan†, Richard M. Karp†, Taylor Sittler*, David E. Root*, Brent R. Stockwell*, and Trey Ideker*‡

*Whitehead Institute for Biomedical Research, 9 Cambridge Center, Cambridge, MA 02142; and †International Computer Science Institute, 1947 Center Street, No. 600, Berkeley, CA 94704

Contributed by Richard M. Karp, July 25, 2003

We implement a strategy for aligning two protein–protein interaction networks that combines interaction topology and protein sequence similarity to identify conserved interaction pathways and complexes. Using this approach we show that the protein–protein interaction networks of two distantly related species, Saccharomyces cerevisiae and Helicobacter pylori, harbor a large complement of evolutionarily conserved pathways, and that a large number of pathways appears to have duplicated and specialized within yeast. Analysis of these findings reveals many well characterized interaction pathways as well as many unanticipated pathways, the significance of which is reinforced by their presence in the networks of both species.

Evolution is driven by biological variation at many levels. Mutations and rearrangements in genomic DNA lead to changes in protein structures, abundances, and modification states. Variations at the protein level, in turn, impact how proteins interact with one another, with DNA, and with small molecules to form signaling, regulatory, and metabolic networks. Changes in network organization have sweeping implications for cellular function, tissue-level responses, and the behavior and morphology of whole organisms.

Gene and protein sequences have long received the most attention as metrics for evolutionary change, both because they represent a fundamental level of biological variation and because they are readily available through automated sequencing technology. However, recent technological advances also enable us to characterize networks of protein interactions. Protein interactions are crucial to cellular function both in assembling protein complexes and in signal transduction cascades. Among the most direct and systematic methods for measuring protein interactions are coimmunoprecipitation (1) and the two-hybrid system (2), which have defined large protein–protein interaction networks for organisms including Saccharomyces cerevisiae (3–5), Helicobacter pylori (6), and Caenorhabditis elegans (7). Although the quality of data from these experiments has been mixed, pooling of multiple studies and integration with other data types such as gene expression have been used to reduce the number of false-positive interactions (8).

The rapid growth of protein network information raises a host of new questions in evolutionary and comparative biology. Given that protein sequences and structures are conserved in and among species, are networks of protein interactions conserved as well? Is there some minimal set of interaction pathways required for all species? Can we measure evolutionary distance at the level of network connectivity rather than at the level of DNA or protein sequence? Mounting evidence suggests that conserved protein interaction pathways indeed exist and may be ubiquitous: For example, proteins in the same pathway are typically present or absent in a genome as a group (9), and several hundred protein–protein interactions in the yeast network have also been identified for the corresponding protein orthologs in worms (10).

To explore interspecies pathway conservation on a global scale, we performed a series of whole-network comparisons using the protein–protein interaction networks of the budding yeast S. cerevisiae and the bacterial pathogen H. pylori. Comparative network analysis has proven powerful in a number of related domains including metabolic pathway analysis (11–14), motif finding (15), and correlation of biological networks with gene expression (16). Here we systematically search for and prioritize conserved interaction pathways in yeast vs. bacteria, yeast vs. yeast, and yeast vs. specific “queries” formulated to uncover homologous mitogen-activated protein kinase (MAPK) signaling and ubiquitin ligation machinery.

Methods
Network Comparison Overview. We developed an efficient computational procedure for aligning two protein interaction networks to identify their conserved interaction pathways.§ This procedure, which we named PATHBLAST because of its conceptual similarity to sequence alignment algorithms such as BLAST (17), searches for high-scoring pathway alignments involving two paths, one from each network, in which proteins of the first path (A, B, C, D, …) are paired with putative homologs occurring in the same order in the second path (a, b, c, d, …) (Fig. 1a). Evolutionary variations and experimental errors in pathway structure are accommodated by allowing “gaps” and “mismatches” (see also ref. 14). A gap occurs when a protein interaction in one path skips over a protein in the other, whereas a mismatch occurs when aligned proteins do not share sequence similarity. Because of space limitations, only abbreviated methods are given in the following sections; full descriptions are available in Supporting Materials and Methods and Figs. 5 and 6, which are published as supporting information on the PNAS web site, www.pnas.org.¶

Global Alignment and Scoring. To perform the alignment, the two networks are combined into a global alignment graph (Fig. 1b) in which each vertex represents a pair of proteins (one from each network) having at least weak sequence similarity (BLAST E value ≤ 10⁻²) and each edge represents a conserved interaction, gap, or mismatch. A path through this graph represents a pathway alignment between the two networks. We formulate a log probability score S(P) that decomposes over the vertices v and edges e of a path P through the global alignment graph
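The construction of the global alignment graph can be sketched in a few lines. The code below is an assumed reading of the description above, not the PathBLAST implementation: vertices are the given homolog pairs, and two vertices are joined when the proteins are within distance two in both networks. An edge is labelled "direct" when both pairs interact directly, "gap" when only one pair does, and "mismatch" when both are connected only through a common neighbour. The function names and the input format are hypothetical, and the score S(P) is left out because its formula is not given here.

```python
from itertools import combinations

def within_two(adj, a, b):
    """Return 1 if a and b interact directly, 2 if they share a neighbour, else None."""
    if b in adj[a]:
        return 1
    if adj[a] & adj[b]:
        return 2
    return None

def alignment_graph(adj1, adj2, homologs):
    """Build the labelled edges of a PathBLAST-style alignment graph.

    adj1, adj2 : dict protein -> set of interaction partners (undirected)
    homologs   : set of (protein in network 1, protein in network 2) pairs
                 that pass the sequence-similarity cut-off
    """
    edges = {}
    for (a, x), (b, y) in combinations(sorted(homologs), 2):
        if a == b or x == y:
            continue  # an edge must advance in both networks
        d1, d2 = within_two(adj1, a, b), within_two(adj2, x, y)
        if d1 is None or d2 is None:
            continue
        if d1 == 1 and d2 == 1:
            kind = "direct"
        elif d1 == 1 or d2 == 1:
            kind = "gap"
        else:
            kind = "mismatch"
        edges[((a, x), (b, y))] = kind
    return edges
```

A pathway alignment then corresponds to a path through the returned edges; scoring such paths would require the vertex and edge probabilities of the full method.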

Abbreviation: MAPK, mitogen-activated protein kinase.

‡To whom correspondence should be sent at the present address: University of California at San Diego, Department of Bioengineering, 9500 Gilman Drive, La Jolla, CA 92093. E-mail: [email protected].

§The term “pathway” has been used broadly within various molecular biological contexts to refer to biochemical reaction chains, signal transduction cascades, gene regulatory systems, or other sequences of biomolecular events. Here a pathway refers to a sequence of protein–protein interactions forming a connected path in the network.

¶We have also explored methods for identifying conserved subnetworks as opposed to linear paths (see Fig. 7, which is published as supporting information on the PNAS web site); choosing which approach is most desirable remains an open problem and depends on issues of computational efficiency and whether protein complexes or sequential pathways such as signal transduction or regulatory cascades are of highest interest.

© 2003 by The National Academy of Sciences of the USA

11394–11399 | PNAS | September 30, 2003 | vol. 100 | no. 20 www.pnas.org/cgi/doi/10.1073/pnas.1534710100

JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 12, Number 6, 2005
© Mary Ann Liebert, Inc.
Pp. 835–846

Identification of Protein Complexes by Comparative Analysis of Yeast and Bacterial Protein Interaction Data

RODED SHARAN,1 TREY IDEKER,2 BRIAN KELLEY,3 RON SHAMIR,1 and RICHARD M. KARP4

ABSTRACT

Mounting evidence shows that many protein complexes are conserved in evolution. Here we use conservation to find complexes that are common to the yeast S. cerevisiae and the bacteria H. pylori. Our analysis combines protein interaction data that are available for each of the two species and orthology information based on protein sequence comparison. We develop a detailed probabilistic model for protein complexes in a single species and a model for the conservation of complexes between two species. Using these models, one can recast the question of finding conserved complexes as a problem of searching for heavy subgraphs in an edge- and node-weighted graph, whose nodes are orthologous protein pairs. We tested this approach on the data currently available for yeast and bacteria and detected 11 significantly conserved complexes. Several of these complexes match very well with prior experimental knowledge on complexes in yeast only and serve for validation of our methodology. The complexes suggest new functions for a variety of uncharacterized proteins. By identifying a conserved complex whose yeast proteins function predominantly in the nuclear pore complex, we propose that the corresponding bacterial proteins function as a coherent cellular membrane transport system. We also compare our results to two alternative methods for detecting complexes and demonstrate that our methodology obtains a much higher specificity.

Key words: protein interaction network, probabilistic model, heavy subgraph.

1. INTRODUCTION

With the sequences of dozens of genomes at hand and the accumulating information on the transcriptomes and proteomes of different organisms, a new research paradigm is emerging in molecular biology. At the core of this paradigm is the comparative analysis of biological properties of

1School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel.
2Dept. of Bioengineering, University of California San-Diego, 9500 Gilman Drive, La Jolla, CA 92093.
3Whitehead Institute for Biomedical Research, 9 Cambridge Ctr., Cambridge, MA 02142.
4International Computer Science Institute, 1947 Center St., Berkeley, CA 94704.


• probabilistic scoring

SUGGESTION 11

■ Local network alignment • MaWish• duplication/divergence evolutionary model

JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 13, Number 2, 2006
© Mary Ann Liebert, Inc.
Pp. 182–199

Pairwise Alignment of Protein Interaction Networks

MEHMET KOYUTÜRK,1 YOHAN KIM,2 UMUT TOPKARA,1 SHANKAR SUBRAMANIAM,2,3 WOJCIECH SZPANKOWSKI,1 and ANANTH GRAMA1

ABSTRACT

With an ever-increasing amount of available data on protein–protein interaction (PPI) networks and research revealing that these networks evolve at a modular level, discovery of conserved patterns in these networks becomes an important problem. Although available data on protein–protein interactions is currently limited, recently developed algorithms have been shown to convey novel biological insights through employment of elegant mathematical models. The main challenge in aligning PPI networks is to define a graph theoretical measure of similarity between graph structures that captures underlying biological phenomena accurately. In this respect, modeling of conservation and divergence of interactions, as well as the interpretation of resulting alignments, are important design parameters. In this paper, we develop a framework for comprehensive alignment of PPI networks, which is inspired by duplication/divergence models that focus on understanding the evolution of protein interactions. We propose a mathematical model that extends the concepts of match, mismatch, and gap in sequence alignment to that of match, mismatch, and duplication in network alignment and evaluates similarity between graph structures through a scoring function that accounts for evolutionary events. By relying on evolutionary models, the proposed framework facilitates interpretation of resulting alignments in terms of not only conservation but also divergence of modularity in PPI networks. Furthermore, as in the case of sequence alignment, our model allows flexibility in adjusting parameters to quantify underlying evolutionary relationships. Based on the proposed model, we formulate PPI network alignment as an optimization problem and present fast algorithms to solve this problem. Detailed experimental results from an implementation of the proposed framework show that our algorithm is able to discover conserved interaction patterns very effectively, in terms of both accuracies and computational cost.

Key words: protein–protein interactions, network alignment, evolutionary models.

1. INTRODUCTION

Increasing availability of experimental data relating to biological sequences, coupled with efficient tools such as BLAST and CLUSTAL, have contributed to fundamental understanding of a variety of

1Department of Computer Sciences, Purdue University, West Lafayette, IN 47907.
2Department of Chemistry and Biochemistry, University of California at San Diego, La Jolla, CA 92093.
3Department of Bioengineering, University of California at San Diego, La Jolla, CA 92093.


SUGGESTION 12

■ Global network alignment• IsoRank• Adaptation of Google’s PageRank algorithm

Global alignment of multiple protein interaction networks with application to functional orthology detection
Rohit Singh*, Jinbo Xu†, and Bonnie Berger*‡§

*Computer Science and Artificial Intelligence Laboratory and ‡Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139; and †Toyota Technological Institute at Chicago, Chicago, IL 60637

Communicated by Silvio Micali, Massachusetts Institute of Technology, Cambridge, MA, July 16, 2008 (received for review March 12, 2008)

Protein–protein interactions (PPIs) and their networks play a central role in all biological processes. Akin to the complete sequencing of genomes and their comparative analysis, complete descriptions of interactomes and their comparative analysis is fundamental to a deeper understanding of biological processes. A first step in such an analysis is to align two or more PPI networks. Here, we introduce an algorithm, IsoRank, for global alignment of multiple PPI networks. The guiding intuition here is that a protein in one PPI network is a good match for a protein in another network if their respective sequences and neighborhood topologies are a good match. We encode this intuition as an eigenvalue problem in a manner analogous to Google’s PageRank method. Using IsoRank, we compute a global alignment of the Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus, and Homo sapiens PPI networks. We demonstrate that incorporating PPI data in ortholog prediction results in improvements over existing sequence-only approaches and over predictions from local alignments of the yeast and fly networks. Previous methods have been effective at identifying conserved, localized network patterns across pairs of networks. This work takes the further step of performing a global alignment of multiple PPI networks. It simultaneously uses sequence similarity and network data and, unlike previous approaches, explicitly models the tradeoff inherent in combining them. We expect IsoRank, with its simultaneous handling of node similarity and network similarity, to be applicable across many scientific domains.

biological networks | graph isomorphism | network alignment | protein–protein interactions | functional coherence
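The intuition in the abstract (two proteins match well if their neighbours match well) can be sketched as a PageRank-like power iteration for the pairwise case. This is a minimal illustration under assumptions, not the IsoRank software: each pair (i, j) collects score from neighbouring pairs (u, v), divided by the product of their degrees, and this topological term is mixed with a sequence-similarity term using a weight alpha. The function name, the value of alpha, the uniform fallback when no sequence scores are given, and the fixed iteration count are illustrative choices; both graphs are assumed undirected with no isolated nodes.

```python
def isorank_scores(adj1, adj2, seq_sim=None, alpha=0.8, iters=100):
    """Pairwise similarity scores R[(i, j)] from a PageRank-style iteration.

    adj1, adj2 : dict node -> set of neighbours (undirected, no isolated nodes)
    seq_sim    : optional dict (i, j) -> non-negative sequence similarity
    """
    pairs = [(i, j) for i in adj1 for j in adj2]
    if seq_sim is None:
        e = {p: 1.0 / len(pairs) for p in pairs}  # assumed uniform fallback
    else:
        total = sum(seq_sim.get(p, 0.0) for p in pairs)
        e = {p: seq_sim.get(p, 0.0) / total for p in pairs}
    r = dict(e)
    for _ in range(iters):
        new = {}
        for i, j in pairs:
            # Support from neighbouring pairs, shared out over their degrees.
            topo = sum(r[(u, v)] / (len(adj1[u]) * len(adj2[v]))
                       for u in adj1[i] for v in adj2[j])
            new[(i, j)] = alpha * topo + (1 - alpha) * e[(i, j)]
        norm = sum(new.values())
        r = {p: x / norm for p, x in new.items()}
    return r
```

An alignment would then be read off the scores, for example by repeatedly taking the highest-scoring pair whose two nodes are still unmatched; that extraction step is not shown.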

A fundamental goal of biology is to understand the cell as a system of interacting components. In particular, the discovery and understanding of interactions between proteins has received significant attention in recent years. Toward this goal, high-throughput experimental techniques [e.g., yeast two-hybrid (1, 2) and coimmunoprecipitation (3)] have been invented to discover protein–protein interactions (PPIs). The data from these techniques, which are still being perfected, are being supplemented by high-confidence computational predictions and analyses of PPIs (4–6). A powerful way of representing and analyzing this vast corpus of data is the PPI network: A network where each node corresponds to a protein and an edge indicates a direct physical interaction between the proteins.

As the size of PPI datasets for various species rapidly increases, comparative analysis of PPI networks across species is proving to be a valuable tool. Such analysis is similar in spirit to traditional sequence-based comparative genomic analyses; it also promises commensurate insights. As a phylogenetic tool, it offers a function-oriented perspective that complements traditional sequence-based methods. Comparative network analysis also enables us to identify conserved functional components across species (7) and perform high-quality ortholog prediction (i.e., identifying genes in different species derived from the same ancestral region). Solving these problems is crucial for transferring insights and information across species, allowing us to perform experiments in (say) yeast or fly and apply those insights toward understanding mechanisms of human diseases (8). Indeed, Bandyopadhyay et al. (9) have demonstrated that the use of PPI networks in computing orthologs produces orthology mappings that better conserve protein function across species (i.e., functional orthologs).

Previous work on PPI network alignment has almost exclusively focused on the local network alignment problem (see Global vs. Local Network Alignment) and has thus far targeted only pairwise alignments. The pioneering work of Kelley et al. (10, 11) described how BLAST similarity scores and PPI network information could be used to identify conserved functional motifs. Koyuturk et al. (12) proposed another method, motivated by biological models of duplication and deletion. Recently, Flannick et al. (7) proposed a new efficient approach, using modules of proteins to infer the alignment. Berg and Lassig (13) have proposed a Bayesian approach to this problem. Many of these methods limit the set of possible node-pairings based on sequence-based similarity scores or orthology predictions, and then add in network data to infer the alignment. This approach helps reduce the problem complexity, but lacks the flexibility of producing node-pairings that diverge from sequence-only predictions.

We note here that the graph alignment problem has also been studied in other domains. For example, in computer vision, the problem of matching a query image to an existing image in the database has often been formulated as a graph-matching problem, each image represented as a graph. Some of the solutions proposed in that domain use spectral techniques, i.e., they use eigenvalues computed based on each graph (14, 15). Our approach, which also constructs an eigenvalue problem (although, not for individual graphs) may be relevant in this domain as well.

In this article, we introduce an approach to comparative analysis of PPI networks to address the problem of finding the optimal global alignment between two or more PPI networks, aiming to find a correspondence between nodes and edges of the input networks that maximizes the overall “match” between the networks. We propose the IsoRank algorithm for multiple network alignment. The IsoRank algorithm simultaneously uses both PPI network data and sequence similarity data in an eigenvalue-based framework to

Preliminary versions of this work were presented at the RECOMB 2007, PSB 2008, and SODA 2008 Conferences.

Author contributions: R.S. and B.B. designed research; R.S. and B.B. performed research; R.S. and J.X. contributed new reagents/analytic tools; R.S. analyzed data; and R.S., J.X., and B.B. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Freely available online through the PNAS open access option.

§To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0806627105/DCSupplemental.

© 2008 by The National Academy of Sciences of the USA

www.pnas.org/cgi/doi/10.1073/pnas.0806627105 PNAS | September 2, 2008 | vol. 105 | no. 35 | 12763–12768


SUGGESTION 13

■ Global network alignment• natalie• Integer Linear Programming, Lagrangian relaxation• Bounds

Lagrangian Relaxation Applied to Sparse Global Network Alignment

Mohammed El-Kebir1,2,3, Jaap Heringa2,3,4, and Gunnar W. Klau1,3

1 Centrum Wiskunde & Informatica, Life Sciences Group, Science Park 123, 1098 XG Amsterdam, The Netherlands
{m.el-kebir,gunnar.klau}@cwi.nl

2 Centre for Integrative Bioinformatics VU (IBIVU), VU University Amsterdam, De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands
[email protected]

3 Netherlands Institute for Systems Biology, Amsterdam, The Netherlands

4 Netherlands Bioinformatics Centre, The Netherlands

Abstract. Data on molecular interactions is increasing at a tremendous pace, while the development of solid methods for analyzing this network data is lagging behind. This holds in particular for the field of comparative network analysis, where one wants to identify commonalities between biological networks. Since biological functionality primarily operates at the network level, there is a clear need for topology-aware comparison methods. In this paper we present a method for global network alignment that is fast and robust, and can flexibly deal with various scoring schemes taking both node-to-node correspondences as well as network topologies into account. It is based on an integer linear programming formulation, generalizing the well-studied quadratic assignment problem. We obtain strong upper and lower bounds for the problem by improving a Lagrangian relaxation approach and introduce the software tool natalie 2.0, a publicly available implementation of our method. In an extensive computational study on protein interaction networks for six different species, we find that our new method outperforms alternative state-of-the-art methods with respect to quality and running time. An extended version of this paper including proofs and pseudo code is available at http://arxiv.org/pdf/1108.4358v1.
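To make the optimisation problem behind global alignment concrete, the sketch below scores a node-to-node mapping and finds the best one by exhaustive search on a toy instance. It is an assumed illustration only: the scoring scheme (number of conserved edges plus optional node-pair scores) is one simple choice among the "various scoring schemes" the abstract mentions, and neither the integer linear program nor the Lagrangian relaxation of natalie is reproduced. Function names are hypothetical, and the brute force is only feasible for a handful of nodes.

```python
from itertools import permutations

def alignment_score(edges1, edges2, mapping, node_score=None):
    """Conserved edges under `mapping`, plus optional node-to-node scores."""
    e2 = {frozenset(e) for e in edges2}
    conserved = sum(
        1 for i, j in edges1
        if i in mapping and j in mapping
        and frozenset((mapping[i], mapping[j])) in e2
    )
    similarity = 0.0
    if node_score:
        similarity = sum(node_score.get(p, 0.0) for p in mapping.items())
    return conserved + similarity

def brute_force_alignment(nodes1, nodes2, edges1, edges2, node_score=None):
    """Try every injective mapping of nodes1 into nodes2 (tiny inputs only)."""
    best_map, best = {}, float("-inf")
    for image in permutations(nodes2, len(nodes1)):
        mapping = dict(zip(nodes1, image))
        s = alignment_score(edges1, edges2, mapping, node_score)
        if s > best:
            best_map, best = mapping, s
    return best_map, best
```

The number of candidate mappings grows factorially, which is why bounds from a relaxation, as described in the abstract, matter for real networks.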

1 Introduction

In the last decade, data on molecular interactions has increased at a tremendous pace. For instance, the STRING database [24], which contains protein–protein interaction (PPI) data, grew from 261,033 proteins in 89 organisms in 2003 to 5,214,234 proteins in 1,133 organisms in May 2011, more than doubling the number of proteins in the database every two years. The same trends can be observed for other types of biological networks, including metabolic, gene-regulatory, signal transduction and metagenomic networks, where the latter can incorporate the excretion and uptake of organic compounds through, for example, a microbial community [21, 12]. In addition to the plethora of experimentally derived

M. Loog et al. (Eds.): PRIB 2011, LNBI 7036, pp. 225–236, 2011.
© Springer-Verlag Berlin Heidelberg 2011

SUGGESTION 14

■ Prediction of disease-causing genes

■ genes causing similar diseases lie close together in PPI network

■ global method: PRINCE

Associating Genes and Protein Complexes with Disease via Network Propagation
Oron Vanunu1., Oded Magger1., Eytan Ruppin1, Tomer Shlomi2, Roded Sharan1*

1 School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel, 2 Department of Computer Science, Technion, Haifa, Israel

Abstract

A fundamental challenge in human health is the identification of disease-causing genes. Recently, several studies have tackled this challenge via a network-based approach, motivated by the observation that genes causing the same or similar diseases tend to lie close to one another in a network of protein-protein or functional interactions. However, most of these approaches use only local network information in the inference process and are restricted to inferring single gene associations. Here, we provide a global, network-based method for prioritizing disease genes and inferring protein complex associations, which we call PRINCE. The method is based on formulating constraints on the prioritization function that relate to its smoothness over the network and usage of prior information. We exploit this function to predict not only genes but also protein complex associations with a disease of interest. We test our method on gene-disease association data, evaluating both the prioritization achieved and the protein complexes inferred. We show that our method outperforms extant approaches in both tasks. Using data on 1,369 diseases from the OMIM knowledgebase, our method is able (in a cross validation setting) to rank the true causal gene first for 34% of the diseases, and infer 139 disease-related complexes that are highly coherent in terms of the function, expression and conservation of their member proteins. Importantly, we apply our method to study three multi-factorial diseases for which some causal genes have been found already: prostate cancer, Alzheimer and type 2 diabetes mellitus. PRINCE’s predictions for these diseases highly match the known literature, suggesting several novel causal genes and protein complexes for further investigation.
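The two constraints named in the abstract, smoothness over the network and agreement with prior information, are commonly met by an iterative propagation of the form F ← αW′F + (1 − α)Y, where W′ is a degree-normalised adjacency matrix and Y holds the prior scores. The sketch below is a minimal version of that idea under assumptions and is not the PRINCE code: the symmetric normalisation w / sqrt(deg(u)·deg(v)), the value of alpha, the fixed number of iterations and all names are illustrative choices.

```python
import math

def propagate(adj, prior, alpha=0.9, iters=200):
    """Iterate F <- alpha * W' F + (1 - alpha) * Y on a weighted undirected graph.

    adj   : dict node -> dict neighbour -> positive edge weight (symmetric)
    prior : dict node -> prior score Y (nodes not listed count as 0)
    """
    degree = {v: sum(nbrs.values()) for v, nbrs in adj.items()}
    y = {v: prior.get(v, 0.0) for v in adj}
    f = dict(y)
    for _ in range(iters):
        f = {
            v: alpha * sum(w / math.sqrt(degree[v] * degree[u]) * f[u]
                           for u, w in adj[v].items())
               + (1 - alpha) * y[v]
            for v in adj
        }
    return f
```

Candidate genes would then be ranked by their final score; nodes close to proteins with high prior scores receive more of the propagated signal than distant ones.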

Citation: Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R (2010) Associating Genes and Protein Complexes with Disease via Network Propagation. PLoS Comput Biol 6(1): e1000641. doi:10.1371/journal.pcbi.1000641

Editor: Wyeth W. Wasserman, University of British Columbia, Canada

Received August 6, 2009; Accepted December 14, 2009; Published January 15, 2010

Copyright: © 2010 Vanunu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This study was supported in part by a fellowship from the Edmond J. Safra Bioinformatics program at Tel-Aviv University (http://safrabio.cs.tau.ac.il). This study was also supported by a German-Israel Foundation grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

. These authors contributed equally to this work.

Introduction

Associating genes with diseases is a fundamental challenge in human health with applications to understanding disease mechanisms, diagnosis and therapy. Linkage studies are often used to infer genomic intervals that are associated with a disease of interest. Prioritizing genes within these intervals is a formidable challenge and computational approaches are becoming the method of choice for such problems.

When one or more genes were already implicated in a given disease, the prioritization task is often handled by computing the functional similarity between a given gene and the known disease genes. Such a similarity can be based on sequence [1], functional annotation [2], protein-protein interactions [3,4] and more (see [5] for a comprehensive review of these methods). When no causal genes are known, the prioritization is done by exploiting the modular view described above, comparing a candidate gene to other genes that were implicated in similar diseases.

Approaches in the latter category are often based on a measure of phenotypic similarity (see, e.g., [6,7]) between the disease of interest and other diseases for which causal genes are known. This is motivated by the observation that genes causing the same or similar diseases often lie close to one another in a protein-protein interaction network [3,5]. Lage et al. [7] score a candidate protein with respect to a disease of interest based on the involvement of its direct network neighbors in a similar disease. The protein and its high-confidence interactors are also suggested to form a putative protein complex that is related to the disease. Köhler et al. [8] group diseases into families. For a given disease, they employ a random walk from known genes in its family to prioritize candidate genes. Finally, Wu et al. [9] score a candidate gene g for a certain disease d based on the correlation between the vector of similarities of d to diseases with known causal genes, and the vector of closeness in a protein interaction network of g and those known disease genes. A recent follow-up work by Wu et al. introduces AlignPI, a method that exploits known gene-disease associations to align the phenotypic similarity network with the human PPI network [10]. The alignment is used to identify local dense regions of the PPI network and their associated disease clusters. The authors show the utility of their framework in causal gene prediction.
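
The correlation-based scoring attributed to Wu et al. [9] can be illustrated with a small sketch. It correlates the vector of phenotypic similarities between the query disease d and other diseases with the vector of network closeness between a candidate gene g and the known genes of those diseases. The closeness measure used here (sum of 1 / (1 + shortest-path distance)) is an assumption made for illustration, as are all function names and the toy data; the original work may define closeness differently.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Unweighted shortest-path distances from source to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def score_candidate(adj, candidate, similarity_to_d, known_genes):
    """Correlate disease similarities with network closeness of the candidate.

    similarity_to_d : dict disease -> phenotypic similarity to query disease d
    known_genes     : dict disease -> set of known causal genes
    """
    dist = bfs_distances(adj, candidate)
    diseases = sorted(known_genes)
    sims = [similarity_to_d[p] for p in diseases]
    # Assumed closeness: nearer known genes contribute more; unreachable
    # genes contribute nothing.
    closeness = [
        sum(1.0 / (1 + dist[g]) for g in known_genes[p] if g in dist)
        for p in diseases
    ]
    return pearson(sims, closeness)


# Hypothetical toy PPI network: a path g1 - g2 - g3 - g4 - g5.
ppi = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g2", "g4"],
    "g4": ["g3", "g5"],
    "g5": ["g4"],
}
similarity = {"p1": 0.9, "p2": 0.5, "p3": 0.1}  # similarity of p1..p3 to d
causal = {"p1": {"g1"}, "p2": {"g3"}, "p3": {"g5"}}

score_g2 = score_candidate(ppi, "g2", similarity, causal)
score_g4 = score_candidate(ppi, "g4", similarity, causal)
```

Here g2 lies close to the gene of the most similar disease and scores positively, while g4 lies close to the gene of the least similar disease and scores negatively, so g2 would be ranked above g4 for d.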

Most of these methods focus on prioritizing independent genes; however, in many cases, mutations at different loci could lead to the same disease. This genetic heterogeneity may reflect an underlying molecular mechanism in which the disease-causing genes form some kind of a functional module (e.g., a multi-protein complex or a signaling pathway) [7,11]. For example, Fanconi anemia is a heterogeneous syndrome for which seven of its causing genes are known to form a protein complex which functions in

PLoS Computational Biology | www.ploscompbiol.org 1 January 2010 | Volume 6 | Issue 1 | e1000641


SUGGESTION 15

■ Late-breaking results (RECOMB 2012, ISMB 2012)

■ ISMB: Notification 16 March

■ RECOMB 2012

■ Find multiple deregulated signaling pathways in one go

■ Solve via a message passing algorithm for a Steiner forest problem

Simultaneous reconstruction of multiple signaling pathways via the prize-collecting Steiner forest problem

Nurcan Tuncbag1, Alfredo Braunstein2,3, Andrea Pagnani3, Shao-Shan Carol Huang1, Jennifer Chayes4, Christian Borgs4, Riccardo Zecchina2,3, Ernest Fraenkel1

1 Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. {ntuncbag, shhuang, fraenkel-admin}@mit.edu

2 Department of Applied Science, Politecnico di Torino, C.so Duca degli Abruzzi 24, 10129 Torino, Italy. {alfredo.braunstein, riccardo.zecchina}@polito.it

3 Human Genetics Foundation, Via Nizza 52, 10126 Torino, Italy. [email protected]

4 Microsoft Research New England, One Memorial Drive, Cambridge, MA 02142, USA. {jchayes, borgs}@microsoft.com

Abstract. Signaling networks are essential for cells to control processes such as growth and response to stimuli. Although many “omic” data sources are available to probe signaling pathways, these data are typically sparse and noisy. Thus, it has been difficult to use these data to discover the cause of the diseases. We overcome these problems and use “omic” data to simultaneously reconstruct multiple pathways that are altered in a particular condition by solving the prize-collecting Steiner forest problem. To evaluate this approach, we use the well-characterized yeast pheromone response. We then apply the method to human glioblastoma data, searching for a forest of trees each of which is rooted in a different cell surface receptor. This approach discovers both overlapping and independent signaling pathways that are enriched in functionally and clinically relevant proteins, which could provide the basis for new therapeutic strategies.

Keywords: prize-collecting Steiner forest, signaling pathways, multiple network reconstruction
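
To give a feeling for what a prize-collecting Steiner forest trades off, the following sketch evaluates one common form of the objective for a candidate forest: the prizes of all nodes left out, plus the costs of all edges used, plus a penalty per tree. The exact weighting in the paper may differ; the parameters `beta` and `omega`, the function name and the toy data are assumptions for illustration only. The sketch only scores a given forest. It does not implement the message-passing optimization.

```python
def forest_objective(prizes, costs, forest_edges, beta=1.0, omega=1.0):
    """Score a candidate forest (lower is better).

    prizes       : dict node -> prize (>= 0), e.g. derived from "omic" data
    costs        : dict frozenset({u, v}) -> edge cost
    forest_edges : iterable of (u, v) pairs forming the candidate forest
    beta, omega  : assumed trade-off weights for missed prizes and tree count
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edge_cost = 0.0
    for u, v in forest_edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            raise ValueError("edges contain a cycle, so this is not a forest")
        parent[ru] = rv
        edge_cost += costs[frozenset((u, v))]

    covered = set(parent)  # nodes touched by at least one forest edge
    n_trees = len({find(x) for x in covered})
    missed = sum(p for v, p in prizes.items() if v not in covered)
    return beta * missed + edge_cost + omega * n_trees


# Hypothetical toy instance with two receptors r1 and r2.
toy_prizes = {"r1": 0.0, "a": 5.0, "b": 1.0, "r2": 0.0, "c": 4.0}
toy_costs = {
    frozenset(("r1", "a")): 1.0,
    frozenset(("a", "b")): 3.0,
    frozenset(("r2", "c")): 1.0,
}

small = forest_objective(toy_prizes, toy_costs, [("r1", "a"), ("r2", "c")])
large = forest_objective(
    toy_prizes, toy_costs, [("r1", "a"), ("a", "b"), ("r2", "c")]
)
empty = forest_objective(toy_prizes, toy_costs, [])
```

In the toy instance, connecting b costs 3 but only recovers a prize of 1, so the smaller forest has the better score. Trees consisting of a single node cannot be expressed in this edge-only representation, which is a simplification of the sketch.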

1 Introduction

High-throughput technologies including mass spectrometry, chromatin immunoprecipitation followed by sequencing (ChIP-Seq), RNA sequencing (RNA-seq), microarray and screening methods have the potential to provide dramatically new insights into biological processes. By providing a relatively comprehensive view of the changes that occur for a specific type of molecule or perturbation, these approaches can uncover previously unrecognized processes in a system of interest. However, interpreting these data types together to provide a coherent view of the biological processes is still a challenging task. In order to discover how changes in

ASSIGNMENT OF TOPICS

■ Groups of students (depending on number of participants): up to two

■ Until 14 February (next week):
• Either a ranked list of three of the proposed topics
• Or a proposal of an appropriate topic of your own
  • state why it is interesting
  • propose material

PROJECT

■ Can be anything
• Survey of methods
• Application of implementations to dataset
• Implement and apply method
• Describe/implement own method

■ Group work possible

■ Written report

■ Short presentation in last slot (12 June)
