UNIT 8.25
Network-Based Approaches for Pathway Level Analysis

Tin Nguyen,1 Cristina Mitrea,2 and Sorin Draghici2,3

1Department of Computer Science and Engineering, University of Nevada, Reno, Nevada
2Department of Computer Science, Wayne State University, Detroit, Michigan
3Department of Obstetrics and Gynecology, Wayne State University, Detroit, Michigan

Identification of impacted pathways is an important problem because it allows us to gain insights into the underlying biology beyond the detection of differentially expressed genes. In the past decade, a plethora of methods have been developed for this purpose. The last generation of pathway analysis methods are designed to take into account various aspects of pathway topology in order to increase the accuracy of the findings. Here, we cover 34 such topology-based pathway analysis methods published in the past 13 years. We compare these methods on categories related to implementation, availability, input format, graph models, and statistical approaches used to compute pathway level statistics and statistical significance. We also discuss a number of critical challenges that need to be addressed, arising both in methodology and pathway representation, including inconsistent terminology, data format, lack of meaningful benchmarks, and, more importantly, a systematic bias that is present in most existing methods. © 2018 by John Wiley & Sons, Inc.

Keywords: systems biology, pathway topology, gene network, survey, pathway analysis

How to cite this article:
Nguyen, T., Mitrea, C., & Draghici, S. (2018). Network-based approaches for pathway level analysis. Current Protocols in Bioinformatics, 61, 8.25.1–8.25.24. doi: 10.1002/cpbi.42

Current Protocols in Bioinformatics 8.25.1–8.25.24, March 2018. Published online March 2018 in Wiley Online Library (wileyonlinelibrary.com). doi: 10.1002/cpbi.42. Copyright © 2018 John Wiley & Sons, Inc. Analyzing Molecular Interactions, Supplement 61.

INTRODUCTION

With rapid advances in high-throughput technologies, various kinds of genomic data have become prevalent in most of biomedical research. Advanced techniques in sequencing (e.g., RNA-Seq, miRNA-Seq, DNA-Seq) and microarray assays (e.g., gene expression, methylation) have transformed biological research by enabling comprehensive monitoring of biological systems. Vast amounts of data of all types have accumulated in many public repositories, such as Gene Expression Omnibus (GEO; Barrett et al., 2013; Edgar, Domrachev, & Lash, 2002), ArrayExpress (Brazma et al., 2003; Rustici et al., 2013), The Cancer Genome Atlas (https://cancergenome.nih.gov/), and cBioPortal (Cerami et al., 2012; Gao et al., 2013). However, there is a large gap between the ease of data collection and our ability to extract knowledge from these data. Contributing to this gap is the fact that living organisms are complex systems whose emerging phenotypes are the results of multiple complex interactions taking place on various metabolic and signaling pathways.

Regardless of the technology being used, a typical comparative analysis (e.g., disease versus control, treated versus not treated, drug A versus drug B, etc.) often yields a set of genes that are differentially expressed (DE) between the two phenotypes. Even though these lists of DE genes are important in identifying the genes that may be involved in biological changes, they fail to reveal the underlying mechanisms. In order to translate these lists of DE genes into a better understanding of biological phenomena, researchers have developed a variety of knowledge bases that map genes to functional modules. Depending on the amount of information that one wishes to include, these modules can be described as


Figure 8.25.1 Pathways are more than gene sets. Panel A shows the graphical representation of the ERBB signaling pathway from the KEGG database, while panel B shows the set of genes on the pathway. The graph in panel A contains important information regarding gene product (protein) localization; gene, protein, or metabolite interactions and the types of these interactions (activation, repression, etc.); the direction of the signal propagation; etc. Over-representation analysis (ORA) and functional class scoring (FCS) approaches are unable to exploit this information. As such, for the same molecular measurement, these approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discovery. In contrast, network-based approaches are able to take into account the topological order of the genes and their interactions.

simple gene sets based on a function, process, or component (e.g., the Molecular Signatures Database, MSigDB; Liberzon et al., 2011); organized in a hierarchical structure that contains information about the relationship between the various modules, as found in the Gene Ontology (Ashburner et al., 2000); or organized into pathways that describe in detail all known interactions between the various genes that are involved in a certain phenomenon. Biological processes in which genes are known to interact with each other are accumulated in public databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG; Kanehisa, Furumichi, Tanabe, Sato, & Morishima, 2017; Kanehisa & Goto, 2000), Reactome (Croft et al., 2014), and Biocarta (http://www.biocarta.com).

Concurrently, statistical methods have been developed to identify the functional modules or pathways that are impacted, based on the differential expression evidence. They allow us to gain insights into the functional mechanisms of cells beyond the detection of differentially expressed genes. The earliest approaches use Over-Representation Analysis (ORA; Draghici, Khatri, Martins, Ostermeier, & Krawetz, 2003; Tavazoie, Hughes, Campbell, Cho, & Church, 1999) to identify gene sets that have more DE genes than expected


Table 8.25.1 Pathway Analysis Tools

| Method | Availability(a) | HIPAA(b) | License(c) | Code(d) | Year | Reference |
|---|---|---|---|---|---|---|
| Signaling pathway analysis using gene expression | | | | | | |
| MetaCore(e) | Web (https://www.genego.com/metacore.php) | No | Thomson Reuters | Java | 2004 | NA |
| Pathway-Express | Standalone (Bioconductor), Web (http://vortex.cs.wayne.edu); superseded by the ROntoTools | No | Free(f) | Java, R | 2005 | Draghici et al., 2007; Khatri et al., 2007 |
| PathOlogist | Standalone (ftp://ftp1.nci.nih.gov/pub/pathologist) | No | Free | MATLAB | 2007 | Efroni, Schaefer, & Buetow, 2007; Greenblum, Efroni, Schaefer, & Buetow, 2011 |
| iPathwayGuide(e) | Web (https://www.advaitabio.com/products.html) | Yes | Advaita Corp. | Java, R | 2009 | NA |
| SPIA | Standalone (Bioconductor) | No | GPL (≥2) | R | 2009 | Tarca et al., 2009 |
| NetGSA | Standalone (https://www.biostat.washington.edu/~ashojaie/software) | No | GPL-2 | R | 2009 | Shojaie & Michailidis, 2009; Shojaie & Michailidis, 2010 |
| PWEA | Standalone (https://zlab.bu.edu/PWEA) | No | Free(f) | C++ | 2010 | Hung et al., 2010 |
| TopoGSA | Web (https://www.infobiotics.net/topogsa) | No | Free(f) | PHP, R | 2010 | Glaab, Baudot, Krasnogor, & Valencia, 2010 |
| TopologyGSA | Standalone (CRAN) | No | AGPL-3 | R | 2010 | Massa, Chiogna, & Romualdi, 2010 |
| DEGraph | Standalone (Bioconductor) | No | GPL-3 | R | 2010 | Jacob, Neuvial, & Dudoit, 2010 |
| GGEA | Standalone (Bioconductor) | No | Artistic-2.0 | R | 2011 | Geistlinger, Csaba, Kuffner, Mulder, & Zimmer, 2011 |
| BPA | Standalone (https://bumil.boun.edu.tr/bpa) | No | Free(f) | MATLAB | 2011 | Isci, Ozturk, Jones, & Otu, 2011 |
| GANPA | Standalone (CRAN) | No | GPL-2 | R | 2011 | Fang, Tian, & Ji, 2011 |
| ROntoTools(e) | Standalone (Bioconductor) | No | CC BY-NC-ND 4 | R | 2012 | Voichita et al., 2012 |
| BAPA-IGGFD | No implementation available | No | NA | R | 2012 | Zhao et al., 2012 |
| CePa | Standalone (CRAN), Web (https://mcube.nju.edu.cn/cgi-bin/cepa/main.pl) | No | GPL (≥2) | R | 2012 | Gu, Liu, Cao, Zhang, & Wang, 2012 |
| THINK-Back-DS | Standalone, Web (https://eecs.umich.edu/db/think/software.html) | No | Free(f) | Java | 2012 | Farfan, Ma, Sartor, Michailidis, & Jagadish, 2012 |
| TBScore | No implementation available | No | NA | NA | 2012 | Ibrahim, Jassim, Cawthorne, & Langlands, 2012 |
| ACST | Standalone (available as article supplemental) | No | NA | R | 2012 | Mieczkowski, Swiatek-Machado, & Kaminska, 2012 |
| EnrichNet | Web (https://www.enrichnet.org) | No | Free | PHP | 2012 | Glaab, Baudot, Krasnogor, Schneider, & Valencia, 2012 |
| clipper | Standalone (https://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2013 | Martini, Sales, Massa, Chiogna, & Romualdi, 2013 |
| DEAP | Standalone (available as article supplemental) | No | GNU Lesser GPL | Python | 2013 | Haynes, Higdon, Stanberry, Collins, & Kolker, 2013 |
| DRAGEN | Standalone (https://bioinfo.au.tsinghua.edu.cn/dragen) | No | NA | C++ | 2014 | Ma, Jiang, & Jiang, 2014 |
| ToPASeq(g) | Standalone (Bioconductor) | No | AGPL-3 | R | 2016 | Ihnatova & Budinska, 2015 |
| pDis | Standalone (Bioconductor) | No | Free(f) | R | 2016 | Ansari, Voichita, Donato, Tagett, & Draghici, 2017 |
| SPATIAL | No implementation available | No | NA | NA | 2016 | Bokanizad, Tagett, Ansari, Helmi, & Draghici, 2016 |
| BLMA(h) | Standalone (Bioconductor) | No | GPL (≥2) | R | 2017 | Nguyen, Tagett, Donato, Mitrea, & Draghici, 2016; also see Internet Resources |
| Signaling pathway analysis using multiple types of data | | | | | | |
| PARADIGM | Standalone (https://sbenz.github.com/Paradigm) | No | UCSC-CGB, free(f) | C | 2010 | Vaske et al., 2010 |
| microGraphite | Standalone (https://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2014 | Calura et al., 2014 |
| mirIntegrator | Standalone (Bioconductor) | No | GPL-3 | R | 2016 | Diaz et al., 2016; Nguyen et al., 2016 |
| Metabolic pathway analysis | | | | | | |
| ScorePAGE | No implementation available | No | NA | NA | 2004 | Rahnenfuhrer et al., 2004 |
| TAPPA | No implementation available | No | NA | NA | 2007 | Gao & Wang, 2007 |
| MetPA | Web (https://metpa.metabolomics.ca) | No | GPL (≥2) | PHP, R | 2010 | Xia & Wishart, 2010 |
| MetaboAnalyst | Web (https://metpa.metabolomics.ca) | No | GPL (≥2) | R | 2011 | Xia & Wishart, 2011; Xia, Sinelnikov, Han, & Wishart, 2015 |

(a) Availability is a criterion that describes the implementation of the method as standalone or Web-based.
(b) HIPAA provides information about HIPAA compliance.
(c) License provides information about the type of the software license. GPL is an abbreviation for the GNU General Public License. AGPL is an abbreviation for the GNU Affero General Public License. CC BY-NC-ND 4 is an abbreviation of Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License.
(d) Code shows the programming language used for the method implementation.
(e) Commercial method.
(f) Free for academic and non-commercial use. UCSC-CGB is the University of California Santa Cruz Cancer Genome Browser.
(g) ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA.
(h) BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

by chance. The drawbacks of this type of approach include: (i) it only considers the number of DE genes and completely ignores the magnitude of the actual expression changes, resulting in information loss; (ii) it assumes that genes are independent, which they are not (since the pathways are graphs that describe precisely how these genes influence each other); and (iii) it ignores the interactions between various genes and modules. Functional Class Scoring (FCS) approaches, such as Gene Set Enrichment Analysis (GSEA; Subramanian et al., 2005) and Gene Set Analysis (GSA; Efron & Tibshirani, 2007), have been developed to address some of the issues raised by ORA approaches. The main improvement of FCS is the observation that small but coordinated changes in expression of functionally related genes can have significant impact on pathways.
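To make the ORA idea concrete, the following is a minimal sketch that assumes the one-sided hypergeometric model often used for over-representation tests; the text above does not prescribe a specific test, and the function and parameter names are hypothetical. Note that only four counts enter the calculation, so neither the magnitude of expression changes nor any gene interaction can influence the result, which is exactly drawbacks (i) to (iii).

```python
from math import comb


def ora_p_value(n_measured, n_de, n_pathway, n_de_in_pathway):
    """One-sided hypergeometric p-value for over-representation.

    Probability of observing at least `n_de_in_pathway` DE genes in a
    pathway of `n_pathway` genes when `n_de` DE genes are drawn at random
    from `n_measured` measured genes. All names are illustrative.
    """
    upper = min(n_de, n_pathway)          # largest possible overlap
    total = comb(n_measured, n_de)        # all ways to choose the DE set
    tail = sum(
        comb(n_pathway, k) * comb(n_measured - n_pathway, n_de - k)
        for k in range(n_de_in_pathway, upper + 1)
    )
    return tail / total


# Invented toy numbers: 20 measured genes, 5 DE genes,
# a pathway with 4 genes, 3 of which are DE.
p = ora_p_value(n_measured=20, n_de=5, n_pathway=4, n_de_in_pathway=3)
print(f"ORA p-value: {p:.4f}")
```

Under these assumptions, any two pathways with the same four counts receive the same p-value, regardless of which genes are DE or how strongly they changed.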

ORA and FCS approaches are often referred to as gene set enrichment methods. Comprehensive lists of gene set analysis approaches, as well as comparisons between them, can be found in well-developed surveys (Emmert-Streib & Glazko, 2011; Kelder, Conklin, Evelo, & Pico, 2010; Khatri, Sirota, & Butte, 2012; Misman et al., 2009). While useful for the purpose for which they have been developed (to analyze sets of genes), these methods completely ignore the topology and interaction between genes. Topology-based approaches, which fully exploit all the knowledge about how genes interact as described by pathways, have been developed more recently. The first such techniques were ScorePAGE (Rahnenfuhrer, Domingues, Maydt, & Lengauer, 2004) for metabolic pathways and Impact Analysis (Draghici et al., 2007; Tarca et al., 2009) for signaling


pathways. Figure 8.25.1 shows an example pathway, named ERBB signaling pathway, in which the nodes represent genes and compounds, while the edges represent the known interactions between them. Gene set analysis approaches are not able to account for the topological order of genes, nor are they able to explain the signal propagation and mechanisms of the pathway. These approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discovery. In contrast, network-based pathway analysis can distinguish between this pathway and any other pathway with the same proportion of DE genes.
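The contrast can be illustrated with a deliberately simple toy score. This is an assumption made purely for illustration and is not any of the surveyed methods: each DE gene contributes one point plus one point per gene it can reach downstream. Two hypothetical pathways with the same genes and the same DE gene present identical counts to a gene set method, yet the toy score differs once edge direction is taken into account.

```python
def downstream(graph, start):
    """Return all nodes reachable from `start` along directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(start)  # do not count the gene itself, even in a cycle
    return seen


def toy_topology_score(graph, de_genes):
    """Toy score (illustrative only): 1 per DE gene plus its downstream reach."""
    return sum(1 + len(downstream(graph, gene)) for gene in de_genes)


# Two hypothetical pathways over the same four genes, edges reversed.
chain = {"A": ["B"], "B": ["C"], "C": ["D"]}
reversed_chain = {"D": ["C"], "C": ["B"], "B": ["A"]}
de_genes = {"A"}

# A gene set method sees the same input for both: 4 genes, 1 of them DE.
print(toy_topology_score(chain, de_genes))           # A sits upstream of B, C, D
print(toy_topology_score(reversed_chain, de_genes))  # A sits at the end of the chain
```

In this sketch the first pathway scores higher because the DE gene is at the top of the cascade, which is the kind of positional information that ORA and FCS discard.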

Here, we provide a survey of 34 network-based methods developed for pathway analysis. Our survey of commercial tools for pathway analysis found iPathwayGuide (Advaita Corporation, https://www.advaitabio.com) and MetaCore (Thomson Reuters, https://www.thomsonreuters.com) to be topology based. Other commercial tools, such as Ingenuity Pathway Analysis (https://www.ingenuity.com) or Genomatix (https://www.genomatix.de), do not use pathway topology information and thus are not included in this review. The existence of only two commercial tools is more evidence of the challenge faced by the developers of such methods due to lack of standards.

In this document, we categorize and compare the methods based on the following criteria: the type of input the method accepts, graph model, and the statistical approaches used to evaluate gene and pathway changes. Under the heading “Experiment Input and Pathway Databases” below, we describe the types of input the surveyed methods use. Under the heading “Graph Models,” we provide details regarding the graph models used for the biological networks. Under “Pathway Scoring Strategies,” we discuss the statistical methods used by the surveyed methods to assess gene and pathway changes between two phenotypes. Under “Challenges in Pathway Analysis,” we discuss a number of outstanding challenges that need to be addressed in order to improve the reliability as well as the relevance of the next-generation pathway analysis approaches. We discuss the key elements and limitations of the surveyed methods without going into details of each technique. For a more detailed review and technical descriptions of network-based pathway analysis methods, please refer to Mitrea et al. (2013) or referenced manuscripts (Table 8.25.1).

NETWORK-BASED PATHWAY ANALYSIS

The term pathway analysis is used in a very broad context in the literature, including biological network construction and inference. Here, we focus on methods that are able to exploit biological knowledge in public repositories, rather than on network inference approaches that attempt to infer or reconstruct pathways from molecular measurements. Table 8.25.1 shows the list of 34 pathway analysis approaches, together with their availability and licensing. In this section, we start by describing the availability of each method before discussing the input format, graph model, and underlying statistical approaches.

Software Availability and Implementation

We often think that the main strength of an approach lies in its novelty and algorithm efficiency. However, the implementation and availability of a tool have become increasingly important for several reasons. First, software availability and version control are crucial for reproducing the experimental results that were used to assess the performance of the approach (Sandve, Nekrutenko, Taylor, & Hovig, 2013). For this reason, many journals request authors to make their software and data available before accepting methods articles. Second, if a software application is not ready-to-run, it is very unlikely that the intended audience (mostly life scientists) will invest the time to understand and implement complex algorithms. Practicality, user-friendliness, output format, and type of interface are all to be considered. Depending on the desired availability and intended audience, a software package may be implemented as standalone or Web based. Among the 34 approaches, there are 30 that are available either as a standalone software package or Web service.

Typically, standalone tools need to be installed on local machines or servers, which often requires some administrative skills. Most standalone tools depend on full or partial copies of public pathway databases, stored locally, and need to be updated periodically. Advantages of standalone tools include: (i) instant availability that does not require Internet access, and (ii) the security and privacy of the experimental data. Web-based tools, on the


other hand, run the analyses on a remote server, providing computational power and a graphical interface. The major advantage of Web-based tools is that they are user-friendly and do not require a separate local installation. From the accessibility perspective, Web-based tools have the advantage of being available from any location, as long as there is an Internet connection and a browser available. Also, the update is almost transparent to the client. This makes the user's task easy and enables collaboration, since users all over the world can utilize the same method without the burden of installing it or keeping it up-to-date. There are methods that provide both Web-based and standalone implementations.

HIPAA compliance may also be a factor in certain applications that involve data coming from patients, or data linked to other clinical data or clinical records. Currently, iPathwayGuide is the only tool available that can do topological pathway analysis and is HIPAA compliant.

The programming language and style used for implementation also play an important role in the acceptance of a method. Software tools that are neatly implemented and packaged are more appealing compared to those that do not have ready-to-use implementations. Many of the methods are implemented in the R programming language and are available as software packages from Bioconductor, CRAN, or the author's Web site. Their popularity among biologists and bioinformaticians is due to the fact that many bioinformatics-dedicated packages are available in R. In addition, the rigorous review procedure provided by Bioconductor and CRAN makes the software packages more standardized and reliable.

Experiment Input and Pathway Databases

A pathway analysis approach typically requires two types of input: experiment data and known biological networks. Experiment data is usually collected from high-throughput experiments that compare a condition phenotype to a control phenotype. A condition phenotype can be a disease, a drug treatment, or the knock-out (KO) of a gene, while a control phenotype can be a healthy state, a different disease or drug treatment, or wild-type (non-KO) samples. Experimental data can be obtained from multiple technologies that produce different types of data: gene expression, protein abundance, metabolite concentration, miRNA expression, etc. Biological networks or pathways are often represented in the form of graphs that capture our current knowledge about the interactions of genes, proteins, metabolites, or compounds in an organism. The pathway data is accumulated, updated, and refined by amassing knowledge from scientific literature describing individual interactions or high-throughput experiment results.

Table 8.25.2 shows a summary of input format and mathematical modeling of the surveyed approaches. Most pathway analysis methods analyze data from high-throughput experiments, such as microarrays, next-generation sequencing, or proteomics. They accept either a list of gene IDs or a list of such gene IDs associated with measured changes. These changes could be measured with different technologies and therefore can serve as proxies for different biochemical entities. For instance, one could use gene expression changes measured with microarrays, or protein levels measured with a proteomic approach, etc.

Different analysis methods use different input formats. Many methods accept a list of all genes considered in the experiment together with their expression values. Some analysis methods select a subset of genes considered to be differentially expressed (DE) based on a predefined cut-off. The cut-off is typically applied on fold-change, p-value, or both. These methods use the list of DE genes and their corresponding statistics (fold-change, p-value) as input. Other methods use only the list of DE genes without corresponding expression values because their scoring methods are based only on the relative positions of the genes in the graph.

Methods which use cut-offs are sensitive to the chosen threshold value because a small change in the cut-off may drastically change the number of selected genes (Nam & Kim, 2008). In addition, they typically use the most significant genes and discard the rest, whose weaker but coordinated changes may also have significant impact on pathways. Genes with moderate differential expression may be lost, even though they might be important players in the impacted pathways (Ben-Shaul, Bergman, & Soreq, 2005). Furthermore, the genes included in the set of DE genes can vary dramatically if the selection methods are changed. Hence, the results of pathway analyses based on DE genes may be vastly different depending on both the selection method as well as the threshold value (Pan, Lih, & Cohen, 2005). Furthermore, for the same disease,


Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods is Presented

| Method | Experiment input(a) | Pathway database(b) | Graph model(c) | Pathway scoring |
|---|---|---|---|---|
| Signaling pathway analysis using gene expression | | | | |
| MetaCore | DE genes | Proprietary canonical pathway, genome-scale network | Single-type, directed | Hierarchically aggregated |
| Pathway-Express | DE genes, change or measured genes, change(d) | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| PathOlogist | Measured genes, expression | KEGG | Multi-type, directed | Hierarchically aggregated |
| iPathwayGuide | DE genes, change or measured genes, change | KEGG signaling, Reactome, NCI, BioCarta | Single-type, directed | Hierarchically aggregated |
| SPIA | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| NetGSA | Measured genes, expression | KEGG signaling | Single-type, directed | Multivariate analysis |
| PWEA | Measured genes, expression | YeastNet | Single-type, undirected | Hierarchically aggregated |
| TopoGSA | DE genes | PPI network, KEGG | Single-type, undirected | Hierarchically aggregated |
| TopologyGSA | Measured genes, expression | NCI-PID | Single-type, undirected | Multivariate analysis |
| DEGraph | Measured genes, expression | KEGG | Single-type, undirected | Multivariate analysis |
| GGEA | Measured genes, expression | KEGG | Single-type, directed | Aggregate fuzzy similarity |
| BPA | Measured genes, expression (with cut-off) | NCI-PID | Single-type, DAG | Bayesian network |
| GANPA | DE genes, change or measured genes, expression | PPI network, KEGG, Reactome, NCI-PID, HumanCyc | Single-type, undirected | Hierarchically aggregated |
| ROntoTools | DE genes, change or measured gene expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BAPA-IGGFD | Measured genes, expression (with cut-off) | Literature-based interaction database, KEGG, WikiPathways, Reactome, MSigDB, GO BP, PANTHER; constructed gene association network from PPIs, co-annotation in GO Biological Process (BP), and co-expression in microarray data | Single-type, DAG | Bayesian network |
| CePa | DE genes, expression or measured genes, expression | NCI-PID | Single-type, directed | Hierarchically aggregated |
| THINK-Back-DS | DE genes, change; measured genes, expression | KEGG, PANTHER, BioCarta, Reactome, GenMAPP | Single-type, directed | Hierarchically aggregated |
| TBScore | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| ACST | Measured genes, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| EnrichNet | DE genes, list | PPI network, KEGG, BioCarta, WikiPathways, Reactome, NCI-PID, InterPro, GO with STRING 9.0 | Single-type, undirected | Hierarchically aggregated |
| clipper | Measured genes, expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| DEAP | Measured genes, expression | KEGG, Reactome | Single-type, directed | Hierarchically aggregated |
| DRAGEN | Measured genes, expression | RegulonDB, M3D, HTRIdb, ENCODE, MSigDB | Single-type, directed | Linear regression |
| ToPASeq(e) | Measured genes, expression | KEGG | Single-type, directed | Hierarchical & multivariate |
| pDis | DE genes, change or measured genes, change(d) | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| SPATIAL | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BLMA(f) | Measured genes, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| Signaling pathway analysis using multiple types of data | | | | |
| PARADIGM | Measured genes, expression, copy number, and proteins levels | Constructed PPI networks from MIPS, DIP, BIND, HPRD, IntAct, and BioGRID | Multi-type, directed | Hierarchically aggregated |
| microGraphite | Measured gene expression and miRNA expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| mirIntegrator | DE genes, change and DE miRNA, change | KEGG signaling, miRTarBase | Single-type, directed | Hierarchically aggregated |
| Metabolic pathway analysis | | | | |
| ScorePAGE | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| TAPPA | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| MetPA | DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |
| MetaboAnalyst | DE genes, change and DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |

(a) Experiment input shows the experiment data input format for each method. “DE” means differentially expressed. “Change” means fold-change value or t-statistics when comparing gene/metabolites values between two phenotypes. “Measured” means the list of all the genes/metabolites measured in the experiment. “List” represents a list of genes/metabolites identifiers (e.g., symbols). “With cut-off” shows methods that take as input the list of all measured genes and, in the analysis, they mark the DE genes.
(b) Pathway database shows the name of the knowledge source for biological interactions.
(c) Graph model is a characteristic that shows if the graph has one or multiple types of nodes, as well as if directed or undirected.
(d) The package ROntoTools (Bioconductor) allows for analysis using expression values of all genes.
(e) ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA.
(f) BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

independent studies or measurements often produce different sets of differentially expressed (DE) genes (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Ein-Dor, Zuk, & Domany, 2006; Tan et al., 2003). This makes approaches that use DE genes as input appear even more unreliable.
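The threshold sensitivity discussed above can be seen in a small sketch. The gene names, statistics, and default cut-offs below are invented for illustration and do not come from any dataset or surveyed tool; the point is only that genes sitting just inside or outside a cut-off flip in and out of the DE list when the thresholds move slightly.

```python
def select_de_genes(stats, max_p=0.05, min_abs_log2fc=1.0):
    """Return genes passing both a p-value and an absolute log2 fold-change cut-off.

    `stats` maps a gene ID to a (log2 fold-change, p-value) pair.
    The default thresholds are common conventions, used here as assumptions.
    """
    return sorted(
        gene
        for gene, (log2fc, p_value) in stats.items()
        if p_value <= max_p and abs(log2fc) >= min_abs_log2fc
    )


# Invented statistics: several genes lie close to the thresholds.
stats = {
    "G1": (2.10, 0.001),
    "G2": (1.05, 0.040),
    "G3": (0.98, 0.030),
    "G4": (-0.95, 0.049),
    "G5": (-1.60, 0.051),
    "G6": (0.20, 0.600),
}

strict = select_de_genes(stats)
relaxed = select_de_genes(stats, max_p=0.06, min_abs_log2fc=0.9)
print(len(strict), strict)
print(len(relaxed), relaxed)
```

With these made-up numbers, a modest relaxation of both cut-offs more than doubles the DE list, so any downstream method that takes only the DE list as input would be analyzing a substantially different gene set.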

Usually, pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds and edges to represent reactions that transform one or more compound(s) into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges rather than with nodes, as they are in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally and arguably underdeveloped (only 128 Google Scholar citations to date for the original ScorePAGE paper; Rahnenführer et al., 2004), and there are not many other methods available for metabolic pathway analysis. However, the topology-based analysis of signaling pathways has been very successful (over 1,200 citations to date for the two impact analysis papers mentioned above; Draghici et al., 2007; Tarca et al., 2009), and over 30 other methods have been developed since.

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.

Publicly available curated pathway databases used by methods listed in Table 8.25.2 are KEGG (Ogata et al., 1999), NCI-PID (Schaefer et al., 2009), BioCarta (https://www.biocarta.com), WikiPathways (Pico et al., 2008), PANTHER (Mi et al., 2005), and Reactome (Joshi-Tope et al., 2005). These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG includes only about 5,000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive or negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation when the bait-prey relation is considered (top) or undirected when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database or may accept several.
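As a minimal sketch of the parsing step described above, the following Python snippet turns a list of directed interactions into an adjacency matrix. The gene names and the `edges_to_adjacency` helper are hypothetical illustrations; they are not part of SPIA or of any other surveyed package.

```python
def edges_to_adjacency(edges):
    """Build a directed adjacency matrix from (source, target) pairs."""
    # Collect node names in a stable, sorted order.
    nodes = sorted({name for edge in edges for name in edge})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        # Rows are sources, columns are targets.
        matrix[index[source]][index[target]] = 1
    return nodes, matrix


# Hypothetical three-gene cascade: A acts on B, and B acts on C.
nodes, adjacency = edges_to_adjacency([("A", "B"), ("B", "C")])
```

Keeping the node order alongside the matrix makes it possible to map rows and columns back to gene identifiers after the analysis.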

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than single-type models, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe "all" and "any" relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connections only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find "important" submodules, i.e., those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the null hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is re-tested on each of them.
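The moralization step can be sketched in a few lines of Python. This is only an illustration of the graph operation as defined in the bracketed note; the `moral_graph` function and the toy DAG are hypothetical and do not reproduce the TopologyGSA implementation.

```python
from itertools import combinations


def moral_graph(dag):
    """Return the moral graph of a DAG as a set of undirected edges.

    `dag` maps each node to the list of its children.
    """
    parents = {}
    undirected = set()
    for parent, children in dag.items():
        for child in children:
            parents.setdefault(child, set()).add(parent)
            # Drop the direction of every original edge.
            undirected.add(frozenset((parent, child)))
    # "Marry" every pair of parents that share a child.
    for shared in parents.values():
        for a, b in combinations(sorted(shared), 2):
            undirected.add(frozenset((a, b)))
    return undirected


# Hypothetical DAG: A -> C and B -> C, so A and B become married.
edges = moral_graph({"A": ["C"], "B": ["C"], "C": []})
```

Representing each undirected edge as a `frozenset` avoids storing the same edge twice in opposite orientations.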

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.
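To make the AND/OR distinction concrete, the sketch below evaluates a single interaction node from the Boolean states of its input components. It is a deliberately simplified, hypothetical illustration (the function name and gate labels are assumptions) and is not how PathOlogist or PARADIGM compute their scores.

```python
def interaction_active(gate, component_states):
    """Evaluate an interaction node from the states of its input components."""
    if gate == "AND":
        # "All" relation: every input component must be active.
        return all(component_states)
    if gate == "OR":
        # "Any" relation: one active input component is enough.
        return any(component_states)
    raise ValueError(f"unknown gate: {gate}")


# Hypothetical inputs: one component is active, the other is not.
complex_formed = interaction_active("AND", [True, False])
either_ligand = interaction_active("OR", [True, False])
```

A single-type graph would draw the same two incoming edges in both cases, which is why it cannot distinguish the two relations.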

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for scoring algorithm categories.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.

Node level scoring

Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures, (iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the network adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.

Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in the multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian networks, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.
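Two of the graph measures described above can be sketched directly from their definitions. The functions and the toy pathway below are hypothetical; none of the listed tools is claimed to compute its node scores in exactly this way.

```python
from collections import deque


def degree_centrality(edges):
    """Count the directed edges entering and leaving each node."""
    degree = {}
    for source, target in edges:
        degree[source] = degree.get(source, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    return degree


def downstream_distances(edges, start):
    """Breadth-first shortest distances from `start`, following edge direction."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in distance:
                distance[child] = distance[node] + 1
                queue.append(child)
    return distance


# Hypothetical pathway: A -> B, A -> C, B -> D, C -> D, D -> E.
toy_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
degrees = degree_centrality(toy_edges)
reach = downstream_distances(toy_edges, "A")
```

Summing the values in `reach` gives the total shortest-path distance from `A` to its downstream genes, the quantity on which closeness-type measures are built.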

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profile across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
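The ratio described above (a similarity measure divided by the shortest path distance) can be illustrated with Pearson correlation as the similarity measure. This is a sketch under that assumption; the function names, expression profiles, and toy network are hypothetical, and the exact statistics used by ScorePAGE and PWEA may differ.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def shortest_path_length(neighbors, start, goal):
    """Breadth-first search over an undirected adjacency dictionary."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distance[node]
        for other in neighbors.get(node, []):
            if other not in distance:
                distance[other] = distance[node] + 1
                queue.append(other)
    return None  # the two genes are not connected


def pair_score(profiles, neighbors, gene_a, gene_b):
    """Similarity of a gene pair divided by its network distance."""
    distance = shortest_path_length(neighbors, gene_a, gene_b)
    if not distance:
        # Unconnected pairs (None) and self-pairs (0) get no score here.
        return 0.0
    return pearson(profiles[gene_a], profiles[gene_b]) / distance


# Hypothetical profiles over four samples and a chain g1 - g2 - g3.
profiles = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
neighbors = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2"]}
adjacent = pair_score(profiles, neighbors, "g1", "g2")
distant = pair_score(profiles, neighbors, "g1", "g3")
```

Dividing by the distance means that a strong correlation between distant genes contributes less than the same correlation between direct neighbors.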

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables, and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions, such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes and meaningful changes that are unlikely to occur just by chance and that therefore are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
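The text does not prescribe a particular correction for multiple comparisons. As one common choice, the sketch below implements the Benjamini-Hochberg false discovery rate adjustment; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four pathways.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

A pathway would then be called significant when its adjusted p-value falls below the user-defined threshold.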

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
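For the case where the pathway score is the number of DE genes on the pathway, the hypergeometric tail probability can be computed as in the following sketch. The function name and the gene counts are hypothetical, and the calculation inherits the independence assumption criticized above.

```python
from math import comb


def hypergeometric_pvalue(total_genes, de_genes, pathway_size, de_in_pathway):
    """P(X >= de_in_pathway) for a pathway of `pathway_size` genes drawn
    without replacement from `total_genes`, of which `de_genes` are DE."""
    upper = min(de_genes, pathway_size)
    denominator = comb(total_genes, pathway_size)
    numerator = sum(
        comb(de_genes, k) * comb(total_genes - de_genes, pathway_size - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return numerator / denominator


# Hypothetical experiment: 20 measured genes, 5 of them DE,
# and a 4-gene pathway that contains 2 DE genes.
p = hypergeometric_pvalue(total_genes=20, de_genes=5, pathway_size=4, de_in_pathway=2)
```

Using exact integer binomial coefficients keeps the sketch free of external dependencies; for genome-scale counts a statistics library would be the more practical choice.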

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume that the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable is assigned to an edge and captures the probability that the up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probabilities follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes that the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This package models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits two linear models to each edge of a pathway (one for case and one for control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925).
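As a point of reference, the classical unweighted Fisher's method combines k independent p-values through the statistic -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below implements only this unweighted form; the weighting scheme used by DRAGEN is not reproduced, and the function name and inputs are hypothetical.

```python
from math import exp, log


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with the unweighted Fisher's method."""
    k = len(pvalues)
    statistic = -2.0 * sum(log(p) for p in pvalues)
    # For an even number of degrees of freedom (2k), the chi-square
    # survival function has a closed form: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return exp(-half) * total


# Hypothetical p-values for the two edges of a small pathway.
combined = fisher_combined_pvalue([0.05, 0.05])
```

With a single input the function returns that p-value unchanged, which is a quick sanity check on the closed form.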


GGEA, on the other hand, uses Petri nets (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and a Petri net with fuzzy logic (PNFL; Küffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
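Comparing an observed statistic against a permutation null distribution reduces to counting how often the permuted statistics are at least as extreme as the observed one. The sketch below assumes that larger statistics are more extreme and adds one to both counts, a common convention that avoids p-values of exactly zero; the function name and the numbers are hypothetical and are not taken from DRAGEN or GGEA.

```python
def empirical_pvalue(observed, null_statistics):
    """One-sided permutation p-value with the add-one correction."""
    exceed = sum(1 for value in null_statistics if value >= observed)
    return (exceed + 1) / (len(null_statistics) + 1)


# Hypothetical summary statistics from nine permutations.
null_statistics = [0.1, 0.7, 1.2, 2.6, 3.0, 0.4, 1.9, 0.2, 0.9]
p_pathway = empirical_pvalue(2.5, null_statistics)
```

In practice, thousands of permutations are needed to resolve small p-values, since the smallest attainable value is 1 / (number of permutations + 1).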

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000 despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG-U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis. Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.
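The random relabeling of pooled control samples can be sketched as follows. The sample identifiers, the helper name, the seed, and the small number of repetitions are hypothetical placeholders; the simulation described in the text repeats each design 10,000 times and then runs a pathway analysis method on every simulated dataset.

```python
import random


def random_null_split(sample_ids, n_control, n_disease, rng):
    """Randomly relabel pooled control samples as 'control' and 'disease'."""
    chosen = rng.sample(sample_ids, n_control + n_disease)
    return chosen[:n_control], chosen[n_control:]


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
pooled = [f"sample_{i}" for i in range(140)]  # hypothetical identifiers

# Five simulated datasets with 10 "control" and 20 "disease" samples each.
splits = [random_null_split(pooled, 10, 20, rng) for _ in range(5)]
```

Because both groups are drawn from the same pool of healthy samples, any pathway reported as significant on such a dataset is a false positive by construction.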

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
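One way to quantify how far a pathway's empirical null distribution departs from uniformity is the Kolmogorov-Smirnov distance from Uniform(0, 1), together with the fraction of null p-values that fall below the significance threshold. The sketch below uses synthetic p-values for three hypothetical pathways; none of the numbers are taken from Figure 8.25.4, and the helper names are illustrative.

```python
def ks_distance_from_uniform(pvals):
    """Kolmogorov-Smirnov distance between the empirical CDF of pvals and Uniform(0, 1)."""
    xs = sorted(pvals)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def false_positive_rate(pvals, alpha=0.05):
    """Fraction of null p-values that would be declared significant at level alpha."""
    return sum(p < alpha for p in pvals) / len(pvals)

# Synthetic null p-values for three hypothetical pathways
grid = [(i + 0.5) / 1000 for i in range(1000)]   # evenly spread, close to uniform
toward_zero = [u ** 3 for u in grid]             # mass piled up near zero
toward_one = [1 - (1 - u) ** 3 for u in grid]    # mass piled up near one

for name, p in [("uniform", grid), ("toward zero", toward_zero), ("toward one", toward_one)]:
    print(name, round(ks_distance_from_uniform(p), 3), round(false_positive_rate(p), 3))
```

A pathway whose null p-values pile up near zero shows a false positive rate far above the nominal level, while one whose null p-values pile up near one almost never reaches significance.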

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example, A1-A3, are biased towards zero, and the lower three distributions, for example, A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex biomolecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.
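To make the difference between the two graph models concrete, the following sketch flattens a process-node representation into direct gene-to-gene edges. The record layout, the gene labels A to E, and the function name are all hypothetical; they only loosely mirror the idea of process nodes and are not the NCI-PID schema.

```python
from itertools import product

# Hypothetical process-node records: each process links its inputs to its outputs
processes = [
    {"id": "proc1", "inputs": ["A", "B"], "outputs": ["C"], "effect": "activation"},
    {"id": "proc2", "inputs": ["C"], "outputs": ["D", "E"], "effect": "activation"},
]

def flatten_process_nodes(processes):
    """Replace each process node by direct input -> output edges (edge-only view)."""
    edges = set()
    for proc in processes:
        for src, dst in product(proc["inputs"], proc["outputs"]):
            edges.add((src, dst, proc["effect"]))
    return sorted(edges)

edges = flatten_process_nodes(processes)
for edge in edges:
    print(edge)
```

Note that the flattened view can no longer say whether A and B are jointly required to produce C or whether either one suffices, which is exactly the kind of information that is lost when a method forces one graph model onto another.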

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways," in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.
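As a rough illustration of what a conversion tool has to do, the sketch below reads a small hand-written fragment that follows the KGML element and attribute names (entry, relation, subtype) and turns it into a format-neutral edge list. The fragment is a toy, the gene identifiers are made up, and the function name is hypothetical; a real converter would also need to handle groups, compounds, reactions, and multi-gene entries.

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment using KGML-style elements; not a real KEGG pathway
KGML_SNIPPET = """
<pathway name="path:example" org="hsa" number="00000" title="Toy pathway">
  <entry id="1" name="hsa:1111" type="gene"/>
  <entry id="2" name="hsa:2222" type="gene"/>
  <entry id="3" name="hsa:3333" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="inhibition"/>
  </relation>
</pathway>
"""

def kgml_to_edges(kgml_text):
    """Return (source, target, interaction) tuples from KGML-style relations."""
    root = ET.fromstring(kgml_text.strip())
    names = {entry.get("id"): entry.get("name") for entry in root.findall("entry")}
    edges = []
    for rel in root.findall("relation"):
        # A relation may carry several subtypes; emit one edge per subtype
        subtypes = [s.get("name") for s in rel.findall("subtype")] or ["unknown"]
        for subtype in subtypes:
            edges.append((names[rel.get("entry1")], names[rel.get("entry2")], subtype))
    return edges

for edge in kgml_to_edges(KGML_SNIPPET):
    print(edge)
```

An edge list of this kind could serve as the common intermediate form, with one reader per source format feeding the same downstream analysis.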

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013), using 42 real datasets. This approach uses a target pathway, which is the pathway that describes the condition under study. For instance, in an experiment comparing colon cancer with healthy samples, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other more recent papers.
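The target-pathway criterion can be expressed compactly: for each dataset, record the rank and the p-value of the target pathway in the method output, then summarize across datasets. The sketch below uses made-up result tables for a single hypothetical method; the pathway names and p-values are purely illustrative, and summarizing by the median is an assumption of this sketch rather than a prescription from the cited benchmark.

```python
from statistics import median

def target_rank(pathway_pvalues, target):
    """Rank (1 = most significant) and p-value of the target pathway in one result table."""
    ordered = sorted(pathway_pvalues, key=pathway_pvalues.get)
    return ordered.index(target) + 1, pathway_pvalues[target]

# Made-up outputs of one method on three datasets, each with a designated target pathway
results = [
    ({"Colorectal cancer": 0.001, "Cell cycle": 0.020, "Apoptosis": 0.300}, "Colorectal cancer"),
    ({"Alzheimer disease": 0.040, "Cell cycle": 0.010, "Apoptosis": 0.500}, "Alzheimer disease"),
    ({"Type II diabetes": 0.200, "Cell cycle": 0.030, "Apoptosis": 0.004}, "Type II diabetes"),
]

ranks, target_pvals = zip(*(target_rank(table, target) for table, target in results))
summary = {"median_rank": median(ranks), "median_p": median(target_pvals)}
print(ranks, summary)
```

Lower median rank and lower median p-value indicate a better method under this criterion; ties in p-values are broken here by input order, which a real benchmark would need to handle explicitly.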

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated not with a single target pathway but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both "AND" and "OR" gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.
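The sensitivity to cutoff parameters is easy to see with a toy over-representation test. In the sketch below, the gene-level p-values are made up so that the pathway genes sit just above a strict threshold; the hypergeometric tail probability is the standard ORA statistic, while the gene names and the helper function are hypothetical.

```python
from math import comb

def ora_p_value(n_genes, n_pathway, n_de, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_genes, n_pathway, n_de)."""
    upper = min(n_pathway, n_de)
    total = comb(n_genes, n_de)
    tail = sum(
        comb(n_pathway, k) * comb(n_genes - n_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Made-up gene-level p-values: 10 pathway genes and 90 other genes
gene_p = {}
for i in range(10):
    gene_p[f"path_{i}"] = 0.02 + 0.002 * i    # moderately significant pathway genes
for i in range(5):
    gene_p[f"other_{i}"] = 0.001 * (i + 1)    # a few strongly significant other genes
for i in range(5, 90):
    gene_p[f"other_{i}"] = 0.5                # unchanged genes
pathway = {g for g in gene_p if g.startswith("path_")}

for cutoff in (0.01, 0.05):
    de = {g for g, p in gene_p.items() if p < cutoff}
    p = ora_p_value(len(gene_p), len(pathway), len(de), len(de & pathway))
    print(f"cutoff={cutoff}: {len(de)} DE genes, {len(de & pathway)} on pathway, ORA p={p:.3g}")
```

With the stricter cutoff no pathway gene is selected and the pathway is not significant at all, whereas the looser cutoff selects every pathway gene and makes the same pathway highly significant, even though the underlying measurements are identical.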

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, MD Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., … Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., … Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., … Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., … Sansone, S.-A. (2003). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., … Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., … Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., … D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., … Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., … Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., … Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., … Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., … Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., … Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., … Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., … Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., … Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions–applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., … Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., … Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., … Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., … Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., … Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., … Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., … Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0–making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., … Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.



Figure 8.25.1 Pathways are more than gene sets. Panel A shows the graphical representation of the ERBB signaling pathway from the KEGG database, while panel B shows the set of genes on the pathway. The graph in panel A contains important information regarding gene product (protein) localization; gene, protein, or metabolite interactions and the types of these interactions (activation, repression, etc.); the direction of the signal propagation; etc. Over-representation analysis (ORA) and functional class scoring (FCS) approaches are unable to exploit this information. As such, for the same molecular measurement, these approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discovery. In contrast, network-based approaches are able to take into account the topological order of the genes and their interactions.

simple gene sets based on a function, process, or component (e.g., the Molecular Signatures Database, MSigDB; Liberzon et al., 2011), organized in a hierarchical structure that contains information about the relationship between the various modules, as found in the Gene Ontology (Ashburner et al., 2000), or organized into pathways that describe in detail all known interactions between the various genes that are involved in a certain phenomenon. Biological processes in which genes are known to interact with each other are accumulated in public databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa, Furumichi, Tanabe, Sato, & Morishima, 2017; Kanehisa & Goto, 2000), Reactome (Croft et al., 2014), and BioCarta (http://www.biocarta.com).

Concurrently, statistical methods have been developed to identify the functional modules or pathways that are impacted, based on the differential expression evidence. They allow us to gain insights into the functional mechanisms of cells beyond the detection of differentially expressed genes. The earliest approaches use Over-Representation Analysis (ORA; Draghici, Khatri, Martins, Ostermeier, & Krawetz, 2003; Tavazoie, Hughes, Campbell, Cho, & Church, 1999) to identify gene sets that have more DE genes than expected

Table 8.25.1 Pathway Analysis Tools

| Method | Availability^a | HIPAA^b | License^c | Code^d | Year | Reference |
|---|---|---|---|---|---|---|
| Signaling pathway analysis using gene expression | | | | | | |
| MetaCore^e | Web (https://www.genego.com/metacore.php) | No | Thomson Reuters | Java | 2004 | NA |
| Pathway-Express | Standalone (Bioconductor), Web (http://vortex.cs.wayne.edu); superseded by ROntoTools | No | Free^f | Java, R | 2005 | Draghici et al., 2007; Khatri et al., 2007 |
| PathOlogist | Standalone (ftp://ftp1.nci.nih.gov/pub/pathologist) | No | Free | MATLAB | 2007 | Efroni, Schaefer, & Buetow, 2007; Greenblum, Efroni, Schaefer, & Buetow, 2011 |
| iPathwayGuide^e | Web (https://www.advaitabio.com/products.html) | Yes | Advaita Corp. | Java, R | 2009 | NA |
| SPIA | Standalone (Bioconductor) | No | GPL (2) | R | 2009 | Tarca et al., 2009 |
| NetGSA | Standalone (https://www.biostat.washington.edu/~ashojaie/software) | No | GPL-2 | R | 2009 | Shojaie & Michailidis, 2009; Shojaie & Michailidis, 2010 |
| PWEA | Standalone (https://zlab.bu.edu/PWEA) | No | Free^f | C++ | 2010 | Hung et al., 2010 |
| TopoGSA | Web (https://www.infobiotics.net/topogsa) | No | Free^f | PHP, R | 2010 | Glaab, Baudot, Krasnogor, & Valencia, 2010 |
| TopologyGSA | Standalone (CRAN) | No | AGPL-3 | R | 2010 | Massa, Chiogna, & Romualdi, 2010 |
| DEGraph | Standalone (Bioconductor) | No | GPL-3 | R | 2010 | Jacob, Neuvial, & Dudoit, 2010 |
| GGEA | Standalone (Bioconductor) | No | Artistic-2.0 | R | 2011 | Geistlinger, Csaba, Kuffner, Mulder, & Zimmer, 2011 |
| BPA | Standalone (https://bumil.boun.edu.tr/bpa) | No | Free^f | MATLAB | 2011 | Isci, Ozturk, Jones, & Otu, 2011 |
| GANPA | Standalone (CRAN) | No | GPL-2 | R | 2011 | Fang, Tian, & Ji, 2011 |
| ROntoTools^e | Standalone (Bioconductor) | No | CC BY-NC-ND 4 | R | 2012 | Voichita et al., 2012 |
| BAPA-IGGFD | No implementation available | No | NA | R | 2012 | Zhao et al., 2012 |
| CePa | Standalone (CRAN), Web (https://mcube.nju.edu.cn/cgi-bin/cepa/main.pl) | No | GPL (2) | R | 2012 | Gu, Liu, Cao, Zhang, & Wang, 2012 |
| THINK-Back-DS | Standalone, Web (https://eecs.umich.edu/db/think/software.html) | No | Free^f | Java | 2012 | Farfan, Ma, Sartor, Michailidis, & Jagadish, 2012 |
| TBScore | No implementation available | No | NA | NA | 2012 | Ibrahim, Jassim, Cawthorne, & Langlands, 2012 |
| ACST | Standalone (available as article supplemental) | No | NA | R | 2012 | Mieczkowski, Swiatek-Machado, & Kaminska, 2012 |
| EnrichNet | Web (https://www.enrichnet.org) | No | Free | PHP | 2012 | Glaab, Baudot, Krasnogor, Schneider, & Valencia, 2012 |
| clipper | Standalone (https://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2013 | Martini, Sales, Massa, Chiogna, & Romualdi, 2013 |
| DEAP | Standalone (available as article supplemental) | No | GNU Lesser GPL | Python | 2013 | Haynes, Higdon, Stanberry, Collins, & Kolker, 2013 |
| DRAGEN | Standalone (https://bioinfo.au.tsinghua.edu.cn/dragen) | No | NA | C++ | 2014 | Ma, Jiang, & Jiang, 2014 |
| ToPASeq^g | Standalone (Bioconductor) | No | AGPL-3 | R | 2016 | Ihnatova & Budinska, 2015 |
| pDis | Standalone (Bioconductor) | No | Free^f | R | 2016 | Ansari, Voichita, Donato, Tagett, & Draghici, 2017 |
| SPATIAL | No implementation available | No | NA | NA | 2016 | Bokanizad, Tagett, Ansari, Helmi, & Draghici, 2016 |
| BLMA^h | Standalone (Bioconductor) | No | GPL (2) | R | 2017 | Nguyen, Tagett, Donato, Mitrea, & Draghici, 2016; also see Internet Resources |
| Signaling pathway analysis using multiple types of data | | | | | | |
| PARADIGM | Standalone (https://sbenz.github.com/Paradigm) | No | UCSC-CGB, free^f | C | 2010 | Vaske et al., 2010 |
| microGraphite | Standalone (https://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2014 | Calura et al., 2014 |
| mirIntegrator | Standalone (Bioconductor) | No | GPL-3 | R | 2016 | Diaz et al., 2016; Nguyen et al., 2016 |
| Metabolic pathway analysis | | | | | | |
| ScorePAGE | No implementation available | No | NA | NA | 2004 | Rahnenfuhrer et al., 2004 |
| TAPPA | No implementation available | No | NA | NA | 2007 | Gao & Wang, 2007 |
| MetPA | Web (https://metpa.metabolomics.ca) | No | GPL (2) | PHP, R | 2010 | Xia & Wishart, 2010 |
| MetaboAnalyst | Web (https://metpa.metabolomics.ca) | No | GPL (2) | R | 2011 | Xia & Wishart, 2011; Xia, Sinelnikov, Han, & Wishart, 2015 |

a Availability is a criterion that describes the implementation of the method as standalone or Web-based.
b HIPAA provides information about HIPAA compliance.
c License provides information about the type of the software license. GPL is an abbreviation for the GNU General Public License. AGPL is an abbreviation for the GNU Affero General Public License. CC BY-NC-ND 4 is an abbreviation of Creative Commons Attribution–NonCommercial-NoDerivatives 4.0 International Public License.
d Code shows the programming language used for the method implementation.
e Commercial method.
f Free for academic and non-commercial use. UCSC-CGB is the University of California Santa Cruz Cancer Genome Browser.
g ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA.
h BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

by chance. The drawbacks of this type of approach include: (i) it only considers the number of DE genes and completely ignores the magnitude of the actual expression changes, resulting in information loss; (ii) it assumes that genes are independent, which they are not (since the pathways are graphs that describe precisely how these genes influence each other); and (iii) it ignores the interactions between various genes and modules. Functional Class Scoring (FCS) approaches, such as Gene Set Enrichment Analysis (GSEA; Subramanian et al., 2005) and Gene Set Analysis (GSA; Efron & Tibshirani, 2007), have been developed to address some of the issues raised by ORA approaches. The main improvement of FCS is the observation that small but coordinated changes in expression of functionally related genes can have significant impact on pathways.
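As an illustration of the ORA idea described above, the following minimal Python sketch computes an upper-tail hypergeometric p-value for a single gene set. The function name, argument names, and the example counts are assumptions made for illustration only; they are not taken from any of the surveyed tools. Note that the result depends only on four counts, which is exactly why ORA cannot distinguish between two pathways with the same number of DE genes but different topologies.

```python
from math import comb


def ora_p_value(n_measured, n_de, pathway_size, de_in_pathway):
    """Probability of seeing at least `de_in_pathway` DE genes in a pathway
    of `pathway_size` genes when `n_de` of `n_measured` genes are DE
    (hypergeometric upper tail, i.e. sampling without replacement)."""
    max_hits = min(n_de, pathway_size)
    total = comb(n_measured, n_de)
    tail = sum(
        comb(pathway_size, k) * comb(n_measured - pathway_size, n_de - k)
        for k in range(de_in_pathway, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 2,000 measured genes, 100 DE genes,
# a pathway with 40 genes, 6 of which are DE.
p = ora_p_value(n_measured=2000, n_de=100, pathway_size=40, de_in_pathway=6)
print(f"ORA p-value: {p:.4g}")
```

The gene-level fold changes and the pathway graph never enter the calculation, which corresponds to drawbacks (i) and (iii) listed above.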

ORA and FCS approaches are often referred to as gene set enrichment methods. Comprehensive lists of gene set analysis approaches, as well as comparisons between them, can be found in well-developed surveys (Emmert-Streib & Glazko, 2011; Kelder, Conklin, Evelo, & Pico, 2010; Khatri, Sirota, & Butte, 2012; Misman et al., 2009). While useful for the purpose for which they have been developed (to analyze sets of genes), these methods completely ignore the topology and interaction between genes. Topology-based approaches, which fully exploit all the knowledge about how genes interact as described by pathways, have been developed more recently. The first such techniques were ScorePAGE (Rahnenfuhrer, Domingues, Maydt, & Lengauer, 2004) for metabolic pathways and Impact Analysis (Draghici et al., 2007; Tarca et al., 2009) for signaling pathways. Figure 8.25.1 shows an example pathway, the ERBB signaling pathway, in which the nodes represent genes and compounds while the edges represent the known interactions between them. Gene set analysis approaches are not able to account for the topological order of genes, nor are they able to explain the signal propagation and mechanisms of the pathway. These approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discovery. In contrast, network-based pathway analysis can distinguish between this pathway and any other pathway with the same proportion of DE genes.

Here, we provide a survey of 34 network-based methods developed for pathway analysis. Our survey of commercial tools for pathway analysis found iPathwayGuide (Advaita Corporation, https://www.advaitabio.com) and MetaCore (Thomson Reuters, https://www.thomsonreuters.com) to be topology based. Other commercial tools, such as Ingenuity Pathway Analysis (https://www.ingenuity.com) or Genomatix (https://www.genomatix.de), do not use pathway topology information and thus are not included in this review. The existence of only two commercial tools is more evidence of the challenge faced by the developers of such methods due to lack of standards.

In this document, we categorize and compare the methods based on the following criteria: the type of input the method accepts, the graph model, and the statistical approaches used to evaluate gene and pathway changes. Under the heading "Experiment Input and Pathway Databases," below, we describe the types of input the surveyed methods use. Under the heading "Graph Models," we provide details regarding the graph models used for the biological networks. Under "Pathway Scoring Strategies," we discuss the statistical methods used by the surveyed methods to assess gene and pathway changes between two phenotypes. Under "Challenges in Pathway Analysis," we discuss a number of outstanding challenges that need to be addressed in order to improve the reliability as well as the relevance of the next-generation pathway analysis approaches. We discuss the key elements and limitations of the surveyed methods without going into details of each technique. For a more detailed review and technical descriptions of network-based pathway analysis methods, please refer to Mitrea et al. (2013) or referenced manuscripts (Table 8.25.1).

NETWORK-BASED PATHWAY ANALYSIS

The term pathway analysis is used in a very broad context in the literature, including biological network construction and inference. Here, we focus on methods that are able to exploit biological knowledge in public repositories, rather than on network inference approaches that attempt to infer or reconstruct pathways from molecular measurements. Table 8.25.1 shows the list of 34 pathway analysis approaches together with their availability and licensing. In this section, we start by describing the availability of each method before discussing the input format, graph model, and underlying statistical approaches.

Software Availability and Implementation

We often think that the main strength of an approach lies in its novelty and algorithm efficiency. However, the implementation and availability of a tool have become increasingly important for several reasons. First, software availability and version control are crucial for reproducing the experimental results that were used to assess the performance of the approach (Sandve, Nekrutenko, Taylor, & Hovig, 2013). For this reason, many journals request authors to make their software and data available before accepting methods articles. Second, if a software application is not ready-to-run, it is very unlikely that the intended audience (mostly life scientists) will invest the time to understand and implement complex algorithms. Practicality, user-friendliness, output format, and type of interface are all to be considered. Depending on the desired availability and intended audience, a software package may be implemented as standalone or Web based. Among the 34 approaches, there are 30 that are available either as a standalone software package or Web service.

Typically, standalone tools need to be installed on local machines or servers, which often requires some administrative skills. Most standalone tools depend on full or partial copies of public pathway databases, stored locally, and need to be updated periodically. Advantages of standalone tools include: (i) instant availability that does not require Internet access, and (ii) the security and privacy of the experimental data. Web-based tools, on the other hand, run the analyses on a remote server, providing computational power and a graphical interface. The major advantage of Web-based tools is that they are user-friendly and do not require a separate local installation. From the accessibility perspective, Web-based tools have the advantage of being available from any location, as long as there is an Internet connection and a browser available. Also, updates are almost transparent to the client. This makes the user's task easy and enables collaboration, since users all over the world can utilize the same method without the burden of installing it or keeping it up-to-date. There are methods that provide both Web-based and standalone implementations.

HIPAA compliance may also be a factor in certain applications that involve data coming from patients or data linked to other clinical data or clinical records. Currently, iPathwayGuide is the only available tool that performs topological pathway analysis and is HIPAA compliant.

The programming language and style used for implementation also play an important role in the acceptance of a method. Software tools that are neatly implemented and packaged are more appealing compared to those that do not have ready-to-use implementations. Many of the methods are implemented in the R programming language and are available as software packages from Bioconductor, CRAN, or the author's Web site. Their popularity among biologists and bioinformaticians is due to the fact that many bioinformatics-dedicated packages are available in R. In addition, the rigorous review procedure provided by Bioconductor and CRAN makes the software packages more standardized and reliable.

Experiment Input and Pathway Databases

A pathway analysis approach typically requires two types of input: experiment data and known biological networks. Experiment data is usually collected from high-throughput experiments that compare a condition phenotype to a control phenotype. A condition phenotype can be a disease, a drug treatment, or the knock-out (KO) of a gene, while a control phenotype can be a healthy state, a different disease or drug treatment, or wild-type (non-KO) samples. Experimental data can be obtained from multiple technologies that produce different types of data: gene expression, protein abundance, metabolite concentration, miRNA expression, etc. Biological networks or pathways are often represented in the form of graphs that capture our current knowledge about the interactions of genes, proteins, metabolites, or compounds in an organism. The pathway data is accumulated, updated, and refined by amassing knowledge from scientific literature describing individual interactions or high-throughput experiment results.

Table 8.25.2 shows a summary of input format and mathematical modeling of the surveyed approaches. Most pathway analysis methods analyze data from high-throughput experiments, such as microarrays, next-generation sequencing, or proteomics. They accept either a list of gene IDs or a list of such gene IDs associated with measured changes. These changes could be measured with different technologies and therefore can serve as proxies for different biochemical entities. For instance, one could use gene expression changes measured with microarrays, or protein levels measured with a proteomic approach, etc.

Different analysis methods use different input formats. Many methods accept a list of all genes considered in the experiment together with their expression values. Some analysis methods select a subset of genes considered to be differentially expressed (DE) based on a predefined cut-off. The cut-off is typically applied on fold-change, p-value, or both. These methods use the list of DE genes and their corresponding statistics (fold-change, p-value) as input. Other methods use only the list of DE genes, without corresponding expression values, because their scoring methods are based only on the relative positions of the genes in the graph.

Methods that use cut-offs are sensitive to the chosen threshold value, because a small change in the cut-off may drastically change the number of selected genes (Nam & Kim, 2008). In addition, they typically use the most significant genes and discard the rest, whose weaker but coordinated changes may also have significant impact on pathways. Genes with moderate differential expression may be lost, even though they might be important players in the impacted pathways (Ben-Shaul, Bergman, & Soreq, 2005). Furthermore, the genes included in the set of DE genes can vary dramatically if the selection methods are changed. Hence, the results of pathway analyses based on DE genes may be vastly different depending on both the selection method and the threshold value (Pan, Lih, & Cohen, 2005). Furthermore, for the same disease,

Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods

| Method | Experiment input^a | Pathway database^b | Graph model^c | Pathway scoring |
|---|---|---|---|---|
| Signaling pathway analysis using gene expression | | | | |
| MetaCore | DE genes | Proprietary canonical pathway, genome-scale network | Single-type, directed | Hierarchically aggregated |
| Pathway-Express | DE genes, change or measured genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| PathOlogist | Measured genes, expression | KEGG | Multi-type, directed | Hierarchically aggregated |
| iPathwayGuide | DE genes, change or measured genes, change | KEGG signaling, Reactome, NCI, BioCarta | Single-type, directed | Hierarchically aggregated |
| SPIA | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| NetGSA | Measured genes, expression | KEGG signaling | Single-type, directed | Multivariate analysis |
| PWEA | Measured genes, expression | YeastNet | Single-type, undirected | Hierarchically aggregated |
| TopoGSA | DE genes | PPI network, KEGG | Single-type, undirected | Hierarchically aggregated |
| TopologyGSA | Measured genes, expression | NCI-PID | Single-type, undirected | Multivariate analysis |
| DEGraph | Measured genes, expression | KEGG | Single-type, undirected | Multivariate analysis |
| GGEA | Measured genes, expression | KEGG | Single-type, directed | Aggregate fuzzy similarity |
| BPA | Measured genes, expression, with cut-off | NCI-PID | Single-type, DAG | Bayesian network |
| GANPA | DE genes, change or measured genes, expression | PPI network, KEGG, Reactome, NCI-PID, HumanCyc | Single-type, undirected | Hierarchically aggregated |
| ROntoTools | DE genes, change or measured gene, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BAPA-IGGFD | Measured genes, expression, with cut-off | Literature-based interaction database, KEGG, WikiPathways, Reactome, MSigDB, GO BP, PANTHER; constructed gene association network from PPIs, co-annotation in GO Biological Process (BP), and co-expression in microarray data | Single-type, DAG | Bayesian network |
| CePa | DE genes, expression or measured genes, expression | NCI-PID | Single-type, directed | Hierarchically aggregated |
| THINK-Back-DS | DE genes, change; measured genes, expression | KEGG, PANTHER, BioCarta, Reactome, GenMAPP | Single-type, directed | Hierarchically aggregated |
| TBScore | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| ACST | Measured genes, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| EnrichNet | DE genes, list | PPI network, KEGG, BioCarta, WikiPathways, Reactome, NCI-PID, InterPro, GO with STRING 9.0 | Single-type, undirected | Hierarchically aggregated |
| clipper | Measured genes, expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| DEAP | Measured genes, expression | KEGG, Reactome | Single-type, directed | Hierarchically aggregated |
| DRAGEN | Measured genes, expression | RegulonDB, M3D, HTRIdb, ENCODE, MSigDB | Single-type, directed | Linear regression |
| ToPASeq^e | Measured genes, expression | KEGG | Single-type, directed | Hierarchical & multivariate |
| pDis | DE genes, change or measured genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| SPATIAL | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BLMA^f | Measured genes, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| Signaling pathway analysis using multiple types of data | | | | |
| PARADIGM | Measured genes, expression, copy number, and protein levels | Constructed PPI networks from MIPS, DIP, BIND, HPRD, IntAct, and BioGRID | Multi-type, directed | Hierarchically aggregated |
| microGraphite | Measured gene expression and miRNA expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| mirIntegrator | DE genes, change and DE miRNA, change | KEGG signaling, miRTarBase | Single-type, directed | Hierarchically aggregated |
| Metabolic pathway analysis | | | | |
| ScorePAGE | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| TAPPA | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| MetPA | DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |
| MetaboAnalyst | DE genes, change and DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |

a Experiment input shows the experiment data input format for each method. "DE" means differentially expressed. "Change" means fold-change value or t-statistics when comparing gene/metabolite values between two phenotypes. "Measured" means the list of all the genes/metabolites measured in the experiment. "List" represents a list of gene/metabolite identifiers (e.g., symbols). "With cut-off" marks methods that take as input the list of all measured genes and mark the DE genes during the analysis.
b Pathway database shows the name of the knowledge source for biological interactions.
c Graph model is a characteristic that shows if the graph has one or multiple types of nodes, as well as if it is directed or undirected.
d The package ROntoTools (Bioconductor) allows for analysis using expression values of all genes.
e ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA.
f BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

independent studies or measurements often produce different sets of differentially expressed (DE) genes (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Ein-Dor, Zuk, & Domany, 2006; Tan et al., 2003). This makes approaches that use DE genes as input appear even more unreliable.
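The sensitivity to cut-offs discussed in this section can be illustrated with a small Python sketch. The gene names, fold changes, and p-values below are invented for illustration, and `select_de` is a hypothetical helper rather than a function from any surveyed package. It shows how a slight relaxation of the fold-change and p-value thresholds changes the list of DE genes that would be passed to a pathway analysis method.

```python
# Hypothetical per-gene statistics: (gene, log2 fold-change, p-value).
GENES = [
    ("G1", 2.10, 0.001),
    ("G2", 1.05, 0.004),
    ("G3", 0.98, 0.003),
    ("G4", -1.02, 0.049),
    ("G5", -0.95, 0.051),
    ("G6", 0.40, 0.200),
]


def select_de(genes, min_abs_lfc, max_p):
    """Return the genes passing both the fold-change and p-value cut-offs."""
    return [
        name
        for name, lfc, p in genes
        if abs(lfc) >= min_abs_lfc and p <= max_p
    ]


strict = select_de(GENES, min_abs_lfc=1.0, max_p=0.05)
relaxed = select_de(GENES, min_abs_lfc=0.9, max_p=0.055)
print("strict :", strict)
print("relaxed:", relaxed)
```

With these invented values, the strict setting selects three genes and the slightly relaxed setting selects five, so any method that takes only the DE list as input would be run on noticeably different inputs.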

Usually, pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds and edges to represent reactions that transform one or more compound(s) into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges, rather than with nodes as they are in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally, and arguably, underdeveloped (only 128 Google Scholar citations to date for the original ScorePAGE paper; Rahnenfuhrer et al., 2004), and there are not many other methods available for metabolic pathway analysis. However, the topology-based analysis of signaling pathways has been very successful (over 1,200 citations to date for the two impact analysis papers mentioned above; Draghici et al., 2007; Tarca et al., 2009), and over 30 other methods have been developed since.

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays, rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.

Publicly available curated pathway databases used by methods listed in Table 8.25.2 are KEGG (Ogata et al., 1999), NCI-PID (Schaefer et al., 2009), BioCarta (https://www.biocarta.com), WikiPathways (Pico et al., 2008), PANTHER (Mi et al., 2005), and Reactome (Joshi-Tope et al., 2005). These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG includes only about 5,000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive or negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation when the bait-prey relation is considered (top), or undirected when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database or may accept several.
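To make the idea of an adjacency matrix for a directed pathway graph concrete, here is a minimal Python sketch. The three-edge toy cascade is illustrative only and is not extracted from KEGG; the function name and the row-as-source, column-as-target convention are assumptions, and a particular tool may well use a different layout or one matrix per interaction type.

```python
def adjacency_matrix(edges):
    """Build a 0/1 adjacency matrix from directed (source, target) pairs.

    Assumed convention: matrix[i][j] == 1 means nodes[i] -> nodes[j].
    """
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return nodes, matrix


# Toy directed cascade (illustrative, not a curated pathway).
EDGES = [("EGFR", "GRB2"), ("GRB2", "SOS1"), ("SOS1", "KRAS")]

nodes, matrix = adjacency_matrix(EDGES)
print(nodes)
for name, row in zip(nodes, matrix):
    print(f"{name:5s}", row)
```

Because the matrix is not symmetric, the direction of each interaction is preserved, which is what distinguishes this representation from a plain gene list.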

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than single-type models, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe "all" and "any" relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connection only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find "important" submodules, i.e., those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the research hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is re-tested on each of them.
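The moralization step described above can be sketched in a few lines of Python. This is a generic illustration of the textbook definition of a moral graph, not the TopologyGSA implementation; the function name and the toy DAG are assumptions made for this example.

```python
from itertools import combinations


def moralize(dag_edges):
    """Return the undirected edge set of the moral graph of a DAG.

    `dag_edges` is an iterable of (parent, child) pairs. Each undirected
    edge is represented as a frozenset of its two endpoints.
    """
    parents = {}
    undirected = set()
    for parent, child in dag_edges:
        parents.setdefault(child, set()).add(parent)
        # Drop the direction of every original edge.
        undirected.add(frozenset((parent, child)))
    # "Marry" every pair of parents that share a child.
    for child_parents in parents.values():
        for a, b in combinations(sorted(child_parents), 2):
            undirected.add(frozenset((a, b)))
    return undirected


# Toy DAG: A and B both regulate C, and C regulates D.
DAG = [("A", "C"), ("B", "C"), ("C", "D")]

for edge in sorted(tuple(sorted(e)) for e in moralize(DAG)):
    print(edge)
```

In the toy DAG, the only added edge is the one between A and B, the two parents of C; the three original edges are kept without direction. Triangulation into a decomposable graph, which the text mentions as the next step, is not covered by this sketch.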

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model, and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for the categories of scoring algorithms.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.

Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian network, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

Node level scoring

Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures, (iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the network adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
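Two of the centrality measures above, degree centrality and the distance sum that underlies closeness, can be sketched in a few lines of Python. The sketch is illustrative only and does not reproduce the scoring of any surveyed tool; the function names and the toy pathway are assumptions, and the distance sum treats edges as undirected for simplicity.

```python
from collections import deque


def degree_centrality(edges, nodes):
    """Count the directed edges that enter and leave each node."""
    degree = {n: 0 for n in nodes}
    for src, dst in edges:
        degree[src] += 1  # outgoing edge of src
        degree[dst] += 1  # incoming edge of dst
    return degree


def distance_sums(edges, nodes):
    """Sum of shortest-path distances from each node to every reachable node.

    Edges are treated as undirected here (a simplifying assumption).
    """
    neighbors = {n: set() for n in nodes}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    sums = {}
    for start in nodes:
        # Breadth-first search gives shortest distances in unweighted graphs.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in neighbors[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        sums[start] = sum(dist.values())
    return sums


# Toy pathway: A activates B, and B activates C and D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("B", "D")]
degree = degree_centrality(edges, nodes)
closeness_sum = distance_sums(edges, nodes)
```

In this toy pathway, the hub B has the highest degree and the smallest distance sum, which is the behavior that centrality-based node scores are meant to capture.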

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profile across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
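As a minimal sketch of the pair score described above, the following Python code divides the Pearson correlation of two expression profiles by the shortest path distance between the two genes. It is not the ScorePAGE or PWEA implementation; the function names, the choice of Pearson correlation as the similarity measure, and the toy data are assumptions for illustration.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def shortest_path_length(neighbors, source, target):
    """Breadth-first search distance; None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return dist[cur]
        for nxt in neighbors.get(cur, ()):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return None


def pair_score(expr, neighbors, g1, g2):
    """Similarity of two genes divided by their shortest path distance."""
    d = shortest_path_length(neighbors, g1, g2)
    if d is None or d == 0:
        return 0.0  # unreachable pair or identical gene: no contribution
    return pearson(expr[g1], expr[g2]) / d


# Hypothetical expression profiles across four samples.
expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
score_ab = pair_score(expr, neighbors, "A", "B")
score_ac = pair_score(expr, neighbors, "A", "C")
```

Here A and B are adjacent and perfectly correlated, so their score is close to 1, while A and C are perfectly anti-correlated at distance 2, so their score is close to -0.5.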

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions, such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all its genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
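The TAPPA-style aggregation described above can be sketched as follows for a single sample. This is a simplified illustration rather than the published implementation: the function name is hypothetical, node scores are taken as the square root of the absolute normalized value, and the sign of the sum of the two node values is assumed as the weight of a connected pair.

```python
from math import sqrt


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def tappa_like_score(values, edges):
    """Non-linear pathway score for one sample.

    `values` maps each gene to its normalized log expression.
    Pairs without an edge have weight 0, so only edges are visited.
    """
    score = 0.0
    for u, v in edges:
        # Assumed weight: +1 for joint up-regulation, -1 for joint
        # down-regulation (sign of the summed node values).
        weight = sign(values[u] + values[v])
        score += weight * sqrt(abs(values[u])) * sqrt(abs(values[v]))
    return score


# Hypothetical normalized values for a three-gene pathway A - B - C.
values = {"A": 4.0, "B": 1.0, "C": -9.0}
edges = [("A", "B"), ("B", "C")]
score = tappa_like_score(values, edges)
```

In this example, the pair A-B contributes +2 and the pair B-C contributes -3, so the pathway score is -1.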

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes, and meaningful changes that are unlikely to occur just by chance and that, therefore, are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
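The text above does not prescribe a particular correction for multiple comparisons. As one common choice, the Benjamini-Hochberg false discovery rate adjustment can be sketched in Python as follows; the function name and the example p-values are assumptions for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
```

A pathway would then be reported as significant when its adjusted p-value falls below the user-defined threshold.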

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at gene-set level, by permuting the values assigned to the genes in the set.
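When the pathway score is the number of DE genes on the pathway, the hypergeometric p-value mentioned above is the probability of observing at least that many DE genes by chance. A minimal Python sketch follows; the function and parameter names are hypothetical, and the numbers in the example are made up.

```python
from math import comb


def hypergeom_pvalue(n_total, n_de, pathway_size, de_in_pathway):
    """Upper-tail hypergeometric probability P(X >= de_in_pathway).

    n_total       number of measured genes
    n_de          number of DE genes among them
    pathway_size  number of measured genes on the pathway
    de_in_pathway observed number of DE genes on the pathway
    """
    upper = min(n_de, pathway_size)
    denominator = comb(n_total, n_de)
    tail = sum(
        comb(pathway_size, k) * comb(n_total - pathway_size, n_de - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / denominator


# 20 measured genes, 5 of them DE; a 4-gene pathway contains 3 DE genes.
p = hypergeom_pvalue(20, 5, 4, 3)
```

Note that this calculation treats genes as independent draws, which is exactly the assumption criticized in the paragraph above.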

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable assigned to an edge is the probability that up or down regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probability follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions, rather than the genes, and uses a regression model to detect differential regulation. DRAGEN fits two linear models to each edge of a pathway (one for case and one for control), and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses a Petri net (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and a Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
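To make the p-value combination step concrete, the following Python sketch implements the classical, unweighted Fisher's method. It is not DRAGEN's weighted variant, and it uses the chi-square reference distribution rather than the permutation null described above; the function name is an assumption, and all p-values are assumed to lie in (0, 1].

```python
from math import exp, log


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with the unweighted Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null, for k independent p-values.
    """
    k = len(pvalues)
    half_statistic = -sum(log(p) for p in pvalues)  # statistic / 2
    # Chi-square survival function for even degrees of freedom (2k):
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_statistic / i
        total += term
    return exp(-half_statistic) * total


# Hypothetical edge-level p-values for one pathway.
combined = fisher_combined_pvalue([0.5, 0.5])
```

With a single input, the combined p-value equals that input, which is a quick sanity check of the closed-form survival function used here.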

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000 despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations, so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis: Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
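The random relabeling procedure above can be sketched in Python as follows. This is an illustration of the sampling scheme only, not the code used to produce the reported results; the function name, the sample identifiers, and the fixed seed are assumptions, and the pathway analysis step itself is omitted.

```python
import random


def random_label_splits(sample_ids, n_control, n_disease, n_datasets, seed=0):
    """Draw random control/disease groups from a pool of control samples.

    Returns a list of (control_ids, disease_ids) tuples. Because every
    sample is actually a control, any detected difference is due to chance.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility
    splits = []
    for _ in range(n_datasets):
        chosen = rng.sample(sample_ids, n_control + n_disease)
        splits.append((chosen[:n_control], chosen[n_control:]))
    return splits


# Hypothetical identifiers for the 140 pooled control samples.
samples = [f"S{i}" for i in range(140)]
balanced = random_label_splits(samples, 70, 70, n_datasets=3)
unbalanced = random_label_splits(samples, 10, 20, n_datasets=3)
```

Each simulated dataset would then be analyzed with the pathway analysis method under study, and the resulting pathway p-values collected to form the empirical null distribution.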

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3), and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
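One simple way to quantify how far a pathway's null p-values depart from uniformity is the Kolmogorov-Smirnov distance between their empirical distribution and the Uniform(0, 1) distribution. The sketch below is an assumption-laden illustration and is not the analysis behind Figure 8.25.4; the function name and the example p-values are hypothetical.

```python
def ks_uniform_statistic(pvalues):
    """Kolmogorov-Smirnov distance between the empirical CDF of the
    p-values and the CDF of the Uniform(0, 1) distribution."""
    xs = sorted(pvalues)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs, start=1):
        # Compare the uniform CDF value x with the empirical CDF just
        # after (i / n) and just before ((i - 1) / n) the observation.
        distance = max(distance, i / n - x, x - (i - 1) / n)
    return distance


# Hypothetical null p-values for two pathways.
biased_towards_zero = [0.01, 0.02, 0.03, 0.04]
roughly_uniform = [0.125, 0.375, 0.625, 0.875]
d_biased = ks_uniform_statistic(biased_towards_zero)
d_uniform = ks_uniform_statistic(roughly_uniform)
```

A distance near zero is consistent with uniform p-values, whereas a large distance flags a pathway whose null p-values pile up near zero or near one.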

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex bio-molecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways", in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway that describes the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015) using a different set of 36 real datasets, as well as by other, more recent papers.
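The target pathway criterion can be made concrete with a short Python sketch that extracts the rank and p-value of the target pathway from the output of a method. The function name, the pathway names, and the p-values are hypothetical and serve only as an illustration of the evaluation idea.

```python
def target_pathway_rank(pvalues, target):
    """Return (rank, p-value) of the target pathway in a method's output.

    `pvalues` maps pathway names to p-values; rank 1 is the most
    significant pathway. Ties are broken by the original order.
    """
    ranked = sorted(pvalues, key=pvalues.get)
    return ranked.index(target) + 1, pvalues[target]


# Hypothetical output of one method on a colon cancer dataset.
method_output = {
    "Colorectal cancer": 0.002,
    "Cell cycle": 0.0005,
    "Insulin signaling pathway": 0.3,
}
rank, pvalue = target_pathway_rank(method_output, "Colorectal cancer")
```

Repeating this for every benchmark dataset and summarizing the ranks and p-values across datasets gives one number per method that can be compared on the very same data.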

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated with not only one target pathway, but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34existing network-based pathway analysis ap-proaches from different perspectives includ-ing experiment input graphical representationof pathways in knowledge bases and statisti-cal approaches to assess pathway significance

Although widely used, DE genes as input make the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to the complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
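One simple way to check whether the null p-values of a pathway depart from uniformity is a one-sample Kolmogorov-Smirnov comparison against Uniform(0, 1). The sketch below is only an illustration of that idea, not the procedure used in the study mentioned above: the function names are hypothetical, the input p-values are synthetic, and the decision rule uses the large-sample 5% critical value 1.36/sqrt(n), which is an approximation.

```python
import math


def ks_uniform(pvalues):
    """One-sample Kolmogorov-Smirnov statistic D against Uniform(0, 1)."""
    p = sorted(pvalues)
    n = len(p)
    if n == 0:
        raise ValueError("need at least one p-value")
    # Largest gap between the empirical CDF and the uniform CDF F(x) = x,
    # checked just after and just before each observed p-value.
    d_plus = max((i + 1) / n - x for i, x in enumerate(p))
    d_minus = max(x - i / n for i, x in enumerate(p))
    return max(d_plus, d_minus)


def looks_non_uniform(pvalues, coeff=1.36):
    """Flag non-uniformity using the asymptotic 5% critical value coeff / sqrt(n)."""
    return ks_uniform(pvalues) > coeff / math.sqrt(len(pvalues))


# Synthetic null p-values for one pathway over n control comparisons.
n = 200
uniform_like = [(i + 0.5) / n for i in range(n)]   # evenly spread over (0, 1)
biased_to_zero = [x ** 2 for x in uniform_like]    # mass pushed towards 0
biased_to_one = [math.sqrt(x) for x in uniform_like]  # mass pushed towards 1

for name, sample in [("uniform-like", uniform_like),
                     ("biased to zero", biased_to_zero),
                     ("biased to one", biased_to_one)]:
    print(name, round(ks_uniform(sample), 4), looks_non_uniform(sample))
```

A pathway flagged as biased towards zero would be prone to false positives and one biased towards one to false negatives, which is the failure mode described in the paragraph above.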

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088

Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint, arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions - applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach - applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0 - making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.


Table 8.25.1 Pathway Analysis Tools

| Method | Availability [a] | HIPAA [b] | License [c] | Code [d] | Year | Reference |
|---|---|---|---|---|---|---|
| Signaling pathway analysis using gene expression | | | | | | |
| MetaCore [e] | Web (https://www.genego.com/metacore.php) | No | Thomson Reuters | Java | 2004 | NA |
| Pathway-Express | Standalone (Bioconductor); Web (http://vortex.cs.wayne.edu), superseded by ROntoTools | No | Free [f] | Java, R | 2005 | Draghici et al., 2007; Khatri et al., 2007 |
| PathOlogist | Standalone (ftp://ftp1.nci.nih.gov/pub/pathologist) | No | Free | MATLAB | 2007 | Efroni, Schaefer, & Buetow, 2007; Greenblum, Efroni, Schaefer, & Buetow, 2011 |
| iPathwayGuide [e] | Web (https://www.advaitabio.com/products.html) | Yes | Advaita Corp. | Java, R | 2009 | NA |
| SPIA | Standalone (Bioconductor) | No | GPL (2) | R | 2009 | Tarca et al., 2009 |
| NetGSA | Standalone (https://www.biostat.washington.edu/~ashojaie/software) | No | GPL-2 | R | 2009 | Shojaie & Michailidis, 2009; Shojaie & Michailidis, 2010 |
| PWEA | Standalone (https://zlab.bu.edu/PWEA) | No | Free [f] | C++ | 2010 | Hung et al., 2010 |
| TopoGSA | Web (https://www.infobiotics.net/topogsa) | No | Free [f] | PHP, R | 2010 | Glaab, Baudot, Krasnogor, & Valencia, 2010 |
| TopologyGSA | Standalone (CRAN) | No | AGPL-3 | R | 2010 | Massa, Chiogna, & Romualdi, 2010 |
| DEGraph | Standalone (Bioconductor) | No | GPL-3 | R | 2010 | Jacob, Neuvial, & Dudoit, 2010 |
| GGEA | Standalone (Bioconductor) | No | Artistic-2.0 | R | 2011 | Geistlinger, Csaba, Kuffner, Mulder, & Zimmer, 2011 |
| BPA | Standalone (https://bumil.boun.edu.tr/bpa) | No | Free [f] | MATLAB | 2011 | Isci, Ozturk, Jones, & Otu, 2011 |
| GANPA | Standalone (CRAN) | No | GPL-2 | R | 2011 | Fang, Tian, & Ji, 2011 |
| ROntoTools [e] | Standalone (Bioconductor) | No | CC BY-NC-ND 4 | R | 2012 | Voichita et al., 2012 |
| BAPA-IGGFD | No implementation available | No | NA | R | 2012 | Zhao et al., 2012 |
| CePa | Standalone (CRAN); Web (https://mcube.nju.edu.cn/cgi-bin/cepa/main.pl) | No | GPL (2) | R | 2012 | Gu, Liu, Cao, Zhang, & Wang, 2012 |
| THINK-Back-DS | Standalone; Web (https://eecs.umich.edu/db/think/software.html) | No | Free [f] | Java | 2012 | Farfan, Ma, Sartor, Michailidis, & Jagadish, 2012 |
| TBScore | No implementation available | No | NA | NA | 2012 | Ibrahim, Jassim, Cawthorne, & Langlands, 2012 |
| ACST | Standalone (available as article supplemental) | No | NA | R | 2012 | Mieczkowski, Swiatek-Machado, & Kaminska, 2012 |
| EnrichNet | Web (https://www.enrichnet.org) | No | Free | PHP | 2012 | Glaab, Baudot, Krasnogor, Schneider, & Valencia, 2012 |
| clipper | Standalone (https://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2013 | Martini, Sales, Massa, Chiogna, & Romualdi, 2013 |
| DEAP | Standalone (available as article supplemental) | No | GNU Lesser GPL | Python | 2013 | Haynes, Higdon, Stanberry, Collins, & Kolker, 2013 |
| DRAGEN | Standalone (https://bioinfo.au.tsinghua.edu.cn/dragen) | No | NA | C++ | 2014 | Ma, Jiang, & Jiang, 2014 |
| ToPASeq [g] | Standalone (Bioconductor) | No | AGPL-3 | R | 2016 | Ihnatova & Budinska, 2015 |
| pDis | Standalone (Bioconductor) | No | Free [f] | R | 2016 | Ansari, Voichita, Donato, Tagett, & Draghici, 2017 |
| SPATIAL | No implementation available | No | NA | NA | 2016 | Bokanizad, Tagett, Ansari, Helmi, & Draghici, 2016 |
| BLMA [h] | Standalone (Bioconductor) | No | GPL (2) | R | 2017 | Nguyen, Tagett, Donato, Mitrea, & Draghici, 2016; also see Internet Resources |
| Signaling pathway analysis using multiple types of data | | | | | | |
| PARADIGM | Standalone (https://sbenz.github.com/Paradigm) | No | UCSC-CGB, free [f] | C | 2010 | Vaske et al., 2010 |
| microGraphite | Standalone (https://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2014 | Calura et al., 2014 |
| mirIntegrator | Standalone (Bioconductor) | No | GPL-3 | R | 2016 | Diaz et al., 2016; Nguyen et al., 2016 |
| Metabolic pathway analysis | | | | | | |
| ScorePAGE | No implementation available | No | NA | NA | 2004 | Rahnenfuhrer et al., 2004 |
| TAPPA | No implementation available | No | NA | NA | 2007 | Gao & Wang, 2007 |
| MetPA | Web (https://metpa.metabolomics.ca) | No | GPL (2) | PHP, R | 2010 | Xia & Wishart, 2010 |
| MetaboAnalyst | Web (https://metpa.metabolomics.ca) | No | GPL (2) | R | 2011 | Xia & Wishart, 2011; Xia, Sinelnikov, Han, & Wishart, 2015 |

[a] Availability is a criterion that describes the implementation of the method as standalone or Web-based.
[b] HIPAA provides information about HIPAA compliance.
[c] License provides information about the type of the software license. GPL is an abbreviation for the GNU General Public License; AGPL is an abbreviation for the GNU Affero General Public License; CC BY-NC-ND 4 is an abbreviation of Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License.
[d] Code shows the programming language used for the method implementation.
[e] Commercial method.
[f] Free for academic and non-commercial use. UCSC-CGB is the University of California Santa Cruz Cancer Genome Browser.
[g] ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA.
[h] BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

by chance. The drawbacks of this type of approach include: (i) it only considers the number of DE genes and completely ignores the magnitude of the actual expression changes, resulting in information loss; (ii) it assumes that genes are independent, which they are not (since the pathways are graphs that describe precisely how these genes influence each other); and (iii) it ignores the interactions between various genes and modules. Functional Class Scoring (FCS) approaches, such as Gene Set Enrichment Analysis (GSEA; Subramanian et al., 2005) and Gene Set Analysis (GSA; Efron & Tibshirani, 2007), have been developed to address some of the issues raised by ORA approaches. The main improvement of FCS is the observation that small but coordinated changes in expression of functionally related genes can have significant impact on pathways.
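To make drawback (i) concrete, an ORA-style p-value can be sketched with a hypergeometric tail. This is a generic textbook formulation written for illustration, assuming a one-sided test; the function name and the gene counts are hypothetical and are not taken from any specific tool surveyed here.

```python
from math import comb


def ora_pvalue(n_genes, n_pathway, n_de, n_de_in_pathway):
    """Probability of seeing at least `n_de_in_pathway` DE genes in the pathway by chance.

    n_genes          total number of measured genes
    n_pathway        number of measured genes that belong to the pathway
    n_de             number of DE genes in the experiment
    n_de_in_pathway  number of DE genes that belong to the pathway
    """
    upper = min(n_pathway, n_de)
    total = comb(n_genes, n_de)
    # Sum the hypergeometric probabilities for k = observed, ..., maximum possible.
    tail = sum(
        comb(n_pathway, k) * comb(n_genes - n_pathway, n_de - k)
        for k in range(n_de_in_pathway, upper + 1)
    )
    return tail / total


# Hypothetical experiment: 20,000 measured genes, 500 DE, pathway of 100 genes,
# 10 of which are DE.
print(ora_pvalue(20000, 100, 500, 10))
```

Only four counts enter the calculation: neither the fold changes of the DE genes nor the edges of the pathway graph appear anywhere in the function.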

ORA and FCS approaches are often referred to as gene set enrichment methods. Comprehensive lists of gene set analysis approaches, as well as comparisons between them, can be found in well-developed surveys (Emmert-Streib & Glazko, 2011; Kelder, Conklin, Evelo, & Pico, 2010; Khatri, Sirota, & Butte, 2012; Misman et al., 2009). While useful for the purpose for which they have been developed (to analyze sets of genes), these methods completely ignore the topology and interaction between genes. Topology-based approaches, which fully exploit all the knowledge about how genes interact as described by pathways, have been developed more recently. The first such techniques were ScorePAGE (Rahnenfuhrer, Domingues, Maydt, & Lengauer, 2004) for metabolic pathways and Impact Analysis (Draghici et al., 2007; Tarca et al., 2009) for signaling pathways. Figure 8.25.1 shows an example pathway, the ERBB signaling pathway, in which the nodes represent genes and compounds, while the edges represent the known interactions between the compounds. Gene set analysis approaches are not able to account for the topological order of genes, nor are they able to explain the signal propagation and mechanisms of the pathway. These approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discovery. In contrast, network-based pathway analysis can distinguish between this pathway and any other pathway with the same proportion of DE genes.
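The contrast drawn above can be illustrated with a toy example. The score below is a deliberately simplified stand-in, assumed only for illustration (each DE gene counts 1 plus the number of genes downstream of it); it is not the scoring scheme of Impact Analysis, SPIA, or any other surveyed method, and the two four-gene graphs are hypothetical.

```python
def downstream(graph, gene):
    """Set of genes reachable from `gene` by following directed edges."""
    seen, stack = set(), [gene]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(gene)  # ignore the gene itself if it lies on a cycle
    return seen


def toy_topology_score(graph, de_genes):
    """Toy score: each DE gene contributes 1 plus the number of genes it can reach."""
    return sum(1 + len(downstream(graph, g)) for g in de_genes)


def de_fraction(graph, de_genes):
    """Proportion of pathway genes that are DE (all a gene set method would see)."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    return len(set(de_genes) & nodes) / len(nodes)


# Two hypothetical pathways with the same genes but opposite wiring.
cascade = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}   # A is the entry point
rewired = {"D": ["C"], "C": ["B"], "B": ["A"], "A": []}   # A is the last node
de = ["A"]

print(de_fraction(cascade, de), de_fraction(rewired, de))
print(toy_topology_score(cascade, de), toy_topology_score(rewired, de))
```

Both graphs have the same proportion of DE genes, so a gene set method cannot tell them apart, whereas even this crude topology-aware score separates a perturbed entry point from a perturbed terminal node.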

Here, we provide a survey of 34 network-based methods developed for pathway analysis. Our survey of commercial tools for pathway analysis found iPathwayGuide (Advaita Corporation, https://www.advaitabio.com) and MetaCore (Thomson Reuters, https://www.thomsonreuters.com) to be topology based. Other commercial tools, such as Ingenuity Pathway Analysis (https://www.ingenuity.com) or Genomatix (https://www.genomatix.de), do not use pathway topology information and thus are not included in this review. The existence of only two commercial tools is more evidence of the challenge faced by the developers of such methods due to lack of standards.

In this document, we categorize and compare the methods based on the following criteria: the type of input the method accepts, the graph model, and the statistical approaches used to evaluate gene and pathway changes. Under the heading "Experiment Input and Pathway Databases," below, we describe the types of input the surveyed methods use. Under the heading "Graph Models," we provide details regarding the graph models used for the biological networks. Under "Pathway Scoring Strategies," we discuss the statistical methods used by the surveyed methods to assess gene and pathway changes between two phenotypes. Under "Challenges in Pathway Analysis," we discuss a number of outstanding challenges that need to be addressed in order to improve the reliability as well as the relevance of the next-generation pathway analysis approaches. We discuss the key elements and limitations of the surveyed methods without going into the details of each technique. For a more detailed review and technical descriptions of network-based pathway analysis methods, please refer to Mitrea et al. (2013) or the referenced manuscripts (Table 8.25.1).

NETWORK-BASED PATHWAY ANALYSIS

The term pathway analysis is used in a very broad context in the literature, including biological network construction and inference. Here, we focus on methods that are able to exploit biological knowledge in public repositories, rather than on network inference approaches that attempt to infer or reconstruct pathways from molecular measurements. Table 8.25.1 shows the list of 34 pathway analysis approaches, together with their availability and licensing. In this section, we start by describing the availability of each method before discussing the input format, graph model, and underlying statistical approaches.

Software Availability and Implementation

We often think that the main strength of an approach lies in its novelty and algorithm efficiency. However, the implementation and availability of a tool have become increasingly important for several reasons. First, software availability and version control are crucial for reproducing the experimental results that were used to assess the performance of the approach (Sandve, Nekrutenko, Taylor, & Hovig, 2013). For this reason, many journals request authors to make their software and data available before accepting methods articles. Second, if a software application is not ready-to-run, it is very unlikely that the intended audience (mostly life scientists) will invest the time to understand and implement complex algorithms. Practicality, user-friendliness, output format, and type of interface are all to be considered. Depending on the desired availability and intended audience, a software package may be implemented as standalone or Web based. Among the 34 approaches, there are 30 that are available either as a standalone software package or Web service.

Typically, standalone tools need to be installed on local machines or servers, which often requires some administrative skills. Most standalone tools depend on full or partial copies of public pathway databases stored locally, and need to be updated periodically. Advantages of standalone tools include: (i) instant availability that does not require Internet access, and (ii) the security and privacy of the experimental data. Web-based tools, on the other hand, run the analyses on a remote server, providing computational power and a graphical interface. The major advantage of Web-based tools is that they are user-friendly and do not require a separate local installation. From the accessibility perspective, Web-based tools have the advantage of being available from any location, as long as there is an Internet connection and a browser available. Also, the update is almost transparent to the client. This makes the user's task easy and enables collaboration, since users all over the world can utilize the same method without the burden of installing it or keeping it up-to-date. There are methods that provide both Web-based and standalone implementations.

HIPAA compliance may also be a factor in certain applications that involve data coming from patients, or data linked to other clinical data or clinical records. Currently, iPathwayGuide is the only tool available that can do topological pathway analysis and is HIPAA compliant.

The programming language and style used for implementation also play an important role in the acceptance of a method. Software tools that are neatly implemented and packaged are more appealing compared to those that do not have ready-to-use implementations. Many of the methods are implemented in the R programming language and are available as software packages from Bioconductor, CRAN, or the author's Web site. Their popularity among biologists and bioinformaticians is due to the fact that many bioinformatics-dedicated packages are available in R. In addition, the rigorous review procedure provided by Bioconductor and CRAN makes the software packages more standardized and reliable.

Experiment Input and Pathway Databases

A pathway analysis approach typically requires two types of input: experiment data and known biological networks. Experiment data is usually collected from high-throughput experiments that compare a condition phenotype to a control phenotype. A condition phenotype can be a disease, a drug treatment, or the knock-out (KO) of a gene, while a control phenotype can be a healthy state, a different disease or drug treatment, or wild-type (non-KO) samples. Experimental data can be obtained from multiple technologies that produce different types of data: gene expression, protein abundance, metabolite concentration, miRNA expression, etc. Biological networks or pathways are often represented in the form of graphs that capture our current knowledge about the interactions of genes, proteins, metabolites, or compounds in an organism. The pathway data is accumulated, updated, and refined by amassing knowledge from scientific literature describing individual interactions or high-throughput experiment results.

Table 8.25.2 shows a summary of the input format and mathematical modeling of the surveyed approaches. Most pathway analysis methods analyze data from high-throughput experiments, such as microarrays, next-generation sequencing, or proteomics. They accept either a list of gene IDs or a list of such gene IDs associated with measured changes. These changes could be measured with different technologies and therefore can serve as proxies for different biochemical entities. For instance, one could use gene expression changes measured with microarrays, or protein levels measured with a proteomic approach, etc.

Different analysis methods use different input formats. Many methods accept a list of all genes considered in the experiment, together with their expression values. Some analysis methods select a subset of genes considered to be differentially expressed (DE) based on a predefined cut-off. The cut-off is typically applied on fold-change, p-value, or both. These methods use the list of DE genes and their corresponding statistics (fold-change, p-value) as input. Other methods use only the list of DE genes without corresponding expression values, because their scoring methods are based only on the relative positions of the genes in the graph.

Methods which use cut-offs are sensitive to the chosen threshold value, because a small change in the cut-off may drastically change the number of selected genes (Nam & Kim, 2008). In addition, they typically use the most significant genes and discard the rest, whose weaker but coordinated changes may also have significant impact on pathways. Genes with moderate differential expression may be lost, even though they might be important players in the impacted pathways (Ben-Shaul, Bergman, & Soreq, 2005). Furthermore, the genes included in the set of DE genes can vary dramatically if the selection methods are changed. Hence, the results of pathway analyses based on DE genes may be vastly different depending on both the selection method and the threshold value (Pan, Lih, & Cohen, 2005). Furthermore, for the same disease,


Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods

Method | Experiment input^a | Pathway database^b | Graph model^c | Pathway scoring

Signaling pathway analysis using gene expression
MetaCore | DE genes | Proprietary canonical pathway, genome-scale network | Single-type, directed | Hierarchically aggregated
Pathway-Express | DE genes, change or measured genes, changed | KEGG signaling | Single-type, directed | Hierarchically aggregated
PathOlogist | Measured genes, expression | KEGG | Multi-type, directed | Hierarchically aggregated
iPathwayGuide | DE genes, change or measured genes, change | KEGG signaling, Reactome, NCI, BioCarta | Single-type, directed | Hierarchically aggregated
SPIA | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated
NetGSA | Measured genes, expression | KEGG signaling | Single-type, directed | Multivariate analysis
PWEA | Measured genes, expression | YeastNet | Single-type, undirected | Hierarchically aggregated
TopoGSA | DE genes | PPI network, KEGG | Single-type, undirected | Hierarchically aggregated
TopologyGSA | Measured genes, expression | NCI-PID | Single-type, undirected | Multivariate analysis
DEGraph | Measured genes, expression | KEGG | Single-type, undirected | Multivariate analysis
GGEA | Measured genes, expression | KEGG | Single-type, directed | Aggregate fuzzy similarity
BPA | Measured genes, expression, with cut-off | NCI-PID | Single-type, DAG | Bayesian network
GANPA | DE genes, change or measured genes, expression | PPI network, KEGG, Reactome, NCI-PID, HumanCyc | Single-type, undirected | Hierarchically aggregated
ROntoTools | DE genes, change or measured gene, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated
BAPA-IGGFD | Measured genes, expression, with cut-off | Literature-based interaction database, KEGG, WikiPathways, Reactome, MSigDB, GO BP, PANTHER; constructed gene association network from PPIs, co-annotation in GO Biological Process (BP), and co-expression in microarray data | Single-type, DAG | Bayesian network
CePa | DE genes, expression or measured genes, expression | NCI-PID | Single-type, directed | Hierarchically aggregated
THINK-Back-DS | DE genes, change; measured genes, expression | KEGG, PANTHER, BioCarta, Reactome, GenMAPP | Single-type, directed | Hierarchically aggregated
TBScore | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated
ACST | Measured genes, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated
EnrichNet | DE genes, list | PPI network, KEGG, BioCarta, WikiPathways, Reactome, NCI-PID, InterPro, GO, with STRING 9.0 | Single-type, undirected | Hierarchically aggregated
clipper | Measured genes, expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis
DEAP | Measured genes, expression | KEGG, Reactome | Single-type, directed | Hierarchically aggregated
DRAGEN | Measured genes, expression | RegulonDB, M3D, HTRIdb, ENCODE, MSigDB | Single-type, directed | Linear regression
ToPASeq^e | Measured genes, expression | KEGG | Single-type, directed | Hierarchical & multivariate
pDis | DE genes, change or measured genes, changed | KEGG signaling | Single-type, directed | Hierarchically aggregated
SPATIAL | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated
BLMA^f | Measured genes, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated

Signaling pathway analysis using multiple types of data
PARADIGM | Measured genes, expression, copy number, and proteins levels | Constructed PPI networks from MIPS, DIP, BIND, HPRD, IntAct, and BioGRID | Multi-type, directed | Hierarchically aggregated
microGraphite | Measured gene expression and miRNA expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis
mirIntegrator | DE genes, change and DE miRNA, change | KEGG signaling, miRTarBase | Single-type, directed | Hierarchically aggregated

Metabolic pathway analysis
ScorePAGE | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated
TAPPA | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated
MetPA | DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated
MetaboAnalyst | DE genes, change and DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated

^a Experiment input shows the experiment data input format for each method. “DE” means differentially expressed. “Change” means fold-change value or t-statistics when comparing gene/metabolite values between two phenotypes. “Measured” means the list of all the genes/metabolites measured in the experiment. “List” represents a list of gene/metabolite identifiers (e.g., symbols). “With cut-off” shows methods that take as input the list of all measured genes and, in the analysis, mark the DE genes.
^b Pathway database shows the name of the knowledge source for biological interactions.
^c Graph model is a characteristic that shows if the graph has one or multiple types of nodes, as well as if it is directed or undirected.
^d The package ROntoTools (Bioconductor) allows for analysis using expression values of all genes.
^e ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA.
^f BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

independent studies or measurements often produce different sets of differentially expressed (DE) genes (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Ein-Dor, Zuk, & Domany, 2006; Tan et al., 2003). This makes approaches that use DE genes as input appear even more unreliable.
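The threshold sensitivity discussed above can be illustrated with a minimal sketch. The gene names, the statistics, and the helper `select_de` are all hypothetical and chosen only for illustration; they do not come from any of the surveyed tools.

```python
# Hypothetical per-gene statistics: gene -> (log2 fold-change, p-value).
GENE_STATS = {
    "G1": (2.10, 0.001),
    "G2": (1.05, 0.030),
    "G3": (0.98, 0.040),
    "G4": (-1.02, 0.049),
    "G5": (-0.95, 0.051),
    "G6": (0.40, 0.200),
}


def select_de(stats, min_abs_lfc, max_p):
    """Return the sorted genes that pass both the fold-change and p-value cut-offs."""
    return sorted(
        gene
        for gene, (lfc, p_value) in stats.items()
        if abs(lfc) >= min_abs_lfc and p_value <= max_p
    )


# Two cut-offs that differ only slightly select different DE gene lists.
STRICT = select_de(GENE_STATS, min_abs_lfc=1.0, max_p=0.05)
RELAXED = select_de(GENE_STATS, min_abs_lfc=0.9, max_p=0.06)
print(STRICT)   # ['G1', 'G2', 'G4']
print(RELAXED)  # ['G1', 'G2', 'G3', 'G4', 'G5']
```

In this toy table, a shift of 0.1 in the fold-change cut-off and 0.01 in the p-value cut-off grows the DE list from three to five genes, which would change the input of any method that consumes only the DE list.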

Usually, pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products, and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds, and edges to represent reactions that transform one or more compound(s) into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges rather than nodes, as in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally and arguably underdeveloped (only 128 Google Scholar citations to date for the original ScorePAGE paper; Rahnenführer et al., 2004), and there are not many other methods available for metabolic pathway analysis. However, the topology-based analysis of signaling pathways has been very successful (over 1,200 citations to date for the two impact analysis papers mentioned above; Draghici et al., 2007; Tarca et al., 2009), and over 30 other methods have been developed since.

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays, rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time, in the same tissue, or in the same part of the cell.

Publicly available curated pathway databases used by methods listed in Table 8.25.2 are KEGG (Ogata et al., 1999), NCI-PID (Schaefer et al., 2009), BioCarta (https://www.biocarta.com), WikiPathways (Pico et al., 2008), PANTHER (Mi et al., 2005), and Reactome (Joshi-Tope et al., 2005). These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks, but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG includes only about 5,000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive or negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation when the bait-prey relation is considered (top), or undirected when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database, or may accept several.
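As a minimal sketch of the parsing step described above, the following turns a list of directed interactions into an adjacency matrix. The node names, the edge list, and the function `adjacency_matrix` are hypothetical, and the row-to-column orientation is a choice made here for illustration; a given tool may expect the transposed convention or a different encoding of interaction types.

```python
def adjacency_matrix(nodes, edges):
    """Build a directed adjacency matrix: entry [i][j] is 1 if nodes[i] -> nodes[j]."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


# Hypothetical toy pathway, not taken from any database.
NODES = ["A", "B", "C", "D"]
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

ADJ = adjacency_matrix(NODES, EDGES)
for row in ADJ:
    print(row)
```

A matrix like this discards node and edge types, which is why it suits single-type graph models better than multi-type ones.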

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than single-type, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe “all” and “any” relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connection only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find “important” submodules, i.e., those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the null hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is re-tested on each of them.
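The moralization step defined in brackets above can be sketched in a few lines. This is an illustration of the general graph-theoretic definition only, not of the TopologyGSA implementation; the function name `moralize` and the toy DAG are hypothetical.

```python
from itertools import combinations


def moralize(dag_edges):
    """Return the undirected edge set of the moral graph of a DAG.

    dag_edges is an iterable of (parent, child) pairs.
    """
    dag_edges = list(dag_edges)
    parents = {}
    for parent, child in dag_edges:
        parents.setdefault(child, set()).add(parent)

    # Step 1: drop the direction of every original edge.
    moral = {frozenset(edge) for edge in dag_edges}
    # Step 2: "marry" every pair of parents that share a child.
    for shared_parents in parents.values():
        for u, v in combinations(sorted(shared_parents), 2):
            moral.add(frozenset((u, v)))
    return moral


# Hypothetical toy DAG: A and B are both parents of C, and C is a parent of D.
DAG = [("A", "C"), ("B", "C"), ("C", "D")]
MORAL = moralize(DAG)
print(sorted(tuple(sorted(edge)) for edge in MORAL))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D')]
```

The only edge added by moralization here is A-B, because A and B are the two parents of C.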

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.
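To illustrate why interaction nodes make AND and OR gates expressible, the toy sketch below evaluates interaction nodes from Boolean component states. It is deliberately simplified and is not the scoring used by PathOlogist or PARADIGM; the component names, the gate labels "all" and "any", and the function `interaction_active` are assumptions made for this example.

```python
def interaction_active(kind, inputs, state):
    """Evaluate one interaction node from the states of its input components."""
    values = [state[name] for name in inputs]
    if kind == "all":  # AND gate: every input component must be active
        return all(values)
    if kind == "any":  # OR gate: a single active input component is enough
        return any(values)
    raise ValueError(f"unknown interaction kind: {kind}")


# Hypothetical component states and interaction nodes of a bipartite graph.
STATE = {"ligand": True, "receptor": True, "kinase_1": False, "kinase_2": True}
INTERACTIONS = {
    "complex_formation": ("all", ["ligand", "receptor"]),
    "phosphorylation": ("any", ["kinase_1", "kinase_2"]),
    "dual_kinase_step": ("all", ["kinase_1", "kinase_2"]),
}

RESULT = {
    name: interaction_active(kind, inputs, STATE)
    for name, (kind, inputs) in INTERACTIONS.items()
}
print(RESULT)
```

A single-type graph would represent the two kinases only as two incoming edges, without recording whether both or just one of them is required.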

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model, and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for scoring algorithm categories.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.

Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in the multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian networks, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

Node level scoring

Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures, (iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the network adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
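Two of the centrality measures described above, degree centrality and the summed shortest-path distance that underlies closeness, can be sketched on a toy directed graph as follows. The graph and the function names are hypothetical, and the surveyed methods use their own variations of these measures. The sketch assumes every node appears as a key of the graph dictionary and sums distances only over nodes reachable from the start node.

```python
from collections import deque


def degree_centrality(graph):
    """Count the edges entering plus the edges leaving each node of a directed graph."""
    degree = {node: len(targets) for node, targets in graph.items()}
    for targets in graph.values():
        for target in targets:
            degree[target] += 1
    return degree


def total_distance(graph, start):
    """Sum of shortest-path lengths from start to every node reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:  # breadth-first search over unweighted edges
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sum(dist.values())


# Hypothetical toy pathway: node -> list of directly downstream nodes.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}

DEGREE = degree_centrality(GRAPH)
print(DEGREE)                      # {'A': 2, 'B': 2, 'C': 2, 'D': 3, 'E': 1}
print(total_distance(GRAPH, "A"))  # 7
```

Node D has the highest degree because it collects the signal from both B and C and passes it on to E.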

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profile across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
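The pair score described above, a similarity measure divided by the shortest path distance, can be sketched as follows. Pearson correlation is used here as one possible similarity measure; the expression profiles, the toy network, and all function names are hypothetical, and the sketch is not the exact statistic of ScorePAGE or PWEA.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def shortest_path_length(graph, source, target):
    """Breadth-first distance in an undirected graph (node -> set of neighbours)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None  # target is not reachable from source


def pair_score(graph, expression, u, v):
    """Similarity of u and v divided by their shortest path distance."""
    distance = shortest_path_length(graph, u, v)
    if not distance:  # unreachable pair, or u == v
        return None
    return pearson(expression[u], expression[v]) / distance


# Hypothetical chain A - B - C with expression profiles over four samples.
NETWORK = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
EXPRESSION = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}

print(round(pair_score(NETWORK, EXPRESSION, "A", "B"), 6))  # 1.0
print(round(pair_score(NETWORK, EXPRESSION, "A", "C"), 6))  # -0.5
```

Dividing by the distance means that the perfectly anti-correlated pair A and C, which are two steps apart, contributes half the magnitude of the adjacent pair A and B.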

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions, such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all its genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes, and meaningful changes that are unlikely to occur just by chance and that, therefore, are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
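The text above does not prescribe a particular correction for multiple comparisons. As one commonly used option, and purely as an assumption for illustration, the sketch below applies the Benjamini-Hochberg false discovery rate adjustment to a hypothetical list of pathway p-values; the function name and the values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four pathways.
RAW = [0.01, 0.04, 0.03, 0.20]
ADJUSTED = benjamini_hochberg(RAW)
print([round(p, 4) for p in ADJUSTED])  # [0.04, 0.0533, 0.0533, 0.2]
```

At a threshold of 0.05, three of the four raw p-values would be called significant, but only one remains after the adjustment.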

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
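The hypergeometric assumption described above can be made concrete with a short calculation. The sketch below computes the upper-tail probability of observing at least a given number of DE genes on a pathway; the function name and all counts are hypothetical, and the calculation inherits the independence assumption that the text criticizes.

```python
from math import comb


def hypergeom_upper_tail(n_total, n_de, n_pathway, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric.

    n_total   : number of genes measured
    n_de      : number of DE genes among them
    n_pathway : number of measured genes on the pathway
    n_overlap : number of DE genes observed on the pathway
    """
    denominator = comb(n_total, n_pathway)
    upper = min(n_de, n_pathway)
    numerator = sum(
        comb(n_de, k) * comb(n_total - n_de, n_pathway - k)
        for k in range(n_overlap, upper + 1)
    )
    return numerator / denominator


# Hypothetical counts: 20,000 measured genes, 1,000 DE genes,
# a 100-gene pathway with 15 DE genes on it.
p_value = hypergeom_upper_tail(20000, 1000, 100, 15)
```

Because the genes are treated as independent draws, this p-value ignores the interactions encoded in the pathway graph, which is the limitation noted above.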

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable is assigned to an edge and is the probability that the up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probability follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
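The distributional relationships named above can be written out explicitly. The block below states the standard conjugacy results for the beta-binomial and Dirichlet-multinomial pairs, followed by the factorization implied by an independence assumption. The symbols ($k$, $n$, $\theta$, $\alpha$, $\beta$, $m$) are notation assumed for this sketch and are not taken from the BPA or BAPA-IGGFD publications, whose exact formulations may differ.

```latex
% Beta-binomial pair: k successes in n trials with success probability \theta
k \mid \theta \sim \mathrm{Binomial}(n, \theta), \quad
\theta \sim \mathrm{Beta}(\alpha, \beta)
\;\Longrightarrow\;
\theta \mid k \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k)

% Dirichlet-multinomial pair: the multivariate generalization
(k_1, \dots, k_m) \mid \boldsymbol{\theta} \sim \mathrm{Multinomial}(n, \boldsymbol{\theta}), \quad
\boldsymbol{\theta} \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_m)
\;\Longrightarrow\;
\boldsymbol{\theta} \mid \mathbf{k} \sim \mathrm{Dirichlet}(\alpha_1 + k_1, \dots, \alpha_m + k_m)

% Joint distribution under an independence assumption
p(x_1, \dots, x_m) = \prod_{i=1}^{m} p(x_i)
```

The last line is the step that fails when two edges share a node, since the corresponding variables then depend on the same gene.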

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses a Petri net (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and the Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
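To illustrate how edge-level p-values can be combined into a single statistic, the sketch below implements the classical, unweighted Fisher's method. This is a simplification: DRAGEN uses a weighted variant and derives the pathway p-value by permutation, neither of which is reproduced here. The function name and the input values are assumptions made for the example.

```python
from math import exp, log


def fisher_combine(p_values):
    """Combine independent p-values with the unweighted Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has the closed form
        exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    All p-values must be strictly greater than zero.
    """
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        total += term
    return exp(-half_stat) * total


# Hypothetical p-values for the edges of one pathway.
edge_p_values = [0.01, 0.20, 0.03, 0.50]
combined_p = fisher_combine(edge_p_values)
```

The closed form avoids any dependency on a statistics library, at the cost of being valid only for the unweighted case.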

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allow us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000, despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily satisfy their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).
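One simple way to quantify how far a set of null p-values departs from uniformity is the Kolmogorov-Smirnov distance between their empirical distribution and the Uniform(0, 1) distribution. The sketch below is only an illustration of that idea; it is not the procedure used by the authors, and the two input lists are synthetic.

```python
def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of p_values and the Uniform(0, 1) CDF."""
    xs = sorted(p_values)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x; the uniform CDF equals x.
        distance = max(distance, (i + 1) / n - x, x - i / n)
    return distance


# Synthetic examples: an evenly spread set and a set pushed towards zero.
n_points = 100
evenly_spread = [(i + 0.5) / n_points for i in range(n_points)]
biased_to_zero = [u ** 3 for u in evenly_spread]

d_uniform = ks_distance_from_uniform(evenly_spread)
d_biased = ks_distance_from_uniform(biased_to_zero)
```

A distance near zero is consistent with uniform p-values, while a large distance signals the kind of skew towards zero or one discussed in this section.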

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis. Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
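The relabeling scheme described above can be sketched as follows. The sample identifiers, the helper name `random_split`, the fixed seed, and the small repeat count are assumptions made for illustration (the text uses 10,000 repeats per design), and the smaller designs are assumed to be drawn from the same pool of 140 control samples.

```python
import random


def random_split(sample_ids, n_control, n_disease, rng):
    """Draw disjoint pseudo-control and pseudo-disease groups from a pool of controls."""
    chosen = rng.sample(sample_ids, n_control + n_disease)
    return chosen[:n_control], chosen[n_control:]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [f"ctrl_{i:03d}" for i in range(140)]  # placeholder sample IDs

# (pseudo-control, pseudo-disease) group sizes from the text.
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]
n_repeats = 3  # the text uses 10,000 per design

simulated = []
for n_control, n_disease in designs:
    for _ in range(n_repeats):
        control, disease = random_split(samples, n_control, n_disease, rng)
        simulated.append({"design": (n_control, n_disease),
                          "control": control,
                          "disease": disease})
```

Each simulated dataset would then be passed to the pathway analysis method under study, and the resulting p-values collected per pathway to form the empirical null distribution.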

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, lifestyle, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
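The consequence for false positives can be illustrated numerically. If null p-values were uniform, the fraction falling at or below a threshold would match the threshold itself; a distribution skewed towards zero inflates that fraction. The sketch below uses synthetic p-values and a hypothetical helper name, so the numbers say nothing about any particular method or pathway.

```python
def empirical_false_positive_rate(null_p_values, alpha=0.05):
    """Fraction of null p-values that would be called significant at level alpha."""
    return sum(p <= alpha for p in null_p_values) / len(null_p_values)


# Synthetic null p-values: one uniform set, one skewed towards zero.
n_points = 1000
uniform_null = [(i + 0.5) / n_points for i in range(n_points)]
skewed_null = [u ** 2 for u in uniform_null]

rate_uniform = empirical_false_positive_rate(uniform_null)
rate_skewed = empirical_false_positive_rate(skewed_null)
```

In this toy setting the skewed set yields a false positive rate several times larger than the nominal 5%, even though no true effect is present in either set.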

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex bio-molecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways", in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of the pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study, when such a pathway is available. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of datasets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015) using a different set of 36 real datasets, as well as by other, more recent papers.
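The target-pathway criterion described above reduces to two numbers per dataset: the position of the target pathway in the ranked output and its p-value. A minimal sketch, assuming the method output is available as a mapping from pathway name to p-value (the function name, pathway names, and values are hypothetical):

```python
def target_pathway_rank(pathway_p_values, target):
    """Return (rank, p-value) of the target pathway; rank 1 is the smallest p-value."""
    ranked = sorted(pathway_p_values, key=pathway_p_values.get)
    rank = ranked.index(target) + 1
    return rank, pathway_p_values[target]


# Hypothetical output of one method on a colon cancer versus healthy dataset.
method_output = {
    "Colorectal cancer": 0.003,
    "Cell cycle": 0.0004,
    "Insulin signaling pathway": 0.21,
    "Apoptosis": 0.04,
}
rank, p_value = target_pathway_rank(method_output, "Colorectal cancer")
```

Collecting these pairs across all benchmark datasets gives one distribution of ranks and one of p-values per method, which can then be compared between methods.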

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of datasets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or in the literature. Second, complex diseases are often associated not only with one target pathway but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data are provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D. Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. In Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. arXiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D’Eustachio, P., Schmidt, E., de Bono, B., … Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., … Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., … Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME’09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., … Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., … Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions—applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., … Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., … Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., … Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., … Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., … Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., … Romero, R. (2009). A novel signaling pathway impact analysis. Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., … Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., … Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0—making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., … Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.


Table 8.25.1 Pathway Analysis Tools, continued

Method | Availability (a) | HIPAA (b) | License (c) | Code (d) | Year | Reference
CePa | Standalone (CRAN), Web (https://mcube.nju.edu.cn/cgi-bin/cepa/main.pl) | No | GPL (2) | R | 2012 | (Gu, Liu, Cao, Zhang, & Wang, 2012)
THINK-Back-DS | Standalone, Web (https://eecs.umich.edu/db/think/software.html) | No | Free (f) | Java | 2012 | (Farfan, Ma, Sartor, Michailidis, & Jagadish, 2012)
TBScore | No implementation available | No | NA | NA | 2012 | (Ibrahim, Jassim, Cawthorne, & Langlands, 2012)
ACST | Standalone (available as article supplemental) | No | NA | R | 2012 | (Mieczkowski, Swiatek-Machado, & Kaminska, 2012)
EnrichNet | Web (https://www.enrichnet.org) | No | free | PHP | 2012 | (Glaab, Baudot, Krasnogor, Schneider, & Valencia, 2012)
clipper | Standalone (https://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2013 | (Martini, Sales, Massa, Chiogna, & Romualdi, 2013)
DEAP | Standalone (available as article supplemental) | No | GNU Lesser GPL | Python | 2013 | (Haynes, Higdon, Stanberry, Collins, & Kolker, 2013)
DRAGEN | Standalone (https://bioinfo.au.tsinghua.edu.cn/dragen) | No | NA | C++ | 2014 | (Ma, Jiang, & Jiang, 2014)
ToPASeq (g) | Standalone (Bioconductor) | No | AGPL-3 | R | 2016 | (Ihnatova & Budinska, 2015)
pDis | Standalone (Bioconductor) | No | Free (f) | R | 2016 | (Ansari, Voichita, Donato, Tagett, & Draghici, 2017)
SPATIAL | No implementation available | No | NA | NA | 2016 | (Bokanizad, Tagett, Ansari, Helmi, & Draghici, 2016)
BLMA (h) | Standalone (Bioconductor) | No | GPL (2) | R | 2017 | (Nguyen, Tagett, Donato, Mitrea, & Draghici, 2016; also see Internet Resources)

Signaling pathway analysis using multiple types of data
PARADIGM | Standalone (https://sbenz.github.com/Paradigm) | No | UCSC-CGB, free (f) | C | 2010 | (Vaske et al., 2010)
microGraphite | Standalone (https://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2014 | (Calura et al., 2014)
mirIntegrator | Standalone (Bioconductor) | No | GPL3 | R | 2016 | (Diaz et al., 2016; Nguyen et al., 2016)

Metabolic pathway analysis
ScorePAGE | No implementation available | No | NA | NA | 2004 | (Rahnenfuhrer et al., 2004)
TAPPA | No implementation available | No | NA | NA | 2007 | (Gao & Wang, 2007)
MetPA | Web (https://metpa.metabolomics.ca) | No | GPL (2) | PHP, R | 2010 | (Xia & Wishart, 2010)
MetaboAnalyst | Web (https://metpa.metabolomics.ca) | No | GPL (2) | R | 2011 | (Xia & Wishart, 2011; Xia, Sinelnikov, Han, & Wishart, 2015)

(a) Availability is a criterion that describes the implementation of the method as standalone or Web-based.
(b) HIPAA provides information about HIPAA compliance.
(c) License provides information about the type of the software license. GPL is an abbreviation for the GNU General Public License. AGPL is an abbreviation for the GNU Affero General Public License. CC BY-NC-ND 4 is an abbreviation of Creative Commons Attribution–NonCommercial-NoDerivatives 4.0 International Public License.
(d) Code shows the programming language used for the method implementation.
(e) Commercial method.
(f) Free for academic and non-commercial use. UCSC-CGB is the University of California Santa Cruz Cancer Genome Browser.
(g) ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA.
(h) BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

by chance. The drawbacks of this type of approach include: (i) it only considers the number of DE genes and completely ignores the magnitude of the actual expression changes, resulting in information loss; (ii) it assumes that genes are independent, which they are not (since the pathways are graphs that describe precisely how these genes influence each other); and (iii) it ignores the interactions between various genes and modules. Functional Class Scoring (FCS) approaches, such as Gene Set Enrichment Analysis (GSEA; Subramanian et al., 2005) and Gene Set Analysis (GSA; Efron & Tibshirani, 2007), have been developed to address some of the issues raised by ORA approaches. The main improvement of FCS is the observation that small but coordinated changes in expression of functionally related genes can have significant impact on pathways.
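Drawback (i) can be made concrete with a small sketch. The code below assumes the hypergeometric model that is commonly used for over-representation tests; the function name `ora_p_value` and all counts are hypothetical and are not taken from any surveyed tool. Note that the only inputs are four counts, so fold changes and pathway structure cannot influence the result.

```python
from math import comb


def ora_p_value(n_measured, n_de, pathway_size, pathway_de):
    """Upper-tail hypergeometric probability P(X >= pathway_de).

    n_measured   -- number of genes measured in the experiment
    n_de         -- number of those genes called differentially expressed
    pathway_size -- number of measured genes annotated to the pathway
    pathway_de   -- number of DE genes annotated to the pathway
    """
    max_k = min(n_de, pathway_size)
    denom = comb(n_measured, n_de)
    tail = sum(
        comb(pathway_size, k) * comb(n_measured - pathway_size, n_de - k)
        for k in range(pathway_de, max_k + 1)
    )
    return tail / denom


# Hypothetical experiment: 1000 measured genes, 50 DE genes,
# and a 20-gene pathway that contains 5 of the DE genes.
p = ora_p_value(n_measured=1000, n_de=50, pathway_size=20, pathway_de=5)
print(f"ORA p-value: {p:.4g}")
```

Because the signature carries no expression magnitudes, two experiments with the same counts but very different fold changes receive the same p-value under this model.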

ORA and FCS approaches are often referred to as gene set enrichment methods. Comprehensive lists of gene set analysis approaches, as well as comparisons between them, can be found in well-developed surveys (Emmert-Streib & Glazko, 2011; Kelder, Conklin, Evelo, & Pico, 2010; Khatri, Sirota, & Butte, 2012; Misman et al., 2009). While useful for the purpose for which they have been developed (to analyze sets of genes), these methods completely ignore the topology and interaction between genes. Topology-based approaches, which fully exploit all the knowledge about how genes interact as described by pathways, have been developed more recently. The first such techniques were ScorePAGE (Rahnenfuhrer, Domingues, Maydt, & Lengauer, 2004) for metabolic pathways and Impact Analysis (Draghici et al., 2007; Tarca et al., 2009) for signaling pathways. Figure 8.25.1 shows an example pathway named ERBB signaling pathway, in which the nodes represent genes and compounds, while the edges represent the known interactions between them. Gene set analysis approaches are not able to account for the topological order of genes, nor are they able to explain the signal propagation and mechanisms of the pathway. These approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discovery. In contrast, network-based pathway analysis can distinguish between this pathway and any other pathway with the same proportion of DE genes.
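The contrast drawn above can be illustrated with a toy example. This is a minimal sketch under stated assumptions: `toy_topology_score` is a hypothetical score invented for illustration (the fraction of pathway genes that are DE or lie downstream of a DE gene) and is not the scoring scheme of any surveyed method. The gene names A to D and both graphs are made up.

```python
def de_fraction(genes, de_genes):
    """Set-based statistic: proportion of pathway genes that are DE."""
    return len(de_genes & genes) / len(genes)


def downstream_reach(edges, start):
    """Nodes reachable from `start` (inclusive) along directed edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def toy_topology_score(genes, edges, de_genes):
    """Hypothetical score: fraction of genes that are DE or downstream of a DE gene."""
    affected = set()
    for gene in de_genes & genes:
        affected |= downstream_reach(edges, gene)
    return len(affected & genes) / len(genes)


genes = {"A", "B", "C", "D"}
de_genes = {"A"}
chain = [("A", "B"), ("B", "C"), ("C", "D")]           # A is the most upstream gene
reversed_chain = [("D", "C"), ("C", "B"), ("B", "A")]  # A is the most downstream gene

for name, edges in (("chain", chain), ("reversed", reversed_chain)):
    print(name, de_fraction(genes, de_genes), toy_topology_score(genes, edges, de_genes))
```

The set-based fraction is identical for both graphs, while the topology-aware toy score changes when the same DE gene moves from the top of the cascade to the bottom.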

Here, we provide a survey of 34 network-based methods developed for pathway analysis. Our survey of commercial tools for pathway analysis found iPathwayGuide (Advaita Corporation; https://www.advaitabio.com) and MetaCore (Thomson Reuters; https://www.thomsonreuters.com) to be topology based. Other commercial tools, such as Ingenuity Pathway Analysis (https://www.ingenuity.com) or Genomatix (https://www.genomatix.de), do not use pathway topology information and thus are not included in this review. The existence of only two commercial tools is more evidence of the challenge faced by the developers of such methods due to lack of standards.

In this document, we categorize and compare the methods based on the following criteria: the type of input the method accepts, graph model, and the statistical approaches used to evaluate gene and pathway changes. Under the heading “Experiment Input and Pathway Databases” below, we describe the types of input the surveyed methods use. Under the heading “Graph Models,” we provide details regarding the graph models used for the biological networks. Under “Pathway Scoring Strategies,” we discuss the statistical methods used by the surveyed methods to assess gene and pathway changes between two phenotypes. Under “Challenges in Pathway Analysis,” we discuss a number of outstanding challenges that need to be addressed in order to improve the reliability as well as the relevance of the next-generation pathway analysis approaches. We discuss the key elements and limitations of the surveyed methods without going into details of each technique. For a more detailed review and technical descriptions of network-based pathway analysis methods, please refer to Mitrea et al. (2013) or referenced manuscripts (Table 8.25.1).

NETWORK-BASED PATHWAY ANALYSIS

The term pathway analysis is used in a very broad context in the literature, including biological network construction and inference. Here, we focus on methods that are able to exploit biological knowledge in public repositories rather than on network inference approaches that attempt to infer or reconstruct pathways from molecular measurements. Table 8.25.1 shows the list of 34 pathway analysis approaches together with their availability and licensing. In this section, we start by describing the availability of each method before discussing the input format, graph model, and underlying statistical approaches.

Software Availability and Implementation

We often think that the main strength of an approach lies in its novelty and algorithm efficiency. However, the implementation and availability of a tool have become increasingly important for several reasons. First, software availability and version control are crucial for reproducing the experimental results that were used to assess the performance of the approach (Sandve, Nekrutenko, Taylor, & Hovig, 2013). For this reason, many journals request authors to make their software and data available before accepting methods articles. Second, if a software application is not ready-to-run, it is very unlikely that the intended audience (mostly life scientists) will invest the time to understand and implement complex algorithms. Practicality, user-friendliness, output format, and type of interface are all to be considered. Depending on the desired availability and intended audience, a software package may be implemented as standalone or Web based. Among the 34 approaches, there are 30 that are available either as a standalone software package or Web service.

Typically, standalone tools need to be installed on local machines or servers, which often requires some administrative skills. Most standalone tools depend on full or partial copies of public pathway databases stored locally, and need to be updated periodically. Advantages of standalone tools include: (i) instant availability that does not require Internet access, and (ii) the security and privacy of the experimental data. Web-based tools, on the other hand, run the analyses on a remote server, providing computational power and a graphical interface. The major advantage of Web-based tools is that they are user-friendly and do not require a separate local installation. From the accessibility perspective, Web-based tools have the advantage of being available from any location as long as there is an Internet connection and a browser available. Also, the update is almost transparent to the client. This makes the user’s task easy and enables collaboration, since users all over the world can utilize the same method without the burden of installing it or keeping it up-to-date. There are methods that provide both Web-based and standalone implementations.

HIPAA compliance may also be a factor in certain applications that involve data coming from patients or data linked to other clinical data or clinical records. Currently, the iPathwayGuide is the only tool available that can do topological pathway analysis and is HIPAA compliant.

The programming language and style used for implementation also play an important role in the acceptance of a method. Software tools that are neatly implemented and packaged are more appealing compared to those that do not have ready-to-use implementations. Many of the methods are implemented in the R programming language and are available as software packages from Bioconductor, CRAN, or the author’s Web site. Their popularity among biologists and bioinformaticians is due to the fact that many bioinformatics-dedicated packages are available in R. In addition, the rigorous review procedure provided by Bioconductor and CRAN makes the software packages more standardized and reliable.

Experiment Input and Pathway Databases

A pathway analysis approach typically requires two types of input: experiment data and known biological networks. Experiment data is usually collected from high-throughput experiments that compare a condition phenotype to a control phenotype. A condition phenotype can be a disease, a drug treatment, or the knock-out (KO) of a gene, while a control phenotype can be a healthy state, a different disease or drug treatment, or wild-type (non-KO) samples. Experimental data can be obtained from multiple technologies that produce different types of data: gene expression, protein abundance, metabolite concentration, miRNA expression, etc. Biological networks or pathways are often represented in the form of graphs that capture our current knowledge about the interactions of genes, proteins, metabolites, or compounds in an organism. The pathway data is accumulated, updated, and refined by amassing knowledge from scientific literature describing individual interactions or high-throughput experiment results.

Table 8.25.2 shows a summary of input format and mathematical modeling of the surveyed approaches. Most pathway analysis methods analyze data from high-throughput experiments, such as microarrays, next-generation sequencing, or proteomics. They accept either a list of gene IDs or a list of such gene IDs associated with measured changes. These changes could be measured with different technologies and therefore can serve as proxies for different biochemical entities. For instance, one could use gene expression changes measured with microarrays, or protein levels measured with a proteomic approach, etc.

Different analysis methods use different input formats. Many methods accept a list of all genes considered in the experiment together with their expression values. Some analysis methods select a subset of genes considered to be differentially expressed (DE) based on a predefined cut-off. The cut-off is typically applied on fold-change, p-value, or both. These methods use the list of DE genes and their corresponding statistics (fold-change, p-value) as input. Other methods use only the list of DE genes without corresponding expression values, because their scoring methods are based only on the relative positions of the genes in the graph.
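The three input formats described above can be sketched as plain data structures. This is an illustrative sketch only: the gene symbols are real members of the ERBB pathway, but every fold-change and p-value is invented, and the helper `select_de` and its default thresholds are hypothetical rather than taken from any surveyed method.

```python
# gene -> (log2 fold change, p-value); all values are invented for illustration
measured = {
    "EGFR": (2.1, 0.001),
    "ERBB2": (1.4, 0.020),
    "GRB2": (0.3, 0.400),
    "SOS1": (-1.2, 0.030),
    "KRAS": (-0.4, 0.200),
}


def select_de(measured, min_abs_lfc=1.0, max_p=0.05):
    """Keep genes passing both a fold-change cut-off and a p-value cut-off."""
    return {
        gene: stats
        for gene, stats in measured.items()
        if abs(stats[0]) >= min_abs_lfc and stats[1] <= max_p
    }


# Format 1: all measured genes with their statistics.
all_genes_input = measured

# Format 2: DE genes together with their fold-change and p-value.
de_with_stats = select_de(measured)

# Format 3: DE gene identifiers only, with no expression values attached.
de_list_only = sorted(de_with_stats)

print(de_list_only)
```

A method that consumes the third format can only use graph positions of the listed genes, since the magnitudes have already been discarded.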

Methods which use cut-offs are sensitive tothe chosen threshold value because a smallchange in the cut-off may drastically changethe number of selected genes (Nam amp Kim2008) In addition they typically use the mostsignificant genes and discard the rest whoseweaker but coordinated changes may also havesignificant impact on pathways Genes withmoderate differential expression may be losteven though they might be important play-ers in the impacted pathways (Ben-ShaulBergman amp Soreq 2005) Furthermore thegenes included in the set of DE genes canvary dramatically if the selection methods arechanged Hence the results of pathway anal-yses based on DE genes may be vastly differ-ent depending on both the selection method aswell as the threshold value (Pan Lih amp Co-hen 2005) Furthermore for the same disease

AnalyzingMolecularInteractions

8257

Current Protocols in Bioinformatics Supplement 61

Table 8252 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the SurveyedMethods is Presented

Method Experiment inputa Pathway databaseb Graph modelc Pathway scoring

Signaling pathway analysis using gene expression

MetaCore DE genes Proprietary canonicalpathway genome-scalenetwork

Single-type directed Hierarchicallyaggregated

Pathway-Express DE genes change ormeasured geneschanged

KEGG signaling Single-type directed Hierarchicallyaggregated

PathOlogist Measured genesexpression

KEGG Multi-type directed Hierarchicallyaggregated

iPathwayGuide DE genes change ormeasured geneschange

KEGG signalingReactome NCIBioCarta

Single-type directed Hierarchicallyaggregated

SPIA DE genes change KEGG signaling Single-type directed Hierarchicallyaggregated

NetGSA Measured genesexpression

KEGG signaling Single-type directed Multivariate analysis

PWEA Measured genesexpression

YeastNet Single-typeundirected

Hierarchicallyaggregated

TopoGSA DE genes PPI network KEGG Single-typeundirected

Hierarchicallyaggregated

TopologyGSA Measured genesexpression

NCI-PID Single-typeundirected

Multivariate analysis

DEGraph Measured genesexpression

KEGG Single-typeundirected

Multivariate analysis

GGEA Measured genesexpression

KEGG Single-type directed Aggregate fuzzysimilarity

BPA Measured genesexpressionmdashwithcut-off

NCI-PID Single-type DAG Bayesian network

GANPA DE genes change ormeasured genesexpression

PPI network KEGGReactome NCI-PIDHumanCyc

Single-typeundirected

Hierarchicallyaggregated

ROntoTools DE genes change ormeasured geneexpression

KEGG signaling Single-type directed Hierarchicallyaggregated

BAPA-IGGFD Measured genesexpression - withcut-off

Literature-basedinteraction databaseKEGG WikiPathwaysReactome MSigDBGO BP PANTHERconstructed geneassociation networkfrom PPIsco-annotation in GOBiological Process (BP)and co-expression inmicroarray data

Single-type DAG Bayesian network

continued

8258

Supplement 61 Current Protocols in Bioinformatics

Table 8252 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the SurveyedMethods is Presented continued

Method Experiment inputa Pathway databaseb Graph modelc Pathway scoring

CePa DE genes expressionor measured genesexpression

NCI-PID Single-type directed Hierarchicallyaggregated

THINK-Back-DS DE genes changemeasured genesexpression

KEGG PANTHERBioCarta ReactomeGenMAPP

Single-type directed Hierarchicallyaggregated

TBScore DE genes change KEGG signaling Single-type directed Hierarchicallyaggregated

ACST Measured genesexpression

KEGG signaling Single-type directed Hierarchicallyaggregated

EnrichNet DE genes list PPI network KEGGBioCartaWikiPathwaysReactome NCI-PIDInterPro GO withSTRING 90

Single-typeundirected

Hierarchicallyaggregated

clipper Measured genesexpression

BioCarta KEGGNCI-PID Reactome

Single-type directed Multivariate analysis

DEAP Measured genesexpression

KEGG Reactome Single-type directed Hierarchicallyaggregated

DRAGEN Measured genesexpression

RegulonDB M3DHTRIdb ENCODEMSigDB

Single-type directed Linear regression

ToPASeqe Measured genesexpression

KEGG Single-type directed Hierarchical ampmultivariate

pDis DE genes change ormeasured geneschanged

KEGG signaling Single-type directed Hierarchicallyaggregated

SPATIAL DE genes change KEGG signaling Single-type directed Hierarchicallyaggregated

BLMAf Measured genesexpression

KEGG signaling Single-type directed Hierarchicallyaggregated

Signaling pathway analysis using multiple types of data

PARADIGM Measured genesexpression copynumber and proteinslevels

Constructed PPInetworks from MIPSDIP BIND HPRDIntAct and BioGRID

Multi-type directed Hierarchicallyaggregated

microGraphite Measured geneexpression andmiRNA expression

BioCarta KEGGNCI-PID Reactome

Single-type directed Multivariate analysis

mirIntegrator DE genes changeand DE miRNAchange

KEGG signalingmiRTarBase

Single-type directed Hierarchicallyaggregated

Metabolic pathway analysis

ScorePAGE Measured genesexpression

KEGG metabolic Single-typeundirected

Hierarchicallyaggregated

continued

8259

Current Protocols in Bioinformatics Supplement 61

Table 8252 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the SurveyedMethods is Presented continued

Method Experiment inputa Pathway databaseb Graph modelc Pathway scoring

TAPPA Measured genesexpression

KEGG metabolic Single-typeundirected

Hierarchicallyaggregated

MetPA DE metaboliteschange

KEGG metabolic Single-type directed Hierarchicallyaggregated

MetaboAnalyst DE genes change andDE metaboliteschange

KEGG metabolic Single-type directed Hierarchicallyaggregated

aExperiment input shows the experiment data input format for each method ldquoDErdquo means differentially expressed ldquoChangerdquo means fold-change valueor t-statistics when comparing genemetabolites values between two phenotypes ldquoMeasuredrdquo means the list of all the genesmetabolites measured inthe experiment ldquoListrdquo represents a list of genesmetabolites identifiers (eg symbols) ldquoWith cut-offrdquo show methods that take as input the list of allmeasured genes and in the analysis they mark the DE genesbPathway database shows the name of the knowledge source for biological interactionscGraph model is a characteristic that shows if the graph has one or multiple types of nodes as well as if directed or undirecteddThe package ROntoTools (Bioconductor) allows for analysis using expression values of all geneseToPASeq provides an R package that runs TopologyGSA DEGraph clipper SPIA TBScore PWEA TAPPAfBLMA provides an R package for bi-level meta-analysis that runs SPIA ORA GSA and PADOG using one or multiple expression datasets

independent studies or measurements often produce different sets of differentially expressed (DE) genes (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Ein-Dor, Zuk, & Domany, 2006; Tan et al., 2003). This makes approaches that use DE genes as input appear even more unreliable.

Usually, pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds and edges to represent reactions that transform one or more compound(s) into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges rather than with nodes, as they are in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly to all available pathways. This is why the analysis of metabolic pathways is still generally and arguably underdeveloped (only 128 Google Scholar citations to date for the original ScorePAGE paper; Rahnenführer et al., 2004), and there are not many other methods available for metabolic pathway analysis. However, the topology-based analysis of signaling pathways has been very successful (over 1,200 citations to date for the two impact analysis papers mentioned above; Draghici et al., 2007; Tarca et al., 2009), and over 30 other methods have been developed since.

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.

Publicly available curated pathway databases used by methods listed in Table 8.25.2 are KEGG (Ogata et al., 1999), NCI-PID (Schaefer et al., 2009), BioCarta (https://www.biocarta.com), WikiPathways (Pico et al., 2008), PANTHER (Mi et al., 2005), and Reactome (Joshi-Tope et al., 2005). These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG includes only about 5,000 human genes in signaling pathways, while the number of protein-coding


Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive or negative regulation, or "involved in." Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation, when the bait-prey relation is considered (top), or undirected, when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database or may accept several.
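As a minimal sketch of the parsing step described above, the following Python snippet turns a list of directed interactions into an adjacency matrix. The function name, the toy genes, and the row-to-column orientation are assumptions made for illustration; this is not the file format or matrix convention of SPIA or of any other surveyed tool.

```python
def build_adjacency_matrix(nodes, edges):
    """Return a directed adjacency matrix as a list of lists.

    nodes: list of gene identifiers (their order defines the indices).
    edges: iterable of (source, target) pairs; matrix[i][j] == 1 means
           there is an edge from nodes[i] to nodes[j] (assumed convention).
    """
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


# Hypothetical three-gene cascade: A -> B -> C, plus a shortcut A -> C.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C"), ("A", "C")]
adjacency = build_adjacency_matrix(nodes, edges)
```

A real parser would also need to record the interaction type (activation, inhibition, and so on) carried by each edge, which a single 0/1 matrix cannot hold.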

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than


single-type, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe "all" and "any" relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connection only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find "important" submodules, i.e., those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the research hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is re-tested on each of them.
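The moralization step defined in the bracketed note can be sketched in a few lines of Python. The dictionary-of-children input and the function name are assumptions made for this illustration; this is not the TopologyGSA implementation, and the later triangulation step is not shown.

```python
from itertools import combinations


def moralize(dag):
    """Build the moral graph of a DAG.

    dag: dict mapping each node to the set of its children.
    Returns the undirected edges as a set of frozensets.
    """
    # Collect the parents of every node.
    parents = {}
    for node, children in dag.items():
        parents.setdefault(node, set())
        for child in children:
            parents.setdefault(child, set()).add(node)

    undirected = set()
    for child, node_parents in parents.items():
        # Drop the direction of every parent -> child edge.
        for parent in node_parents:
            undirected.add(frozenset((parent, child)))
        # "Marry" all pairs of parents that share this child.
        for first, second in combinations(sorted(node_parents), 2):
            undirected.add(frozenset((first, second)))
    return undirected


# Hypothetical pathway: A and B both regulate C, and C regulates D.
dag = {"A": {"C"}, "B": {"C"}, "C": {"D"}}
moral_edges = moralize(dag)
```

In this toy graph the only added edge is A-B, because A and B are the two parents of C.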

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.
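To illustrate why a separate interaction node makes "all" (AND) and "any" (OR) relations easy to express, here is a toy Python sketch. The Boolean states, the gate labels, and the function name are assumptions for illustration only; PathOlogist and PARADIGM work with scores and probabilities rather than with Boolean values.

```python
def interaction_active(input_states, gate):
    """Decide whether an interaction node fires, given its input components.

    input_states: list of booleans, one per input component node.
    gate: "AND" if all inputs are required, "OR" if any input suffices.
    """
    if gate == "AND":
        return all(input_states)
    if gate == "OR":
        return any(input_states)
    raise ValueError("gate must be 'AND' or 'OR'")


# Hypothetical complex formation needs both subunits (AND gate) ...
complex_formed = interaction_active([True, False], "AND")
# ... while either of two redundant kinases can phosphorylate a target (OR gate).
target_phosphorylated = interaction_active([True, False], "OR")
```

In a single-type graph, the two cases above would both appear as two edges pointing at the same target, so the distinction would be lost.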

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for the categories of scoring algorithms.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.

Node level scoring

Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures,


Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian network, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

(iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the network adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
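Two of the centrality measures described above, degree centrality and the summed shortest-path distance that underlies closeness, can be sketched in plain Python as follows. The graph encoding and function names are assumptions for illustration, and none of the listed methods is claimed to compute its node scores exactly this way.

```python
from collections import deque


def degree_centrality(graph):
    """Count the edges entering and leaving each node.

    graph: dict mapping every node to the set of its direct successors.
    """
    degree = {node: len(successors) for node, successors in graph.items()}
    for successors in graph.values():
        for target in successors:
            degree[target] = degree.get(target, 0) + 1
    return degree


def total_shortest_distance(graph, source):
    """Sum of shortest-path lengths from source to every reachable node.

    Uses breadth-first search; unreachable nodes are simply ignored.
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, ()):
            if successor not in distance:
                distance[successor] = distance[node] + 1
                queue.append(successor)
    return sum(distance.values())


# Hypothetical signaling cascade.
graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"D"}, "D": set()}
degrees = degree_centrality(graph)
distance_from_a = total_shortest_distance(graph, "A")
```

Closeness is usually reported as a normalized inverse of this summed distance, so that nodes nearer to the rest of the network receive higher scores.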

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profile across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
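The pairwise score described above, a similarity measure divided by the shortest path distance, can be sketched as follows. Pearson correlation is used here as one possible similarity measure; the function names and toy data are assumptions, and the exact formulas used by ScorePAGE and PWEA may differ.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


def shortest_path_length(graph, source, target):
    """Breadth-first search distance; None if target is unreachable."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None


def pair_score(graph, expression, gene1, gene2):
    """Similarity divided by shortest-path distance (0.0 if not applicable)."""
    distance = shortest_path_length(graph, gene1, gene2)
    if not distance:  # unreachable pair, or the same gene twice
        return 0.0
    return pearson(expression[gene1], expression[gene2]) / distance


# Hypothetical undirected chain A - B - C with toy expression profiles.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
expression = {"A": [1, 2, 3, 4], "B": [4, 1, 3, 2], "C": [2, 4, 6, 8]}
score_ac = pair_score(graph, expression, "A", "C")
```

Here A and C are perfectly correlated but two steps apart, so their pair score is half of their correlation.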

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this


model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions, such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all its genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
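The non-linear, TAPPA-style aggregation described above can be sketched as a sum over connected node pairs. The specific choices here (the sign of the sum of the two node values as the weight, and the square root of the absolute value as the node score) are assumptions based on the description in the text, not a verified reimplementation of TAPPA.

```python
from math import sqrt


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pathway_score(node_values, edges):
    """Sign-weighted sum of products of node scores over connected pairs.

    node_values: dict gene -> normalized log expression for one sample.
    edges: iterable of (gene1, gene2) pairs; unconnected pairs have
           weight 0 and are therefore simply not listed.
    """
    score = 0.0
    for gene1, gene2 in edges:
        x1, x2 = node_values[gene1], node_values[gene2]
        # Positive weight for joint up-regulation, negative for joint
        # down-regulation (assumed form of the sign function).
        score += sign(x1 + x2) * sqrt(abs(x1)) * sqrt(abs(x2))
    return score


# Hypothetical sample: A and B are up-regulated, C is strongly down-regulated.
node_values = {"A": 4.0, "B": 1.0, "C": -9.0}
edges = [("A", "B"), ("B", "C")]
score = pathway_score(node_values, edges)
```

Because each term multiplies two node scores, the result depends on which genes are connected, not only on how strongly each gene changes.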

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes, and meaningful changes that are unlikely to occur just by chance and that, therefore, are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
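The final step, correcting the per-pathway p-values for multiple comparisons, can be sketched with the Benjamini-Hochberg false discovery rate procedure. The text does not prescribe a particular correction, so the choice of Benjamini-Hochberg, the function name, and the toy p-values are assumptions for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four pathways.
pathway_p_values = [0.01, 0.04, 0.03, 0.5]
adjusted_p_values = benjamini_hochberg(pathway_p_values)
```

A pathway is then reported as significant when its adjusted p-value falls below the user-defined threshold.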

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to


calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
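For the hypergeometric case mentioned above, the p-value is the probability of seeing at least the observed number of DE genes on the pathway when DE genes are drawn at random from all measured genes. The sketch below uses only the standard library; the parameter names and the toy counts are assumptions for illustration.

```python
from math import comb


def hypergeometric_p_value(n_measured, n_de, n_pathway, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_measured: number of genes measured in the experiment
    n_de:       number of DE genes among them
    n_pathway:  number of measured genes that belong to the pathway
    n_overlap:  number of DE genes that fall on the pathway
    """
    upper = min(n_de, n_pathway)
    total = comb(n_measured, n_de)
    tail = sum(
        comb(n_pathway, k) * comb(n_measured - n_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical experiment: 20 measured genes, 5 DE genes,
# a pathway with 4 genes of which 3 are DE.
p_value = hypergeometric_p_value(20, 5, 4, 3)
```

As the text points out, this calculation treats genes as independent draws, which is precisely the assumption that the pathway topology contradicts.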

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume that the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable, assigned to an edge, is the probability that up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probability follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. As such, BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of the four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925).


GGEA, on the other hand, uses a Petri net (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and a Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
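The two ingredients described for these methods, a summary statistic that combines edge-level p-values and a permutation-based p-value for that statistic, can be sketched as follows. The sketch uses the plain (unweighted) Fisher statistic rather than the weighted variant, and the function names and toy numbers are assumptions; it is not the DRAGEN or GGEA code.

```python
from math import log


def fisher_statistic(p_values):
    """Unweighted Fisher combination: X = -2 * sum(ln p_i)."""
    return -2.0 * sum(log(p) for p in p_values)


def permutation_p_value(observed, null_statistics):
    """Fraction of permuted statistics at least as large as the observed one.

    One is added to numerator and denominator so that the estimate is
    never exactly zero with a finite number of permutations.
    """
    exceed = sum(1 for statistic in null_statistics if statistic >= observed)
    return (exceed + 1) / (len(null_statistics) + 1)


# Hypothetical edge p-values for one pathway, and statistics obtained
# from four permutations of the sample labels.
observed_statistic = fisher_statistic([0.01, 0.2, 0.5])
empirical_p = permutation_p_value(3.0, [1.0, 2.0, 3.0, 4.0])
```

In practice the null statistics would come from recomputing the summary statistic on many datasets with permuted sample labels, typically thousands rather than four.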

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000, despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays, such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations, so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis: Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed,


2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
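The random relabeling used in this simulation can be sketched as follows. Sample identifiers, function names, and the fixed seed are assumptions made for illustration; downloading and preprocessing the GEO datasets and running the three pathway analysis methods are not shown.

```python
import random


def random_split(sample_ids, n_control, n_disease, rng):
    """Draw disjoint pseudo-control and pseudo-disease groups at random."""
    chosen = rng.sample(sample_ids, n_control + n_disease)
    return chosen[:n_control], chosen[n_control:]


def simulate_null_splits(sample_ids, n_control, n_disease, n_datasets, seed=0):
    """Generate n_datasets random labelings of pooled control samples."""
    rng = random.Random(seed)
    return [
        random_split(sample_ids, n_control, n_disease, rng)
        for _ in range(n_datasets)
    ]


# Hypothetical identifiers for the 140 pooled control samples.
samples = [f"S{i}" for i in range(140)]
# Five simulated datasets of 10 "control" and 20 "disease" samples
# (the study described above uses 10,000 per design).
splits = simulate_null_splits(samples, 10, 20, n_datasets=5)
```

Since every sample is in fact a control, any pathway reported as significant on such a split is a false positive by construction.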

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore, we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
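One simple way to quantify how far an empirical null distribution departs from uniformity is the Kolmogorov-Smirnov distance to the Uniform(0, 1) distribution. The sketch below is not the procedure used to produce the figure; the function name `ks_distance_from_uniform` is hypothetical, and the p-values fed to it in practice would be the per-pathway null p-values obtained from the simulated datasets.

```python
def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of p_values and the Uniform(0, 1) CDF.

    A value near 0 indicates an approximately uniform null distribution;
    larger values indicate bias towards zero or towards one.
    """
    xs = sorted(p_values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one p-value")
    gap = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x,
        # while the uniform CDF equals x at that point.
        gap = max(gap, (i + 1) / n - x, x - i / n)
    return gap
```

A pathway whose null p-values pile up near zero (false-positive prone) or near one (false-negative prone) yields a large distance, whereas a well-calibrated pathway yields a distance close to zero.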

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex biomolecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.
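The difference between the two graph models, and the kind of graph modification a developer may have to perform, can be illustrated with a toy example. Everything in the sketch below is an assumption made for illustration: the dictionary layouts are not the KGML or NCI-PID schemas, the three-gene chain is a simplified fragment, and `collapse_process_nodes` is a hypothetical helper.

```python
# Edge-centric model (KEGG-like): the interaction type is stored on the
# directed edge itself. Toy data, not an actual KEGG pathway.
edge_model = {
    ("INS", "INSR"): "activation",
    ("INSR", "IRS1"): "activation",
}

# Process-node model (NCI-PID-like): each interaction is its own node
# that connects input molecules to output molecules. Toy data.
process_model = {
    "process_1": {"inputs": ["INS"], "outputs": ["INSR"], "type": "activation"},
    "process_2": {"inputs": ["INSR"], "outputs": ["IRS1"], "type": "activation"},
}


def collapse_process_nodes(processes):
    """Turn every process node into direct input -> output edges."""
    edges = {}
    for process in processes.values():
        for source in process["inputs"]:
            for target in process["outputs"]:
                edges[(source, target)] = process["type"]
    return edges
```

Collapsing is lossy in general: a process with several inputs that must all be present is flattened into independent edges, which is one reason why forcing a method onto a database it was not designed for can change its results.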

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways", in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of datasets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015) using a different set of 36 real datasets, as well as by other more recent papers.
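The target-pathway criterion can be made concrete with a short sketch. It assumes that each method returns one p-value per pathway for each dataset; summarizing by the median across datasets is one possible choice made here for illustration, and the function names are hypothetical.

```python
from statistics import median


def target_rank(p_values, target):
    """Return (rank, p-value) of the target pathway; rank 1 is the most significant."""
    ordered = sorted(p_values, key=p_values.get)  # ascending p-value
    return ordered.index(target) + 1, p_values[target]


def summarize_benchmark(results):
    """Summarize a method over many datasets.

    results: list of (p_values_by_pathway, target_pathway), one entry per dataset.
    Returns the median rank and median p-value of the target pathways;
    lower values indicate a better-performing method under this criterion.
    """
    ranks, pvals = zip(*(target_rank(p, t) for p, t in results))
    return median(ranks), median(pvals)
```

Because only the target pathway enters the summary, the ranking of every other pathway is ignored regardless of whether it is biologically relevant.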

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of datasets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated not with a single target pathway, but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, MD Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. arXiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions–applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131. Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0–making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.


Table 8.25.1 Pathway Analysis Tools, continued

Method | Availability [a] | HIPAA [b] | License [c] | Code [d] | Year | Reference

Signaling pathway analysis using multiple types of data
PARADIGM | Standalone (https://sbenz.github.com/Paradigm) | No | UCSC-CGB, free [f] | C | 2010 | (Vaske et al., 2010)
microGraphite | Standalone (https://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2014 | (Calura et al., 2014)
mirIntegrator | Standalone (Bioconductor) | No | GPL3 | R | 2016 | (Diaz et al., 2016; Nguyen et al., 2016)

Metabolic pathway analysis
ScorePAGE | No implementation available | No | NA | NA | 2004 | (Rahnenfuhrer et al., 2004)
TAPPA | No implementation available | No | NA | NA | 2007 | (Gao & Wang, 2007)
MetPA | Web (https://metpa.metabolomics.ca) | No | GPL (2) | PHP, R | 2010 | (Xia & Wishart, 2010)
MetaboAnalyst | Web (https://metpa.metabolomics.ca) | No | GPL (2) | R | 2011 | (Xia & Wishart, 2011; Xia, Sinelnikov, Han, & Wishart, 2015)

[a] Availability is a criterion that describes the implementation of the method as standalone or Web-based.
[b] HIPAA provides information about HIPAA compliance.
[c] License provides information about the type of the software license. GPL is an abbreviation for the GNU General Public License. AGPL is an abbreviation for the GNU Affero General Public License. CC BY-NC-ND 4.0 is an abbreviation of Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License.
[d] Code shows the programming language used for the method implementation.
[e] Commercial method.
[f] Free for academic and non-commercial use. UCSC-CGB is the University of California Santa Cruz Cancer Genome Browser.
[g] ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA.
[h] BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

by chance. The drawbacks of this type of approach include: (i) it only considers the number of DE genes and completely ignores the magnitude of the actual expression changes, resulting in information loss; (ii) it assumes that genes are independent, which they are not (since the pathways are graphs that describe precisely how these genes influence each other); and (iii) it ignores the interactions between various genes and modules. Functional Class Scoring (FCS) approaches, such as Gene Set Enrichment Analysis (GSEA; Subramanian et al., 2005) and Gene Set Analysis (GSA; Efron & Tibshirani, 2007), have been developed to address some of the issues raised by ORA approaches. The main improvement of FCS is the observation that small but coordinated changes in expression of functionally related genes can have significant impact on pathways.

ORA and FCS approaches are often referred to as gene set enrichment methods. Comprehensive lists of gene set analysis approaches, as well as comparisons between them, can be found in well-developed surveys (Emmert-Streib & Glazko, 2011; Kelder, Conklin, Evelo, & Pico, 2010; Khatri, Sirota, & Butte, 2012; Misman et al., 2009). While useful for the purpose for which they have been developed (to analyze sets of genes), these methods completely ignore the topology and interaction between genes. Topology-based approaches, which fully exploit all the knowledge about how genes interact as described by pathways, have been developed more recently. The first such techniques were ScorePAGE (Rahnenfuhrer, Domingues, Maydt, & Lengauer, 2004) for metabolic pathways and Impact Analysis (Draghici et al., 2007; Tarca et al., 2009) for signaling pathways. Figure 8.25.1 shows an example pathway, named ERBB signaling pathway, in which the nodes represent genes and compounds, while the edges represent the known interactions between them. Gene set analysis approaches are not able to account for the topological order of genes, nor are they able to explain the signal propagation and mechanisms of the pathway. These approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discovery. In contrast, network-based pathway analysis can distinguish between this pathway and any other pathway with the same proportion of DE genes.

Here, we provide a survey of 34 network-based methods developed for pathway analysis. Our survey of commercial tools for pathway analysis found iPathwayGuide (Advaita Corporation, https://www.advaitabio.com) and MetaCore (Thomson Reuters, https://www.thomsonreuters.com) to be topology based. Other commercial tools, such as Ingenuity Pathway Analysis (https://www.ingenuity.com) or Genomatix (https://www.genomatix.de), do not use pathway topology information and thus are not included in this review. The existence of only two commercial tools is more evidence of the challenge faced by the developers of such methods due to lack of standards.

In this document, we categorize and compare the methods based on the following criteria: the type of input the method accepts, the graph model, and the statistical approaches used to evaluate gene and pathway changes. Under the heading "Experiment Input and Pathway Databases" below, we describe the types of input the surveyed methods use. Under the heading "Graph Models," we provide details regarding the graph models used for the biological networks. Under "Pathway Scoring Strategies," we discuss the statistical methods used by the surveyed methods to assess gene and pathway changes between two phenotypes. Under "Challenges in Pathway Analysis," we discuss a number of outstanding challenges that need to be addressed in order to improve the reliability as well as the relevance of the next-generation pathway analysis approaches. We discuss the key elements and limitations of the surveyed methods without going into the details of each technique. For a more detailed review and technical descriptions of network-based pathway analysis methods, please refer to Mitrea et al. (2013) or the referenced manuscripts (Table 8.25.1).

NETWORK-BASED PATHWAY ANALYSIS

The term pathway analysis is used in a very broad context in the literature, including biological network construction and inference. Here, we focus on methods that are able to exploit biological knowledge in public repositories, rather than on network inference approaches that attempt to infer or reconstruct pathways from molecular measurements. Table 8.25.1 shows the list of 34 pathway analysis approaches together with their availability and licensing. In this section, we start by describing the availability of each method before discussing the input format, graph model, and underlying statistical approaches.

Software Availability and Implementation

We often think that the main strength of an approach lies in its novelty and algorithm efficiency. However, the implementation and availability of a tool have become increasingly important for several reasons. First, software availability and version control are crucial for reproducing the experimental results that were used to assess the performance of the approach (Sandve, Nekrutenko, Taylor, & Hovig, 2013). For this reason, many journals request authors to make their software and data available before accepting methods articles. Second, if a software application is not ready-to-run, it is very unlikely that the intended audience (mostly life scientists) will invest the time to understand and implement complex algorithms. Practicality, user-friendliness, output format, and type of interface are all to be considered. Depending on the desired availability and intended audience, a software package may be implemented as standalone or Web based. Among the 34 approaches, there are 30 that are available either as a standalone software package or as a Web service.

Typically, standalone tools need to be installed on local machines or servers, which often requires some administrative skills. Most standalone tools depend on full or partial copies of public pathway databases that are stored locally and need to be updated periodically. Advantages of standalone tools include: (i) instant availability that does not require Internet access, and (ii) the security and privacy of the experimental data. Web-based tools, on the other hand, run the analyses on a remote server, providing computational power and a graphical interface. The major advantage of Web-based tools is that they are user-friendly and do not require a separate local installation. From the accessibility perspective, Web-based tools have the advantage of being available from any location, as long as there is an Internet connection and a browser available. Also, updates are almost transparent to the client. This makes the user's task easy and enables collaboration, since users all over the world can utilize the same method without the burden of installing it or keeping it up-to-date. There are methods that provide both Web-based and standalone implementations.

HIPAA compliance may also be a factor in certain applications that involve data coming from patients, or data linked to other clinical data or clinical records. Currently, iPathwayGuide is the only tool available that can do topological pathway analysis and is HIPAA compliant.

The programming language and style used for implementation also play an important role in the acceptance of a method. Software tools that are neatly implemented and packaged are more appealing compared to those that do not have ready-to-use implementations. Many of the methods are implemented in the R programming language and are available as software packages from Bioconductor, CRAN, or the author's Web site. Their popularity among biologists and bioinformaticians is due to the fact that many bioinformatics-dedicated packages are available in R. In addition, the rigorous review procedure provided by Bioconductor and CRAN makes the software packages more standardized and reliable.

Experiment Input and Pathway Databases

A pathway analysis approach typically requires two types of input: experiment data and known biological networks. Experiment data is usually collected from high-throughput experiments that compare a condition phenotype to a control phenotype. A condition phenotype can be a disease, a drug treatment, or the knock-out (KO) of a gene, while a control phenotype can be a healthy state, a different disease or drug treatment, or wild-type (non-KO) samples. Experimental data can be obtained from multiple technologies that produce different types of data: gene expression, protein abundance, metabolite concentration, miRNA expression, etc. Biological networks or pathways are often represented in the form of graphs that capture our current knowledge about the interactions of genes, proteins, metabolites, or compounds in an organism. The pathway data is accumulated, updated, and refined by amassing knowledge from scientific literature describing individual interactions or high-throughput experiment results.

Table 8.25.2 shows a summary of the input format and mathematical modeling of the surveyed approaches. Most pathway analysis methods analyze data from high-throughput experiments, such as microarrays, next-generation sequencing, or proteomics. They accept either a list of gene IDs or a list of such gene IDs associated with measured changes. These changes could be measured with different technologies and therefore can serve as proxies for different biochemical entities. For instance, one could use gene expression changes measured with microarrays, or protein levels measured with a proteomic approach, etc.

Different analysis methods use different input formats. Many methods accept a list of all genes considered in the experiment together with their expression values. Some analysis methods select a subset of genes considered to be differentially expressed (DE) based on a predefined cut-off. The cut-off is typically applied on fold-change, p-value, or both. These methods use the list of DE genes and their corresponding statistics (fold-change, p-value) as input. Other methods use only the list of DE genes, without corresponding expression values, because their scoring methods are based only on the relative positions of the genes in the graph.
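To make the role of the cut-off concrete, the following minimal sketch selects DE genes from a table of per-gene statistics using a fold-change and a p-value threshold. The gene names, statistics, thresholds, and the helper select_de are all hypothetical and chosen only for illustration; they are not taken from any of the surveyed tools.

```python
# Hypothetical per-gene statistics: (log2 fold-change, p-value).
genes = {
    "G1": (2.10, 0.001),
    "G2": (1.05, 0.040),
    "G3": (0.98, 0.049),
    "G4": (-1.02, 0.051),
    "G5": (-0.95, 0.030),
    "G6": (0.20, 0.600),
}


def select_de(stats, min_abs_lfc, max_p):
    """Return the sorted IDs of genes that pass both cut-offs."""
    return sorted(
        gene
        for gene, (lfc, p) in stats.items()
        if abs(lfc) >= min_abs_lfc and p <= max_p
    )


# Two nearby, equally defensible threshold choices (assumed values).
strict = select_de(genes, min_abs_lfc=1.0, max_p=0.05)
relaxed = select_de(genes, min_abs_lfc=0.9, max_p=0.06)

print("strict :", strict)
print("relaxed:", relaxed)
```

With these toy numbers, a slight relaxation of both thresholds changes the DE list from two genes to five, so any method that takes only the DE list as input would see a very different experiment.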

Methods which use cut-offs are sensitive to the chosen threshold value, because a small change in the cut-off may drastically change the number of selected genes (Nam & Kim, 2008). In addition, they typically use the most significant genes and discard the rest, whose weaker but coordinated changes may also have significant impact on pathways. Genes with moderate differential expression may be lost, even though they might be important players in the impacted pathways (Ben-Shaul, Bergman, & Soreq, 2005). Furthermore, the genes included in the set of DE genes can vary dramatically if the selection methods are changed. Hence, the results of pathway analyses based on DE genes may be vastly different depending on both the selection method and the threshold value (Pan, Lih, & Cohen, 2005). Furthermore, for the same disease,

Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods is Presented

Method | Experiment input [a] | Pathway database [b] | Graph model [c] | Pathway scoring

Signaling pathway analysis using gene expression
MetaCore | DE genes | Proprietary canonical pathway, genome-scale network | Single-type, directed | Hierarchically aggregated
Pathway-Express | DE genes, change or measured genes, change [d] | KEGG signaling | Single-type, directed | Hierarchically aggregated
PathOlogist | Measured genes, expression | KEGG | Multi-type, directed | Hierarchically aggregated
iPathwayGuide | DE genes, change or measured genes, change | KEGG signaling, Reactome, NCI, BioCarta | Single-type, directed | Hierarchically aggregated
SPIA | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated
NetGSA | Measured genes, expression | KEGG signaling | Single-type, directed | Multivariate analysis
PWEA | Measured genes, expression | YeastNet | Single-type, undirected | Hierarchically aggregated
TopoGSA | DE genes | PPI network, KEGG | Single-type, undirected | Hierarchically aggregated
TopologyGSA | Measured genes, expression | NCI-PID | Single-type, undirected | Multivariate analysis
DEGraph | Measured genes, expression | KEGG | Single-type, undirected | Multivariate analysis
GGEA | Measured genes, expression | KEGG | Single-type, directed | Aggregate fuzzy similarity
BPA | Measured genes, expression, with cut-off | NCI-PID | Single-type, DAG | Bayesian network
GANPA | DE genes, change or measured genes, expression | PPI network, KEGG, Reactome, NCI-PID, HumanCyc | Single-type, undirected | Hierarchically aggregated
ROntoTools | DE genes, change or measured gene, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated
BAPA-IGGFD | Measured genes, expression, with cut-off | Literature-based interaction database, KEGG, WikiPathways, Reactome, MSigDB, GO BP, PANTHER; constructed gene association network from PPIs, co-annotation in GO Biological Process (BP), and co-expression in microarray data | Single-type, DAG | Bayesian network
CePa | DE genes, expression or measured genes, expression | NCI-PID | Single-type, directed | Hierarchically aggregated
THINK-Back-DS | DE genes, change; measured genes, expression | KEGG, PANTHER, BioCarta, Reactome, GenMAPP | Single-type, directed | Hierarchically aggregated
TBScore | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated
ACST | Measured genes, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated
EnrichNet | DE genes, list | PPI network, KEGG, BioCarta, WikiPathways, Reactome, NCI-PID, InterPro, GO, with STRING 9.0 | Single-type, undirected | Hierarchically aggregated
clipper | Measured genes, expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis
DEAP | Measured genes, expression | KEGG, Reactome | Single-type, directed | Hierarchically aggregated
DRAGEN | Measured genes, expression | RegulonDB, M3D, HTRIdb, ENCODE, MSigDB | Single-type, directed | Linear regression
ToPASeq [e] | Measured genes, expression | KEGG | Single-type, directed | Hierarchical & multivariate
pDis | DE genes, change or measured genes, change [d] | KEGG signaling | Single-type, directed | Hierarchically aggregated
SPATIAL | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated
BLMA [f] | Measured genes, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated

Signaling pathway analysis using multiple types of data
PARADIGM | Measured genes expression, copy number, and proteins levels | Constructed PPI networks from MIPS, DIP, BIND, HPRD, IntAct, and BioGRID | Multi-type, directed | Hierarchically aggregated
microGraphite | Measured gene expression and miRNA expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis
mirIntegrator | DE genes, change and DE miRNA, change | KEGG signaling, miRTarBase | Single-type, directed | Hierarchically aggregated

Metabolic pathway analysis
ScorePAGE | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated
TAPPA | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated
MetPA | DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated
MetaboAnalyst | DE genes, change and DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated

[a] Experiment input shows the experiment data input format for each method. "DE" means differentially expressed. "Change" means fold-change value or t-statistics when comparing gene/metabolite values between two phenotypes. "Measured" means the list of all the genes/metabolites measured in the experiment. "List" represents a list of gene/metabolite identifiers (e.g., symbols). "With cut-off" shows methods that take as input the list of all measured genes and, in the analysis, mark the DE genes.
[b] Pathway database shows the name of the knowledge source for biological interactions.
[c] Graph model is a characteristic that shows if the graph has one or multiple types of nodes, as well as if it is directed or undirected.
[d] The package ROntoTools (Bioconductor) allows for analysis using expression values of all genes.
[e] ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA.
[f] BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

independent studies or measurements often produce different sets of differentially expressed (DE) genes (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Ein-Dor, Zuk, & Domany, 2006; Tan et al., 2003). This makes approaches that use DE genes as input appear even more unreliable.

Usually, pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products, and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds, and edges to represent reactions that transform one or more compound(s) into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges rather than nodes, as in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally and arguably underdeveloped (only 128 Google Scholar citations to date for the original ScorePAGE paper; Rahnenfuhrer et al., 2004), and there are not many other methods available for metabolic pathway analysis. However, the topology-based analysis of signaling pathways has been very successful (over 1,200 citations to date for the two impact analysis papers mentioned above; Draghici et al., 2007; Tarca et al., 2009), and over 30 other methods have been developed since.

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays, rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.

Publicly available curated pathway databases used by the methods listed in Table 8.25.2 are KEGG (Ogata et al., 1999), NCI-PID (Schaefer et al., 2009), BioCarta (https://www.biocarta.com), WikiPathways (Pico et al., 2008), PANTHER (Mi et al., 2005), and Reactome (Joshi-Tope et al., 2005). These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks, but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG includes only about 5,000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive regulation, negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation, when the bait-prey relation is considered (top), or undirected, when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database or may accept several.
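As an illustration of the parsing step described above, the sketch below turns a list of directed interactions into an adjacency matrix. It is a generic toy example, not the parser of SPIA or of any other surveyed tool: the node names, the edge list, the function adjacency_matrix, and the convention that rows are sources and columns are targets are all assumptions made here, and a given tool may well expect the transpose or a different encoding.

```python
def adjacency_matrix(nodes, edges):
    """Build a 0/1 adjacency matrix for a directed graph.

    matrix[i][j] == 1 means there is an edge nodes[i] -> nodes[j]
    (row = source, column = target; an assumed convention).
    """
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


# Hypothetical four-component pathway: A signals to B and C, both signal to D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

matrix = adjacency_matrix(nodes, edges)
for name, row in zip(nodes, matrix):
    print(name, row)
```

A matrix of this kind discards edge types (activation, inhibition, and so on); a method that needs them would have to store one matrix per interaction type or use signed weights instead of 0/1 entries.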

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than single-type, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe "all" and "any" relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connection only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find "important" submodules, those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the research hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is re-tested on each of them.
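The moralization step defined above can be sketched in a few lines. This is a generic illustration of the textbook construction (marry the parents of each node, then drop edge directions), not code from TopologyGSA; the function moralize, the dictionary-of-children input format, and the toy DAG are assumptions made for this example.

```python
from itertools import combinations


def moralize(dag):
    """Return the moral graph of a DAG as a set of undirected edges.

    dag maps each node to the list of its children. Each undirected
    edge is represented as a frozenset of its two endpoints.
    """
    parents = {}
    undirected = set()
    for parent, children in dag.items():
        for child in children:
            # Step 1: keep every original edge, dropping its direction.
            undirected.add(frozenset((parent, child)))
            parents.setdefault(child, set()).add(parent)
    # Step 2: "marry" all pairs of parents that share a child.
    for shared_parents in parents.values():
        for a, b in combinations(sorted(shared_parents), 2):
            undirected.add(frozenset((a, b)))
    return undirected


# Hypothetical DAG: A -> C, B -> C, C -> D.
dag = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
moral = moralize(dag)
print(sorted(tuple(sorted(edge)) for edge in moral))
```

In the toy DAG, A and B are both parents of C, so moralization adds the undirected edge A-B in addition to the three original edges.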

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model, and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for the categories of scoring algorithms.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.
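The three levels can be illustrated with a deliberately simplified sketch. The specific choices below (absolute log fold-change as the node statistic, a plain sum as the pathway statistic, and a gene-label permutation test for significance) are assumptions made only for this example; each surveyed method defines its own statistics, and none is reproduced here. The data and function names are hypothetical.

```python
import random


def node_scores(log_fc):
    """Level 1: one score per node (here, the absolute log fold-change)."""
    return {gene: abs(value) for gene, value in log_fc.items()}


def pathway_score(scores, pathway_genes):
    """Level 2: aggregate node scores into one pathway score (here, a sum)."""
    return sum(scores[g] for g in pathway_genes if g in scores)


def permutation_p_value(scores, pathway_genes, n_perm=1000, seed=0):
    """Level 3: compare the observed score with random gene sets of equal size."""
    rng = random.Random(seed)
    members = [g for g in pathway_genes if g in scores]
    observed = pathway_score(scores, members)
    universe = sorted(scores)
    hits = sum(
        pathway_score(scores, rng.sample(universe, len(members))) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical log2 fold-changes for eight measured genes.
log_fc = {"g1": 2.4, "g2": -1.9, "g3": 1.7, "g4": 0.3,
          "g5": -0.2, "g6": 0.1, "g7": -0.4, "g8": 0.5}
pathway = ["g1", "g2", "g3"]

scores = node_scores(log_fc)
observed = pathway_score(scores, pathway)
p = permutation_p_value(scores, pathway)
print(f"pathway score = {observed:.2f}, permutation p-value = {p:.3f}")
```

Note that this toy version uses no topology at all at the node level; in the surveyed methods, the topology typically enters through the node statistic or through the weights used during aggregation.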

Node level scoringMost of the surveyed methods incorpo-

rate pathway topology information in the node scores. The node-level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures, (iii) probabilistic graphical models, and (iv) normalized node value (NNV).

Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in the multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian networks, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
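As a minimal sketch of two of the centrality measures described above, the following Python code computes degree centrality and the closeness sum on a small directed graph. The toy pathway, the function names, and the adjacency-list representation are illustrative assumptions and are not taken from any of the surveyed tools; betweenness and eigenvector centrality are omitted for brevity.

```python
from collections import deque


def degree_centrality(graph):
    """Count the directed edges that enter and leave each node."""
    degree = {node: len(targets) for node, targets in graph.items()}
    for targets in graph.values():
        for target in targets:
            degree[target] = degree.get(target, 0) + 1
    return degree


def shortest_distances(graph, source):
    """Breadth-first search distances from source along directed edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def closeness(graph, source):
    """Sum of shortest distances from source to every reachable node."""
    return sum(shortest_distances(graph, source).values())


# Hypothetical toy pathway: A acts on B and C, which both act on D.
toy_pathway = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
degrees = degree_centrality(toy_pathway)
closeness_a = closeness(toy_pathway, "A")
```

In this sketch, closeness is the raw sum of distances as worded in the text; individual tools may normalize or invert that sum.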

Approaches in the second category use similarity measures in their node-level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profiles across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
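The pair score described above (a similarity measure divided by the shortest path distance) can be sketched as follows. This is an illustration under stated assumptions, not the ScorePAGE or PWEA implementation: Pearson correlation is chosen as the similarity measure, the graph is treated as undirected (each edge is listed in both directions), and the gene names and expression values are hypothetical.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def path_length(neighbors, source, target):
    """Breadth-first search; returns None when target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def pair_score(neighbors, expression, gene_a, gene_b):
    """Similarity of a gene pair divided by their shortest path distance."""
    distance = path_length(neighbors, gene_a, gene_b)
    if not distance:  # same gene (distance 0) or unreachable (None)
        return None
    return pearson(expression[gene_a], expression[gene_b]) / distance


# Hypothetical chain A - B - C with made-up expression profiles.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
expression = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
score_ab = pair_score(neighbors, expression, "A", "B")
score_ac = pair_score(neighbors, expression, "A", "C")
```

Dividing by the distance means that a strongly correlated pair contributes less when the two genes are far apart in the pathway.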

Approaches in the third category incorporate the topology in the node-level scoring using a probabilistic graphical model. In this model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node-level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node-level statistics to pathway-level statistics using linear functions such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum over all genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
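One reading of the TAPPA-style non-linear score described above can be sketched as follows. This is an illustration only and not the published implementation: the node score is taken as the square root of the absolute normalized value, and the sign weight of a connected pair is assumed to be the sign of the sum of the two values. The gene names, values, and edges are hypothetical.

```python
from math import sqrt


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pathway_score(values, edges):
    """Sum, over connected pairs only, of sign * sqrt|x_i| * sqrt|x_j|.

    Pairs without an edge have weight 0 and are simply not listed.
    """
    total = 0.0
    for gene_a, gene_b in edges:
        weight = sign(values[gene_a] + values[gene_b])  # assumed sign rule
        total += weight * sqrt(abs(values[gene_a])) * sqrt(abs(values[gene_b]))
    return total


# Hypothetical normalized log expression values for one sample.
sample = {"g1": 4.0, "g2": 1.0, "g3": -9.0}
edges = [("g1", "g2"), ("g2", "g3")]
score = pathway_score(sample, edges)
```

Because the score multiplies node values pairwise, it is not a linear function of the node scores, which is what separates this group from the summation and averaging approaches.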

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes and meaningful changes that are unlikely to occur just by chance and that, therefore, are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
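The final two steps, thresholding and correction for multiple comparisons, can be sketched as follows. The text does not prescribe a particular correction; the Benjamini-Hochberg procedure is used here as one common choice, and the pathway names, p-values, and 0.05 threshold are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-pathway p-values produced by some analysis method.
pathway_p = {"pathway_a": 0.01, "pathway_b": 0.04,
             "pathway_c": 0.03, "pathway_d": 0.5}
names = list(pathway_p)
adjusted = dict(zip(names, benjamini_hochberg([pathway_p[n] for n in names])))
significant = [n for n in names if adjusted[n] <= 0.05]
```

In this example, three pathways have raw p-values below 0.05, but only one remains below the threshold after adjustment.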

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
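For the hypergeometric case mentioned above, the p-value is the probability of observing at least as many DE genes on the pathway as were actually seen, when the pathway genes are drawn at random from all measured genes. A minimal sketch follows; the function name and the gene counts are hypothetical, and the calculation inherits the independence assumption criticized in the text.

```python
from math import comb


def hypergeom_upper_tail(total_genes, total_de, pathway_size, pathway_de):
    """P(X >= pathway_de) for pathway_size genes drawn without replacement."""
    denominator = comb(total_genes, pathway_size)
    upper = min(pathway_size, total_de)
    numerator = sum(
        comb(total_de, k) * comb(total_genes - total_de, pathway_size - k)
        for k in range(pathway_de, upper + 1)
    )
    return numerator / denominator


# Hypothetical counts: 20 measured genes, 5 DE genes,
# a pathway of 4 genes of which 2 are DE.
p_value = hypergeom_upper_tail(20, 5, 4, 2)
```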

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume that the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable assigned to an edge is the probability that up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probabilities follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
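The distributional assumptions above can be written compactly. The notation below (counts x, number of trials n, hyperparameters alpha and beta, and K categories) is introduced here for illustration and is not taken from the BPA or BAPA-IGGFD papers; it shows only the standard conjugate relationships that the text refers to.

```latex
% Binomial variable with a beta-distributed success probability
\[
X \mid \theta \sim \mathrm{Binomial}(n, \theta), \qquad
\theta \sim \mathrm{Beta}(\alpha, \beta)
\;\Longrightarrow\;
\theta \mid X = x \sim \mathrm{Beta}(\alpha + x,\; \beta + n - x)
\]

% Multinomial vector with Dirichlet-distributed probabilities (the BPA-style assumption)
\[
(X_1, \dots, X_K) \mid \boldsymbol{\theta} \sim \mathrm{Multinomial}(n, \boldsymbol{\theta}), \qquad
\boldsymbol{\theta} \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)
\;\Longrightarrow\;
\boldsymbol{\theta} \mid \mathbf{x} \sim \mathrm{Dirichlet}(\alpha_1 + x_1, \dots, \alpha_K + x_K)
\]

% Independence across variables (the BAPA-IGGFD-style assumption)
\[
p(x_1, \dots, x_m) = \prod_{i=1}^{m} p(x_i)
\]
```

The last line makes the criticized assumption explicit: the joint distribution factorizes only if the edge variables are independent, which is questionable for edges that share a node.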

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses Petri nets (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and the Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
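To illustrate how edge-level p-values can be combined into one pathway-level statistic, the sketch below implements the classical, unweighted Fisher method. DRAGEN uses a weighted variant whose weights are not reproduced here, so this is a simplified stand-in; the function name and input values are hypothetical, and the p-values are assumed to be independent and strictly positive.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent p-values with the unweighted Fisher method."""
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    # Under the null, the statistic is chi-square with 2k degrees of freedom.
    # For even degrees of freedom the survival function has a closed form:
    # exp(-s/2) * sum_{i=0}^{k-1} (s/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return exp(-half) * total


# Hypothetical p-values for the edges of one pathway.
edge_p_values = [0.05, 0.05]
combined = fisher_combined_p(edge_p_values)
```

Note that DRAGEN and GGEA assess the final pathway p-value by permutation, as stated above, rather than by this analytic null distribution.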

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allow us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000 despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods such as the t-test inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis: Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
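The random relabeling procedure described above can be sketched as follows. This illustrates the sampling design only, not the authors' pipeline: the sample identifiers are placeholders, the function names are hypothetical, and the number of repeats is reduced from 10,000 to 5 so that the example runs instantly.

```python
import random


def random_split(sample_ids, n_control, n_disease, rng):
    """Draw disjoint pseudo-control and pseudo-disease groups at random."""
    chosen = rng.sample(sample_ids, n_control + n_disease)
    return chosen[:n_control], chosen[n_control:]


def simulate_designs(sample_ids, designs, repeats, seed=0):
    """Generate `repeats` random splits for each (n_control, n_disease) design."""
    rng = random.Random(seed)
    datasets = []
    for n_control, n_disease in designs:
        for _ in range(repeats):
            datasets.append(random_split(sample_ids, n_control, n_disease, rng))
    return datasets


# Placeholder identifiers for the 140 pooled control samples.
samples = [f"control_{i}" for i in range(140)]
# The four designs named in the text: (controls, disease) group sizes.
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]
datasets = simulate_designs(samples, designs, repeats=5)
```

Each simulated dataset would then be passed to a pathway analysis method. Because every sample is a true control, any pathway reported as significant is a false positive by construction.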

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population, which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be uniformly distributed between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
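One simple way to quantify how far a set of null p-values departs from uniformity is the Kolmogorov-Smirnov distance to the Uniform(0, 1) distribution. The sketch below is an illustration and not the analysis used to produce the figure; the two p-value lists are synthetic, with the second constructed to be biased towards zero.

```python
def ks_uniform_statistic(p_values):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    ordered = sorted(p_values)
    n = len(ordered)
    gap = 0.0
    for i, p in enumerate(ordered, start=1):
        # Compare the uniform CDF (equal to p) with the empirical CDF
        # just after and just before the i-th ordered p-value.
        gap = max(gap, i / n - p, p - (i - 1) / n)
    return gap


# Synthetic p-values: an evenly spread set and a set skewed towards zero.
uniform_like = [(i + 0.5) / 100 for i in range(100)]
biased_to_zero = [p ** 3 for p in uniform_like]
d_uniform = ks_uniform_statistic(uniform_like)
d_biased = ks_uniform_statistic(biased_to_zero)
```

A distance near zero is consistent with a uniform null distribution, while a large distance flags a pathway whose p-values are systematically pushed towards zero or one.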

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex bio-molecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation, and aiming to describe the same phenomena, may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways," in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the available pathway that describes the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other, more recent papers.
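The target-pathway criterion can be sketched as a small ranking function. The pathway names and p-values below are made up for illustration and do not come from Tarca et al. (2013) or from any real analysis; the function name is hypothetical.

```python
def target_rank(p_values, target):
    """1-based rank of the target pathway when sorted by ascending p-value."""
    ordered = sorted(p_values, key=p_values.get)
    return ordered.index(target) + 1


# Hypothetical outputs of two methods on a colon cancer dataset.
method_a = {"Colorectal cancer": 0.001, "Apoptosis": 0.02, "Cell cycle": 0.3}
method_b = {"Colorectal cancer": 0.04, "Apoptosis": 0.01, "Cell cycle": 0.2}
rank_a = target_rank(method_a, "Colorectal cancer")
rank_b = target_rank(method_b, "Colorectal cancer")
```

Under this criterion, the hypothetical method A would be preferred on this dataset because it gives the target pathway both a lower rank and a lower p-value. In a benchmark, such ranks would be collected across all datasets before comparing methods.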

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated not with one target pathway only, but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer. Specifically, 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. In Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions – applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0 – making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1


pathways. Figure 8.25.1 shows an example pathway, named ERBB signaling pathway, in which the nodes represent genes and compounds, while the edges represent the known interactions between the compounds. Gene set analysis approaches are not able to account for the topological order of genes, nor are they able to explain the signal propagation and mechanisms of the pathway. These approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discoveries. In contrast, network-based pathway analysis can distinguish between this pathway and any other pathway with the same proportion of DE genes.
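The point that gene set approaches ignore topology can be illustrated with a minimal sketch. It assumes the gene set method is a plain over-representation analysis based on the hypergeometric distribution, which is only one of several gene set approaches. The gene names, edge lists, and counts are hypothetical. Because the p-value depends only on counts, two pathways that contain the same genes but are wired differently receive an identical score.

```python
from math import comb


def ora_p_value(n_measured, n_de, pathway_size, pathway_de):
    """Upper-tail hypergeometric p-value, P(X >= pathway_de).

    Only counts enter the formula; the edges of the pathway never do.
    """
    max_k = min(n_de, pathway_size)
    total = comb(n_measured, n_de)
    tail = sum(
        comb(pathway_size, k) * comb(n_measured - pathway_size, n_de - k)
        for k in range(pathway_de, max_k + 1)
    )
    return tail / total


def genes_of(edges):
    """Collect the set of genes that appear in a list of (source, target) edges."""
    return {gene for edge in edges for gene in edge}


# Two hypothetical pathways over the same four genes, wired differently.
cascade_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]  # linear cascade
hub_edges = [("g4", "g1"), ("g4", "g2"), ("g4", "g3")]      # g4 acts as a hub

de_genes = {"g1", "g4"}          # hypothetical DE genes
n_measured, n_de = 1000, 50      # hypothetical experiment-wide counts

p_cascade = ora_p_value(
    n_measured, n_de, len(genes_of(cascade_edges)), len(genes_of(cascade_edges) & de_genes)
)
p_hub = ora_p_value(
    n_measured, n_de, len(genes_of(hub_edges)), len(genes_of(hub_edges) & de_genes)
)

print(p_cascade == p_hub)  # the rewiring is invisible to this statistic
```

A topology-based method would instead use the edge lists themselves, so the cascade and the hub layout could receive different scores.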

Here, we provide a survey of 34 network-based methods developed for pathway analysis. Our survey of commercial tools for pathway analysis found iPathwayGuide (Advaita Corporation, https://www.advaitabio.com) and MetaCore (Thomson Reuters, https://www.thomsonreuters.com) to be topology based. Other commercial tools, such as Ingenuity Pathway Analysis (https://www.ingenuity.com) or Genomatix (https://www.genomatix.de), do not use pathway topology information and thus are not included in this review. The existence of only two commercial tools is more evidence of the challenge faced by the developers of such methods due to lack of standards.

In this document, we categorize and compare the methods based on the following criteria: the type of input the method accepts, the graph model, and the statistical approaches used to evaluate gene and pathway changes. Under the heading “Experiment Input and Pathway Databases” below, we describe the types of input the surveyed methods use. Under the heading “Graph Models,” we provide details regarding the graph models used for the biological networks. Under “Pathway Scoring Strategies,” we discuss the statistical methods used by the surveyed methods to assess gene and pathway changes between two phenotypes. Under “Challenges in Pathway Analysis,” we discuss a number of outstanding challenges that need to be addressed in order to improve the reliability, as well as the relevance, of the next-generation pathway analysis approaches. We discuss the key elements and limitations of the surveyed methods without going into details of each technique. For a more detailed review and technical descriptions of network-based pathway analysis methods, please refer to Mitrea et al. (2013) or referenced manuscripts (Table 8.25.1).

NETWORK-BASED PATHWAY ANALYSIS

The term pathway analysis is used in a very broad context in the literature, including biological network construction and inference. Here, we focus on methods that are able to exploit biological knowledge in public repositories, rather than on network inference approaches that attempt to infer or reconstruct pathways from molecular measurements. Table 8.25.1 shows the list of 34 pathway analysis approaches, together with their availability and licensing. In this section, we start by describing the availability of each method before discussing the input format, graph model, and underlying statistical approaches.

Software Availability and Implementation

We often think that the main strength of an approach lies in its novelty and algorithm efficiency. However, the implementation and availability of a tool have become increasingly important for several reasons. First, software availability and version control are crucial for reproducing the experimental results that were used to assess the performance of the approach (Sandve, Nekrutenko, Taylor, & Hovig, 2013). For this reason, many journals request authors to make their software and data available before accepting methods articles. Second, if a software application is not ready-to-run, it is very unlikely that the intended audience (mostly life scientists) will invest the time to understand and implement complex algorithms. Practicality, user-friendliness, output format, and type of interface are all to be considered. Depending on the desired availability and intended audience, a software package may be implemented as standalone or Web based. Among the 34 approaches, there are 30 that are available either as a standalone software package or Web service.

Typically, standalone tools need to be installed on local machines or servers, which often requires some administrative skills. Most standalone tools depend on full or partial copies of public pathway databases, stored locally, and need to be updated periodically. Advantages of standalone tools include: (i) instant availability that does not require Internet access, and (ii) the security and privacy of the experimental data. Web-based tools, on the other hand, run the analyses on a remote server, providing computational power and a graphical interface. The major advantage of Web-based tools is that they are user-friendly and do not require a separate local installation. From the accessibility perspective, Web-based tools have the advantage of being available from any location, as long as there is an Internet connection and a browser available. Also, the update is almost transparent to the client. This makes the user's task easy and enables collaboration, since users all over the world can utilize the same method without the burden of installing it or keeping it up-to-date. There are methods that provide both Web-based and standalone implementations.

HIPAA compliance may also be a factor in certain applications that involve data coming from patients, or data linked to other clinical data or clinical records. Currently, iPathwayGuide is the only available tool that can do topological pathway analysis and is HIPAA compliant.

The programming language and style used for implementation also play an important role in the acceptance of a method. Software tools that are neatly implemented and packaged are more appealing compared to those that do not have ready-to-use implementations. Many of the methods are implemented in the R programming language and are available as software packages from Bioconductor, CRAN, or the author's Web site. Their popularity among biologists and bioinformaticians is due to the fact that many bioinformatics-dedicated packages are available in R. In addition, the rigorous review procedure provided by Bioconductor and CRAN makes the software packages more standardized and reliable.

Experiment Input and Pathway Databases

A pathway analysis approach typically requires two types of input: experiment data and known biological networks. Experiment data is usually collected from high-throughput experiments that compare a condition phenotype to a control phenotype. A condition phenotype can be a disease, a drug treatment, or the knock-out (KO) of a gene, while a control phenotype can be a healthy state, a different disease or drug treatment, or wild-type (non-KO) samples. Experimental data can be obtained from multiple technologies that produce different types of data: gene expression, protein abundance, metabolite concentration, miRNA expression, etc. Biological networks or pathways are often represented in the form of graphs that capture our current knowledge about the interactions of genes, proteins, metabolites, or compounds in an organism. The pathway data is accumulated, updated, and refined by amassing knowledge from scientific literature describing individual interactions or high-throughput experiment results.
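As a rough illustration of how a pathway might be held in memory as a graph, the sketch below stores a small directed network as an adjacency list and collects everything downstream of a node. The edge list is a hypothetical fragment chosen for illustration, not a curated database record, and the function names are assumptions of this sketch rather than part of any surveyed tool.

```python
from collections import defaultdict, deque

# Hypothetical pathway fragment: (source gene, target gene, interaction type).
edges = [
    ("EGFR", "GRB2", "activation"),
    ("GRB2", "SOS1", "activation"),
    ("SOS1", "KRAS", "activation"),
    ("KRAS", "RAF1", "activation"),
]


def build_adjacency(edges):
    """Map each source node to the list of (target, interaction type) pairs."""
    adjacency = defaultdict(list)
    for source, target, kind in edges:
        adjacency[source].append((target, kind))
    return adjacency


def downstream(adjacency, start):
    """Return all nodes reachable from start by following edge direction (BFS)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target, _kind in adjacency.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


adjacency = build_adjacency(edges)
print(sorted(downstream(adjacency, "SOS1")))
```

Keeping the direction and the interaction type on each edge is what lets a topology-aware method reason about which genes can be affected by a change at a given node.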

Table 8.25.2 shows a summary of input format and mathematical modeling of the surveyed approaches. Most pathway analysis methods analyze data from high-throughput experiments, such as microarrays, next-generation sequencing, or proteomics. They accept either a list of gene IDs or a list of such gene IDs associated with measured changes. These changes could be measured with different technologies and therefore can serve as proxies for different biochemical entities. For instance, one could use gene expression changes measured with microarrays, or protein levels measured with a proteomic approach, etc.

Different analysis methods use different input formats. Many methods accept a list of all genes considered in the experiment, together with their expression values. Some analysis methods select a subset of genes considered to be differentially expressed (DE) based on a predefined cut-off. The cut-off is typically applied on fold-change, p-value, or both. These methods use the list of DE genes and their corresponding statistics (fold-change, p-value) as input. Other methods use only the list of DE genes, without corresponding expression values, because their scoring methods are based only on the relative positions of the genes in the graph.
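The three input formats described above can be sketched as follows. The gene symbols are real, but the fold-change and p-value numbers are made up, and the cut-offs (absolute log2 fold-change of at least 1 and p-value of at most 0.05) are hypothetical choices rather than values prescribed by any surveyed method.

```python
# Hypothetical measurements: gene -> (log2 fold-change, p-value).
measured = {
    "EGFR": (2.1, 0.001),
    "ERBB2": (1.4, 0.030),
    "GRB2": (0.6, 0.020),
    "SOS1": (-0.2, 0.700),
    "KRAS": (-1.3, 0.080),
}


def select_de(measured, min_abs_log_fc=1.0, max_p=0.05):
    """Keep genes that pass both the fold-change and the p-value cut-off."""
    return {
        gene: (log_fc, p)
        for gene, (log_fc, p) in measured.items()
        if abs(log_fc) >= min_abs_log_fc and p <= max_p
    }


# Format 1: all measured genes with their change values (no cut-off applied).
all_genes_input = {gene: log_fc for gene, (log_fc, _p) in measured.items()}

# Format 2: DE genes together with their statistics.
de_with_stats = select_de(measured)

# Format 3: DE gene identifiers only.
de_list_only = sorted(de_with_stats)

print(de_list_only)
```

In this toy data, GRB2 passes the p-value cut-off but not the fold-change cut-off, and KRAS does the opposite, so both are dropped from the second and third formats while remaining visible in the first.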

Methods which use cut-offs are sensitive tothe chosen threshold value because a smallchange in the cut-off may drastically changethe number of selected genes (Nam amp Kim2008) In addition they typically use the mostsignificant genes and discard the rest whoseweaker but coordinated changes may also havesignificant impact on pathways Genes withmoderate differential expression may be losteven though they might be important play-ers in the impacted pathways (Ben-ShaulBergman amp Soreq 2005) Furthermore thegenes included in the set of DE genes canvary dramatically if the selection methods arechanged Hence the results of pathway anal-yses based on DE genes may be vastly differ-ent depending on both the selection method aswell as the threshold value (Pan Lih amp Co-hen 2005) Furthermore for the same disease

AnalyzingMolecularInteractions

8257

Current Protocols in Bioinformatics Supplement 61

Table 8252 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the SurveyedMethods is Presented

Method Experiment inputa Pathway databaseb Graph modelc Pathway scoring

Signaling pathway analysis using gene expression

MetaCore DE genes Proprietary canonicalpathway genome-scalenetwork

Single-type directed Hierarchicallyaggregated

Pathway-Express DE genes change ormeasured geneschanged

KEGG signaling Single-type directed Hierarchicallyaggregated

PathOlogist Measured genesexpression

KEGG Multi-type directed Hierarchicallyaggregated

iPathwayGuide DE genes change ormeasured geneschange

KEGG signalingReactome NCIBioCarta

Single-type directed Hierarchicallyaggregated

SPIA DE genes change KEGG signaling Single-type directed Hierarchicallyaggregated

NetGSA Measured genesexpression

KEGG signaling Single-type directed Multivariate analysis

PWEA Measured genesexpression

YeastNet Single-typeundirected

Hierarchicallyaggregated

TopoGSA DE genes PPI network KEGG Single-typeundirected

Hierarchicallyaggregated

TopologyGSA Measured genesexpression

NCI-PID Single-typeundirected

Multivariate analysis

DEGraph Measured genesexpression

KEGG Single-typeundirected

Multivariate analysis

GGEA Measured genesexpression

KEGG Single-type directed Aggregate fuzzysimilarity

BPA Measured genesexpressionmdashwithcut-off

NCI-PID Single-type DAG Bayesian network

GANPA DE genes change ormeasured genesexpression

PPI network KEGGReactome NCI-PIDHumanCyc

Single-typeundirected

Hierarchicallyaggregated

ROntoTools DE genes change ormeasured geneexpression

KEGG signaling Single-type directed Hierarchicallyaggregated

BAPA-IGGFD Measured genesexpression - withcut-off

Literature-basedinteraction databaseKEGG WikiPathwaysReactome MSigDBGO BP PANTHERconstructed geneassociation networkfrom PPIsco-annotation in GOBiological Process (BP)and co-expression inmicroarray data

Single-type DAG Bayesian network

continued

8258

Supplement 61 Current Protocols in Bioinformatics

Table 8252 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the SurveyedMethods is Presented continued

Method Experiment inputa Pathway databaseb Graph modelc Pathway scoring

CePa DE genes expressionor measured genesexpression

NCI-PID Single-type directed Hierarchicallyaggregated

THINK-Back-DS DE genes changemeasured genesexpression

KEGG PANTHERBioCarta ReactomeGenMAPP

Single-type directed Hierarchicallyaggregated

TBScore DE genes change KEGG signaling Single-type directed Hierarchicallyaggregated

ACST Measured genesexpression

KEGG signaling Single-type directed Hierarchicallyaggregated

EnrichNet DE genes list PPI network KEGGBioCartaWikiPathwaysReactome NCI-PIDInterPro GO withSTRING 90

Single-typeundirected

Hierarchicallyaggregated

clipper Measured genesexpression

BioCarta KEGGNCI-PID Reactome

Single-type directed Multivariate analysis

DEAP Measured genesexpression

KEGG Reactome Single-type directed Hierarchicallyaggregated

DRAGEN Measured genesexpression

RegulonDB M3DHTRIdb ENCODEMSigDB

Single-type directed Linear regression

ToPASeqe Measured genesexpression

KEGG Single-type directed Hierarchical ampmultivariate

pDis DE genes change ormeasured geneschanged

KEGG signaling Single-type directed Hierarchicallyaggregated

SPATIAL DE genes change KEGG signaling Single-type directed Hierarchicallyaggregated

BLMAf Measured genesexpression

KEGG signaling Single-type directed Hierarchicallyaggregated

Signaling pathway analysis using multiple types of data

PARADIGM Measured genesexpression copynumber and proteinslevels

Constructed PPInetworks from MIPSDIP BIND HPRDIntAct and BioGRID

Multi-type directed Hierarchicallyaggregated

microGraphite Measured geneexpression andmiRNA expression

BioCarta KEGGNCI-PID Reactome

Single-type directed Multivariate analysis

mirIntegrator DE genes changeand DE miRNAchange

KEGG signalingmiRTarBase

Single-type directed Hierarchicallyaggregated

Metabolic pathway analysis

ScorePAGE Measured genesexpression

KEGG metabolic Single-typeundirected

Hierarchicallyaggregated

continued

8259

Current Protocols in Bioinformatics Supplement 61

Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods is Presented (continued)

Method | Experiment input (a) | Pathway database (b) | Graph model (c) | Pathway scoring
TAPPA | Measured genes expression | KEGG metabolic | Single-type undirected | Hierarchically aggregated
MetPA | DE metabolites change | KEGG metabolic | Single-type directed | Hierarchically aggregated
MetaboAnalyst | DE genes change and DE metabolites change | KEGG metabolic | Single-type directed | Hierarchically aggregated

(a) Experiment input shows the experiment data input format for each method. "DE" means differentially expressed. "Change" means fold-change value or t-statistics when comparing gene/metabolite values between two phenotypes. "Measured" means the list of all the genes/metabolites measured in the experiment. "List" represents a list of gene/metabolite identifiers (e.g., symbols). "With cut-off" shows methods that take as input the list of all measured genes and, in the analysis, mark the DE genes.
(b) Pathway database shows the name of the knowledge source for biological interactions.
(c) Graph model is a characteristic that shows if the graph has one or multiple types of nodes, as well as if it is directed or undirected.
(d) The package ROntoTools (Bioconductor) allows for analysis using expression values of all genes.
(e) ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA.
(f) BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

independent studies or measurements often produce different sets of differentially expressed (DE) genes (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Ein-Dor, Zuk, & Domany, 2006; Tan et al., 2003). This makes approaches that use DE genes as input appear even more unreliable.

Usually, pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds and edges to represent reactions that transform one or more compounds into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges rather than with nodes, as they are in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally and arguably underdeveloped (only 128 Google Scholar citations to date for the original ScorePAGE paper; Rahnenführer et al., 2004), and there are not many other methods available for metabolic pathway analysis. However, the topology-based analysis of signaling pathways has been very successful (over 1,200 citations to date for the two impact analysis papers mentioned above; Draghici et al., 2007; Tarca et al., 2009), and over 30 other methods have been developed since.

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.

Publicly available curated pathway databases used by methods listed in Table 8.25.2 are KEGG (Ogata et al., 1999), NCI-PID (Schaefer et al., 2009), BioCarta (https://www.biocarta.com), WikiPathways (Pico et al., 2008), PANTHER (Mi et al., 2005), and Reactome (Joshi-Tope et al., 2005). These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG includes only about 5,000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive or negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation, when the bait-prey relation is considered (top), or undirected when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database or may accept several.
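As an illustration of the parsing step described above, the following minimal Python sketch turns an edge list into an adjacency matrix for a directed graph. The function name, the toy gene names, and the row-to-column orientation (row = source, column = target) are assumptions made for this example; they are not taken from SPIA or any other surveyed package, whose conventions may differ.

```python
def adjacency_matrix(nodes, edges):
    """Build a dense adjacency matrix for a directed graph.

    nodes: list of node identifiers (e.g., gene symbols).
    edges: iterable of (source, target) pairs.
    Entry [i][j] is 1 when there is an edge from nodes[i] to nodes[j]
    (this orientation is an assumption of the sketch).
    """
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


# Hypothetical three-gene cascade: A signals to B, B signals to C.
toy_nodes = ["A", "B", "C"]
toy_edges = [("A", "B"), ("B", "C")]
toy_matrix = adjacency_matrix(toy_nodes, toy_edges)
```

A dense matrix is convenient for small pathways; for genome-wide networks a sparse representation would likely be preferable.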

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than single-type, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe "all" and "any" relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connection only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find "important" submodules, those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying) and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the research hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is re-tested on each of them.
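The moralization step defined above can be sketched in a few lines of Python. This is only an illustration of the graph operation itself (marry the parents, then drop directions), written under the assumption that the DAG is supplied as (parent, child) pairs; it is not TopologyGSA code, and the function and variable names are hypothetical.

```python
from itertools import combinations


def moral_graph(dag_edges):
    """Return the undirected edge set of the moral graph of a DAG.

    dag_edges: iterable of (parent, child) pairs.
    Each undirected edge is represented as a frozenset of two nodes.
    """
    parents = {}
    undirected = set()
    for parent, child in dag_edges:
        parents.setdefault(child, set()).add(parent)
        undirected.add(frozenset((parent, child)))  # drop the direction
    for shared in parents.values():
        # "Marry" every pair of parents that share a child.
        for a, b in combinations(sorted(shared), 2):
            undirected.add(frozenset((a, b)))
    return undirected


# Toy DAG: A and B are both parents of C; C is the parent of D.
toy_dag = [("A", "C"), ("B", "C"), ("C", "D")]
toy_moral = moral_graph(toy_dag)
```

In the toy DAG, A and B share the child C, so the moral graph gains the edge A-B in addition to the three original edges.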

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for scoring algorithm categories.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.

Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian network, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

Node level scoring

Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures, (iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the network adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
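To make two of the centrality measures described above concrete, here is a minimal Python sketch that counts the edges touching each node (degree) and sums shortest-path distances from each node (the quantity underlying closeness). The toy network and all function names are hypothetical. Treating edges as undirected for the distance computation is a simplifying assumption of this sketch, and none of the surveyed methods is claimed to compute its scores exactly this way.

```python
from collections import deque


def degree_centrality(nodes, edges):
    """Count the directed edges that enter or leave each node."""
    degree = {node: 0 for node in nodes}
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


def distance_sums(nodes, edges):
    """Sum of shortest-path lengths from each node to every reachable node.

    Edges are treated as undirected here (an assumption of this sketch).
    """
    neighbors = {node: set() for node in nodes}
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    sums = {}
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            current = queue.popleft()
            for nxt in neighbors[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        sums[start] = sum(dist.values())
    return sums


# Toy network: B is a hub connected to A, C, and D.
toy_nodes = ["A", "B", "C", "D"]
toy_edges = [("A", "B"), ("B", "C"), ("B", "D")]
toy_degree = degree_centrality(toy_nodes, toy_edges)
toy_sums = distance_sums(toy_nodes, toy_edges)
```

In the toy network, the hub B has the highest degree and the smallest distance sum, so both measures rank it as the most central node.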

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profile across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
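The pair score described above (a similarity measure divided by the shortest path distance) can be sketched as follows. Pearson correlation is used as the similarity measure purely for illustration; the expression profiles, the network, and the function names are hypothetical, and this is not the ScorePAGE or PWEA implementation.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def shortest_path_length(neighbors, start, goal):
    """Breadth-first search distance, or None if goal is unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return dist[current]
        for nxt in neighbors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return None


def pair_score(expression, neighbors, gene_a, gene_b):
    """Similarity divided by shortest path distance for one gene pair."""
    distance = shortest_path_length(neighbors, gene_a, gene_b)
    if not distance:  # unreachable pair, or the same gene twice
        return None
    return pearson(expression[gene_a], expression[gene_b]) / distance


# Hypothetical profiles over four samples and a chain A - B - C.
toy_expression = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
toy_neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
score_ab = pair_score(toy_expression, toy_neighbors, "A", "B")
score_ac = pair_score(toy_expression, toy_neighbors, "A", "C")
```

Because A and C are two steps apart, their perfect negative correlation is halved, which shows how distance down-weights the similarity of remote pairs.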

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions, such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all its genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
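The TAPPA-style aggregation described above can be sketched as a sum over connected node pairs, where each term is the product of the two node scores (square roots of absolute node values) multiplied by a sign weight. Using the sign of the summed node values as the weight is one plausible reading of "joint up- or down-regulation" and is an assumption of this sketch, as are the function names and toy values; consult the TAPPA publication for the exact definition.

```python
from math import sqrt


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pathway_connectivity_score(values, edges):
    """Non-linear pathway score for one sample.

    values: dict mapping gene -> normalized log expression (node value).
    edges: iterable of connected gene pairs; pairs without an edge have
           weight 0 and therefore never appear in the sum.
    """
    total = 0.0
    for a, b in edges:
        weight = sign(values[a] + values[b])  # assumed sign weight
        total += weight * sqrt(abs(values[a])) * sqrt(abs(values[b]))
    return total


# Hypothetical node values for one sample on a chain A - B - C.
toy_values = {"A": 4.0, "B": 1.0, "C": -9.0}
toy_pathway_edges = [("A", "B"), ("B", "C")]
toy_score = pathway_connectivity_score(toy_values, toy_pathway_edges)
```

In the toy sample, the pair A-B contributes +2 and the pair B-C contributes -3, giving a pathway score of -1.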

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes and meaningful changes that are unlikely to occur just by chance and that, therefore, are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
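The text above does not prescribe a particular correction for multiple comparisons. As one common choice, offered here only as an example, the Benjamini-Hochberg false discovery rate adjustment can be sketched in Python as follows; the function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four pathways.
toy_p = [0.01, 0.04, 0.03, 0.20]
toy_adjusted = benjamini_hochberg(toy_p)
```

A pathway would then be called significant when its adjusted p-value falls below the user-defined threshold.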

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
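When the pathway score is the number of DE genes on the pathway, the hypergeometric p-value mentioned above is the upper tail probability of drawing at least that many DE genes. The following is a minimal sketch using only the Python standard library; the parameter names and the toy numbers are illustrative assumptions, and the independence caveat raised above applies to it in full.

```python
from math import comb


def hypergeometric_p_value(total_genes, de_genes, pathway_size, de_in_pathway):
    """Upper-tail probability P(X >= de_in_pathway).

    X is the number of DE genes among pathway_size genes drawn without
    replacement from total_genes genes, of which de_genes are DE.
    """
    upper = min(de_genes, pathway_size)
    denominator = comb(total_genes, pathway_size)
    tail = sum(
        comb(de_genes, k) * comb(total_genes - de_genes, pathway_size - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / denominator


# Toy numbers: 10 measured genes, 3 DE, pathway of 4 genes with 2 DE.
toy_ora_p = hypergeometric_p_value(10, 3, 4, 2)
```

For the toy numbers, the tail is (3 x 21 + 1 x 7) / 210 = 1/3.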

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable assigned to an edge is the probability that up- or down-regulation of the genes at both ends of an interaction are concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probability follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
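The binomial-with-beta-prior building block mentioned above has a closed-form marginal likelihood (the beta-binomial), and under an independence assumption the joint value is simply a product over variables. The sketch below illustrates these two generic facts only; it is not the BPA or BAPA-IGGFD implementation, and the function names and the default uniform prior (alpha = beta = 1) are assumptions made for the example.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_likelihood(successes, trials, alpha=1.0, beta=1.0):
    """Marginal probability of `successes` in `trials` binomial draws
    when the success probability has a Beta(alpha, beta) prior."""
    log_choose = (lgamma(trials + 1) - lgamma(successes + 1)
                  - lgamma(trials - successes + 1))
    return exp(log_choose
               + log_beta(successes + alpha, trials - successes + beta)
               - log_beta(alpha, beta))


def independent_joint(likelihoods):
    """Joint value under independence: the product of the individual terms."""
    joint = 1.0
    for value in likelihoods:
        joint *= value
    return joint


# With a uniform prior, every count 0..n is equally likely: 1 / (n + 1).
toy_marginal = beta_binomial_likelihood(2, 4)
toy_joint = independent_joint([toy_marginal, 0.5])
```

The product in `independent_joint` is exactly the step that becomes questionable when edges share nodes, as noted in the text.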

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses Petri Net (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and Petri net with fuzzy logic (PNFL; Küffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
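For reference, the classical (unweighted) Fisher's method combines k independent p-values through the statistic X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below implements only this unweighted form, using the closed-form chi-square tail for even degrees of freedom; the weighting scheme used by DRAGEN is not reproduced here, and the function name is hypothetical.

```python
from math import exp, log


def fisher_combined_p_value(p_values):
    """Combine independent p-values with the unweighted Fisher's method.

    X = -2 * sum(ln p_i) is chi-square with 2k degrees of freedom; for
    even degrees of freedom the upper tail has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    k = len(p_values)
    half_statistic = -sum(log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_statistic / i
        total += term
    return exp(-half_statistic) * total


# Two hypothetical edge-level p-values.
toy_combined = fisher_combined_p_value([0.5, 0.5])
```

A quick sanity check on the formula is that a single p-value is returned unchanged.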

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000, despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays, such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations, so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).
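The uniformity property described above can be checked empirically on synthetic data. The sketch below draws two groups from the same normal distribution many times and collects the resulting two-sided p-values. To stay within the Python standard library it substitutes a z-test with known variance for the t-test mentioned in the text; that substitution, the group size, the number of simulations, and all names are assumptions of this example.

```python
import random
from math import erfc, sqrt


def z_test_p_value(group_a, group_b, sigma=1.0):
    """Two-sided p-value for a difference in means with known sigma."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    standard_error = sigma * sqrt(1 / len(group_a) + 1 / len(group_b))
    z = (mean_a - mean_b) / standard_error
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))


def simulate_null_p_values(n_simulations=2000, group_size=10, seed=0):
    """P-values from comparing two groups drawn from the same N(0, 1)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_simulations):
        a = [rng.gauss(0, 1) for _ in range(group_size)]
        b = [rng.gauss(0, 1) for _ in range(group_size)]
        p_values.append(z_test_p_value(a, b))
    return p_values


null_p = simulate_null_p_values()
fraction_below_005 = sum(p < 0.05 for p in null_p) / len(null_p)
mean_null_p = sum(null_p) / len(null_p)
```

When the test assumptions hold, roughly 5% of the null p-values should fall below 0.05 and their mean should be close to 0.5; marked departures from this pattern are the kind of bias discussed in this section.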

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis. Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
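The random relabeling procedure described above can be sketched as follows. The sample identifiers are placeholders standing in for the 140 pooled control samples, and the function name, the seed, and the small number of simulated datasets are assumptions of this example, not the authors' pipeline.

```python
import random


def random_splits(sample_ids, n_control, n_disease, n_datasets, seed=0):
    """Randomly assign pooled control samples to two mock phenotypes.

    Returns a list of (control_group, disease_group) pairs; within each
    pair the two groups are disjoint.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_datasets):
        chosen = rng.sample(sample_ids, n_control + n_disease)
        splits.append((chosen[:n_control], chosen[n_control:]))
    return splits


# Placeholder identifiers for the 140 pooled control samples.
pooled = [f"control_{i}" for i in range(140)]
toy_splits = random_splits(pooled, n_control=10, n_disease=20, n_datasets=5)
```

Each simulated dataset would then be passed to the pathway analysis method under study, and the resulting pathway p-values collected to form the empirical null distribution.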

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the p-values pooled from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
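One way to quantify this kind of bias for a single pathway is to compare its empirical null p-values with the uniform distribution. The sketch below is an illustration under stated assumptions, not the procedure used to produce the figure: it computes a Kolmogorov-Smirnov-style distance from Uniform(0, 1) and the rejection rate at a 0.05 threshold. The "biased" p-values are synthetic, produced by transforming uniform draws, and all names are hypothetical.

```python
import random


def ks_distance_from_uniform(pvalues):
    """Largest gap between the empirical CDF of the p-values and the Uniform(0, 1) CDF."""
    ordered = sorted(pvalues)
    n = len(ordered)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(ordered))


def rejection_rate(pvalues, alpha=0.05):
    """Fraction of null p-values below alpha, i.e. the empirical false positive rate."""
    return sum(p < alpha for p in pvalues) / len(pvalues)


rng = random.Random(1)
uniform_null = [rng.random() for _ in range(5000)]

# Synthetic stand-ins for pathways with skewed null distributions.
biased_to_zero = [u ** 3 for u in uniform_null]        # mass piled up near 0
biased_to_one = [u ** (1 / 3) for u in uniform_null]   # mass piled up near 1

for label, pvalues in [
    ("uniform", uniform_null),
    ("biased towards zero", biased_to_zero),
    ("biased towards one", biased_to_one),
]:
    print(label, ks_distance_from_uniform(pvalues), rejection_rate(pvalues))
```

A pathway whose null p-values are biased towards zero rejects far more often than the nominal 5%, while one biased towards one almost never rejects, which matches the false positive and false negative behavior described above.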

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example, A1-A3, are biased towards zero, and the lower three distributions, for example, A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex bio-molecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces “process nodes” to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.
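The difference between the two graph models can be made concrete with a toy example. The sketch below only illustrates the two modeling styles and is not the actual KEGG or NCI-PID schema: the class, the function names, and the molecules A, B, and C are hypothetical. In the first model the interaction type is stored on the directed edge itself; in the second, each interaction becomes its own process node placed between the molecules it connects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    source: str
    target: str
    kind: str  # e.g. "activation" or "inhibition"


def edge_model(interactions):
    """Edge-centric model: the interaction type is an attribute of the directed edge."""
    return {(it.source, it.target): it.kind for it in interactions}


def process_node_model(interactions):
    """Process-node model: every interaction is a node linking source to target."""
    node_types = {}
    edges = []
    for index, it in enumerate(interactions):
        process = f"process_{index}"
        node_types[it.source] = "molecule"
        node_types[it.target] = "molecule"
        node_types[process] = it.kind
        edges.append((it.source, process))
        edges.append((process, it.target))
    return node_types, edges


toy = [Interaction("A", "B", "activation"), Interaction("B", "C", "inhibition")]

direct = edge_model(toy)
node_types, edges = process_node_model(toy)

print(direct)
print(node_types)
print(edges)
```

A method written against one model has to be adapted, or the graph converted as above, before it can consume pathways expressed in the other.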

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house “NCI-Nature curated pathways,” in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.
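To give a sense of what working with one of these formats involves, the sketch below reads a small KGML-like fragment into a list of directed edges. The fragment is hand-written for illustration and is not taken from KEGG: the pathway name and gene identifiers are placeholders, and the element and attribute names (entry, relation, subtype) reflect our understanding of KGML and should be checked against the KGML specification before use. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, KGML-like toy fragment with placeholder identifiers.
KGML_FRAGMENT = """
<pathway name="path:toy00001" org="hsa" number="00001" title="Toy pathway">
  <entry id="1" name="hsa:0001" type="gene"/>
  <entry id="2" name="hsa:0002" type="gene"/>
  <entry id="3" name="hsa:0003" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="inhibition" value="--|"/>
  </relation>
</pathway>
"""


def kgml_to_edges(kgml_text):
    """Return (source name, target name, interaction subtypes) for each relation."""
    root = ET.fromstring(kgml_text.strip())
    # Relations refer to entries by their numeric id, so build a lookup first.
    names = {entry.get("id"): entry.get("name") for entry in root.iter("entry")}
    edges = []
    for relation in root.iter("relation"):
        kinds = [subtype.get("name") for subtype in relation.iter("subtype")]
        edges.append((names[relation.get("entry1")], names[relation.get("entry2")], kinds))
    return edges


edges_from_kgml = kgml_to_edges(KGML_FRAGMENT)
for edge in edges_from_kgml:
    print(edge)
```

An equivalent reader would be needed for each of the other formats, which is part of why a commonly accepted format would simplify method development.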

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective and, many times, they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study, where available. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of datasets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other, more recent papers.
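The target-pathway criterion can be sketched in a few lines. The example below is only an illustration of the idea, not the scoring code of the cited benchmarks: the pathway names and p-values are invented, the function names are hypothetical, and the summary by median is our own choice. For each dataset, it records the rank and the p-value of the target pathway in a method's output and then summarizes them across datasets.

```python
from statistics import median


def target_rank(pvalues, target):
    """1-based rank of the target pathway when pathways are sorted by ascending p-value."""
    ordered = sorted(pvalues, key=pvalues.get)
    return ordered.index(target) + 1


def summarize(method_results):
    """method_results: list of (pathway -> p-value mapping, target pathway), one per dataset."""
    ranks = [target_rank(pvalues, target) for pvalues, target in method_results]
    target_pvalues = [pvalues[target] for pvalues, target in method_results]
    return {"median_rank": median(ranks), "median_p": median(target_pvalues)}


# Invented outputs of one method on three datasets, each with its target pathway.
results = [
    ({"Colorectal cancer": 0.001, "Apoptosis": 0.020, "Cell cycle": 0.300}, "Colorectal cancer"),
    ({"Alzheimer's disease": 0.040, "Apoptosis": 0.010, "Cell cycle": 0.500}, "Alzheimer's disease"),
    ({"Parkinson's disease": 0.200, "Apoptosis": 0.030, "Cell cycle": 0.100}, "Parkinson's disease"),
]

summary = summarize(results)
print(summary)
```

Lower values of both summaries indicate a method that places the target pathway closer to the top of its output.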

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of datasets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated with not only one target pathway, but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data are provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088

Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D’Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D’Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl. 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint, arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D’Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19–e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl. 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME’09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions—applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl. 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita C Donato M amp Draghici S (2012) In-corporating gene significance in the impact anal-ysis of signaling pathways In Machine Learningand Applications (ICMLA) 2012 11th Interna-tional Conference on volume 1 pages 126ndash131Boca Raton FL USA 12ndash15 December 2012Piscataway NJ IEEE

Wang E T Sandberg R Luo S Khrebtukova IZhang L Mayr C Burge C B (2008)Alternative isoform regulation in human tis-sue transcriptomes Nature 456(7221) 470 doi101038nature07509

Xia J amp Wishart D S (2010) MetPAA web-based metabolomics tool for path-way analysis and visualization Bioinformatics26(18) 2342ndash2344 doi 101093bioinformat-icsbtq418

Xia J amp Wishart D S (2011) Web-based infer-ence of biological patterns functions and path-ways from metabolomic data using Metabo-Analyst Nature Protocols 6(6) 743 doi101038nprot2011319

Xia J Sinelnikov I V Han B ampWishart D S (2015) MetaboAnalyst 30mdashmaking metabolomics more meaningful Nu-cleic Acids Research 43(W1) W251ndashW257doi 101093nargkv380

Zhao Y Chen M-H Pei B Rowe D Shin D-G Xie W Kuo L (2012) A Bayesian ap-proach to pathway analysis by integrating genendashgene functional directions and microarray dataStatistics in Biosciences 4(1) 105ndash131 doi101007s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.


other hand, run the analyses on a remote server, providing computational power and a graphical interface. The major advantage of Web-based tools is that they are user-friendly and do not require a separate local installation. From the accessibility perspective, Web-based tools have the advantage of being available from any location, as long as there is an Internet connection and a browser available. Also, updates are almost transparent to the client. This makes the user's task easy and enables collaboration, since users all over the world can utilize the same method without the burden of installing it or keeping it up-to-date. There are methods that provide both Web-based and standalone implementations.

HIPAA compliance may also be a factor in certain applications that involve data coming from patients, or data linked to other clinical data or clinical records. Currently, iPathwayGuide is the only tool available that can do topological pathway analysis and is HIPAA compliant.

The programming language and style used for implementation also play an important role in the acceptance of a method. Software tools that are neatly implemented and packaged are more appealing than those that do not have ready-to-use implementations. Many of the methods are implemented in the R programming language and are available as software packages from Bioconductor, CRAN, or the author's Web site. Their popularity among biologists and bioinformaticians is due to the fact that many bioinformatics-dedicated packages are available in R. In addition, the rigorous review procedure provided by Bioconductor and CRAN makes the software packages more standardized and reliable.

Experiment Input and Pathway Databases

A pathway analysis approach typically requires two types of input: experiment data and known biological networks. Experiment data is usually collected from high-throughput experiments that compare a condition phenotype to a control phenotype. A condition phenotype can be a disease, a drug treatment, or the knock-out (KO) of a gene, while a control phenotype can be a healthy state, a different disease or drug treatment, or wild-type (non-KO) samples. Experimental data can be obtained from multiple technologies that produce different types of data: gene expression, protein abundance, metabolite concentration, miRNA expression, etc. Biological networks or pathways are often represented in the form of graphs that capture our current knowledge about the interactions of genes, proteins, metabolites, or compounds in an organism. The pathway data is accumulated, updated, and refined by amassing knowledge from scientific literature describing individual interactions or high-throughput experiment results.

Table 8.25.2 shows a summary of the input format and mathematical modeling of the surveyed approaches. Most pathway analysis methods analyze data from high-throughput experiments, such as microarrays, next-generation sequencing, or proteomics. They accept either a list of gene IDs or a list of such gene IDs associated with measured changes. These changes could be measured with different technologies and therefore can serve as proxies for different biochemical entities. For instance, one could use gene expression changes measured with microarrays, or protein levels measured with a proteomic approach, etc.

Different analysis methods use different input formats. Many methods accept a list of all genes considered in the experiment together with their expression values. Some analysis methods select a subset of genes considered to be differentially expressed (DE) based on a predefined cut-off. The cut-off is typically applied on fold-change, p-value, or both. These methods use the list of DE genes and their corresponding statistics (fold-change, p-value) as input. Other methods use only the list of DE genes, without corresponding expression values, because their scoring methods are based only on the relative positions of the genes in the graph.
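The cut-off based selection of DE genes described above can be sketched in a few lines. This is a minimal illustration only: the gene names, the statistics, and the default thresholds (absolute log2 fold-change of 1.0, p-value of 0.05) are hypothetical and are not taken from any of the surveyed tools.

```python
from typing import Dict, List, Tuple

# Hypothetical per-gene statistics: gene -> (log2 fold-change, p-value).
GeneStats = Dict[str, Tuple[float, float]]


def select_de_genes(stats: GeneStats,
                    min_abs_lfc: float = 1.0,
                    max_p: float = 0.05) -> List[str]:
    """Return the genes that pass both the fold-change and the p-value cut-off."""
    return sorted(
        gene for gene, (lfc, p) in stats.items()
        if abs(lfc) >= min_abs_lfc and p <= max_p
    )


# Illustrative values only; geneB sits just below a fold-change cut-off of 1.0.
example_stats: GeneStats = {
    "geneA": (2.10, 0.001),
    "geneB": (0.98, 0.004),
    "geneC": (-1.40, 0.030),
    "geneD": (0.30, 0.600),
}

strict = select_de_genes(example_stats, min_abs_lfc=1.0)
relaxed = select_de_genes(example_stats, min_abs_lfc=0.95)
```

In this toy input, lowering the fold-change threshold from 1.0 to 0.95 adds geneB to the DE list, which shows how a small change in the cut-off alters the input handed to a pathway analysis method.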

Methods that use cut-offs are sensitive to the chosen threshold value, because a small change in the cut-off may drastically change the number of selected genes (Nam & Kim, 2008). In addition, they typically use the most significant genes and discard the rest, whose weaker but coordinated changes may also have significant impact on pathways. Genes with moderate differential expression may be lost, even though they might be important players in the impacted pathways (Ben-Shaul, Bergman, & Soreq, 2005). Furthermore, the genes included in the set of DE genes can vary dramatically if the selection methods are changed. Hence, the results of pathway analyses based on DE genes may be vastly different depending on both the selection method and the threshold value (Pan, Lih, & Cohen, 2005). Furthermore, for the same disease, independent studies or measurements often produce different sets of differentially expressed (DE) genes (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Ein-Dor, Zuk, & Domany, 2006; Tan et al., 2003). This makes approaches that use DE genes as input appear even more unreliable.

Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods is Presented

| Method | Experiment input (a) | Pathway database (b) | Graph model (c) | Pathway scoring |
|---|---|---|---|---|
| Signaling pathway analysis using gene expression | | | | |
| MetaCore | DE genes | Proprietary canonical pathway, genome-scale network | Single-type, directed | Hierarchically aggregated |
| Pathway-Express | DE genes, change or measured genes, changed | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| PathOlogist | Measured genes, expression | KEGG | Multi-type, directed | Hierarchically aggregated |
| iPathwayGuide | DE genes, change or measured genes, change | KEGG signaling, Reactome, NCI, BioCarta | Single-type, directed | Hierarchically aggregated |
| SPIA | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| NetGSA | Measured genes, expression | KEGG signaling | Single-type, directed | Multivariate analysis |
| PWEA | Measured genes, expression | YeastNet | Single-type, undirected | Hierarchically aggregated |
| TopoGSA | DE genes | PPI network, KEGG | Single-type, undirected | Hierarchically aggregated |
| TopologyGSA | Measured genes, expression | NCI-PID | Single-type, undirected | Multivariate analysis |
| DEGraph | Measured genes, expression | KEGG | Single-type, undirected | Multivariate analysis |
| GGEA | Measured genes, expression | KEGG | Single-type, directed | Aggregate fuzzy similarity |
| BPA | Measured genes, expression, with cut-off | NCI-PID | Single-type, DAG | Bayesian network |
| GANPA | DE genes, change or measured genes, expression | PPI network, KEGG, Reactome, NCI-PID, HumanCyc | Single-type, undirected | Hierarchically aggregated |
| ROntoTools | DE genes, change or measured gene, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BAPA-IGGFD | Measured genes, expression, with cut-off | Literature-based interaction database, KEGG, WikiPathways, Reactome, MSigDB, GO BP, PANTHER; constructed gene association network from PPIs, co-annotation in GO Biological Process (BP), and co-expression in microarray data | Single-type, DAG | Bayesian network |
| CePa | DE genes, expression or measured genes, expression | NCI-PID | Single-type, directed | Hierarchically aggregated |
| THINK-Back-DS | DE genes, change; measured genes, expression | KEGG, PANTHER, BioCarta, Reactome, GenMAPP | Single-type, directed | Hierarchically aggregated |
| TBScore | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| ACST | Measured genes, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| EnrichNet | DE genes, list | PPI network, KEGG, BioCarta, WikiPathways, Reactome, NCI-PID, InterPro, GO, with STRING 9.0 | Single-type, undirected | Hierarchically aggregated |
| clipper | Measured genes, expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| DEAP | Measured genes, expression | KEGG, Reactome | Single-type, directed | Hierarchically aggregated |
| DRAGEN | Measured genes, expression | RegulonDB, M3D, HTRIdb, ENCODE, MSigDB | Single-type, directed | Linear regression |
| ToPASeq (e) | Measured genes, expression | KEGG | Single-type, directed | Hierarchical & multivariate |
| pDis | DE genes, change or measured genes, changed | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| SPATIAL | DE genes, change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BLMA (f) | Measured genes, expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| Signaling pathway analysis using multiple types of data | | | | |
| PARADIGM | Measured genes, expression, copy number, and proteins levels | Constructed PPI networks from MIPS, DIP, BIND, HPRD, IntAct, and BioGRID | Multi-type, directed | Hierarchically aggregated |
| microGraphite | Measured gene expression and miRNA expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| mirIntegrator | DE genes, change and DE miRNA, change | KEGG signaling, miRTarBase | Single-type, directed | Hierarchically aggregated |
| Metabolic pathway analysis | | | | |
| ScorePAGE | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| TAPPA | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| MetPA | DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |
| MetaboAnalyst | DE genes, change and DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |

(a) Experiment input shows the experiment data input format for each method. "DE" means differentially expressed. "Change" means fold-change value or t-statistics when comparing gene/metabolite values between two phenotypes. "Measured" means the list of all the genes/metabolites measured in the experiment. "List" represents a list of gene/metabolite identifiers (e.g., symbols). "With cut-off" shows methods that take as input the list of all measured genes and, in the analysis, mark the DE genes.
(b) Pathway database shows the name of the knowledge source for biological interactions.
(c) Graph model is a characteristic that shows whether the graph has one or multiple types of nodes, as well as whether it is directed or undirected.
(d) The package ROntoTools (Bioconductor) allows for analysis using expression values of all genes.
(e) ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA.
(f) BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

Usually, pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products, and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds, and edges to represent reactions that transform one or more compound(s) into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges rather than nodes, as in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally and arguably underdeveloped (only 128 Google Scholar citations to date for the original ScorePAGE paper; Rahnenfuhrer et al., 2004), and there are not many other methods available for metabolic pathway analysis. However, the topology-based analysis of signaling pathways has been very successful (over 1,200 citations to date for the two impact analysis papers mentioned above; Draghici et al., 2007; Tarca et al., 2009), and over 30 other methods have been developed since.
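The difference between the two representations can be made concrete with a small data-structure sketch. The class and field names below are hypothetical and chosen only for illustration; they do not mirror the schema of KEGG or of any surveyed tool. The point is simply that genes are read from the nodes of a signaling pathway but from the edge annotations of a metabolic pathway.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SignalingEdge:
    """Edge of a signaling pathway: a signal sent from one gene to another."""
    source_gene: str
    target_gene: str
    signal: str  # e.g. "activation" or "repression"


@dataclass
class MetabolicEdge:
    """Edge of a metabolic pathway: a reaction between compounds."""
    substrates: List[str]
    products: List[str]
    enzyme_genes: List[str] = field(default_factory=list)  # genes coding the enzymes


def genes_in_signaling(edges: List[SignalingEdge]) -> Set[str]:
    """Genes are the nodes, so collect both endpoints of every edge."""
    return {g for e in edges for g in (e.source_gene, e.target_gene)}


def genes_in_metabolic(edges: List[MetabolicEdge]) -> Set[str]:
    """Genes annotate the reactions, so collect them from the edges themselves."""
    return {g for e in edges for g in e.enzyme_genes}


# Hypothetical toy pathways.
signaling = [SignalingEdge("geneA", "geneB", "activation"),
             SignalingEdge("geneB", "geneC", "repression")]
metabolic = [MetabolicEdge(["compound1"], ["compound2"], ["geneX"]),
             MetabolicEdge(["compound2"], ["compound3"], ["geneY", "geneZ"])]

signaling_genes = genes_in_signaling(signaling)
metabolic_genes = genes_in_metabolic(metabolic)
```

A method that scores nodes by their expression change can use the first structure directly, whereas the second one first requires a decision on how edge-level gene measurements should be mapped onto the graph.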

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays, rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.

Publicly available curated pathway databases used by the methods listed in Table 8.25.2 are KEGG (Ogata et al., 1999), NCI-PID (Schaefer et al., 2009), BioCarta (https://www.biocarta.com), WikiPathways (Pico et al., 2008), PANTHER (Mi et al., 2005), and Reactome (Joshi-Tope et al., 2005). These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks, but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG includes only about 5,000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive or negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation when the bait-prey relation is considered (top), or undirected when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database or may accept several.
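As a rough sketch of the parsing step, the following code turns a list of directed interactions into an adjacency matrix. The function name, the toy pathway, and the convention that rows are sources and columns are targets are assumptions made for illustration; this is not the storage format used by SPIA or any other package.

```python
from typing import Dict, List, Sequence, Tuple


def to_adjacency_matrix(nodes: Sequence[str],
                        edges: Sequence[Tuple[str, str]]) -> List[List[int]]:
    """Build a directed adjacency matrix.

    Assumed convention: matrix[i][j] == 1 means nodes[i] -> nodes[j].
    """
    index: Dict[str, int] = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


# Hypothetical three-gene cascade: g1 -> g2 -> g3, plus a direct g1 -> g3 signal.
toy_nodes = ["g1", "g2", "g3"]
toy_edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g3")]
toy_matrix = to_adjacency_matrix(toy_nodes, toy_edges)
```

Because the matrix is not symmetric, edge direction is preserved, which is what a directed single-type graph model requires.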

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than single-type models, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe "all" and "any" relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connections only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find "important" submodules, those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the research hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is re-tested on each of them.
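The moralization step defined in brackets above can be sketched as follows. This is a generic illustration of the graph operation, not the TopologyGSA implementation; the function name and the toy DAG are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def moral_graph(directed_edges: Iterable[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Return the undirected edge set of the moral graph of a DAG.

    Each input edge is a (parent, child) pair.
    """
    parents: Dict[str, Set[str]] = {}
    undirected: Set[FrozenSet[str]] = set()
    for parent, child in directed_edges:
        parents.setdefault(child, set()).add(parent)
        undirected.add(frozenset((parent, child)))  # drop the edge direction
    for shared_parents in parents.values():
        # "Marry" every pair of nodes that share a child.
        for a, b in combinations(sorted(shared_parents), 2):
            undirected.add(frozenset((a, b)))
    return undirected


# Toy DAG: A and B are both parents of C, and C is the parent of D.
toy_dag = [("A", "C"), ("B", "C"), ("C", "D")]
moral_edges = moral_graph(toy_dag)
```

For the toy DAG, the result contains the three original edges without direction plus the added A-B edge, because A and B share the child C.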

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.
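To illustrate why a separate interaction node makes AND and OR gates expressible, here is a deliberately simplified Boolean sketch. It is not the model used by PathOlogist or PARADIGM (which work with scores and probabilities rather than Boolean states); the class, the gate labels, and the component names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InteractionNode:
    """Interaction node linking several input components to one output component."""
    inputs: List[str]
    output: str
    gate: str  # "AND": all inputs required; "OR": any input suffices


def propagate(active: Dict[str, bool],
              interactions: List[InteractionNode]) -> Dict[str, bool]:
    """Propagate Boolean activity through the interactions.

    Assumes the interactions are listed in topological order.
    Components without a known state are treated as inactive.
    """
    state = dict(active)
    for node in interactions:
        values = [state.get(name, False) for name in node.inputs]
        state[node.output] = all(values) if node.gate == "AND" else any(values)
    return state


# A complex needs both p1 and p2 (AND); the target is switched on by the
# complex or by p3 (OR).
toy_interactions = [
    InteractionNode(inputs=["p1", "p2"], output="complex", gate="AND"),
    InteractionNode(inputs=["complex", "p3"], output="target", gate="OR"),
]
toy_state = propagate({"p1": True, "p2": False, "p3": True}, toy_interactions)
```

In a single-type graph, the edges p1 -> complex and p2 -> complex alone would not say whether both inputs or only one of them is required; the gate stored on the interaction node carries that information.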

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model, and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for scoring algorithm categories.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.

Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in the multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian networks, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

Node level scoring

Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures, (iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
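Two of the centrality measures described above, degree and closeness, can be computed with a short breadth-first search. This is a sketch under simplifying assumptions: the graph is connected, shortest distances are measured ignoring edge direction, and the function names and toy pathway are hypothetical. It does not reproduce the node scores of any surveyed method.

```python
from collections import deque
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def degree_centrality(nodes: Sequence[str], edges: Sequence[Edge]) -> Dict[str, int]:
    """Count the directed edges entering and leaving each node."""
    degree = {n: 0 for n in nodes}
    for source, target in edges:
        degree[source] += 1  # edge leaving the source
        degree[target] += 1  # edge entering the target
    return degree


def shortest_distance_sums(nodes: Sequence[str], edges: Sequence[Edge]) -> Dict[str, int]:
    """Sum of shortest-path distances from each node to all reachable nodes.

    Assumption: direction is ignored when measuring distance. A smaller sum
    means a more central node (closeness is often reported as its reciprocal).
    """
    neighbours: Dict[str, Set[str]] = {n: set() for n in nodes}
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    totals: Dict[str, int] = {}
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from the start node
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        totals[start] = sum(dist.values())
    return totals


# Toy pathway: A -> B, B -> C, B -> D.
toy_nodes: List[str] = ["A", "B", "C", "D"]
toy_edges: List[Edge] = [("A", "B"), ("B", "C"), ("B", "D")]
degrees = degree_centrality(toy_nodes, toy_edges)
distance_sums = shortest_distance_sums(toy_nodes, toy_edges)
```

In the toy pathway, node B has the highest degree and the smallest distance sum, so both measures single it out as the most central node.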

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profiles across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
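The pair score described above, a similarity measure divided by the shortest path distance, can be sketched as follows. Pearson correlation is used here as one possible similarity measure; the function names, the expression profiles, and the toy graph are hypothetical, and the exact formulas used by ScorePAGE and PWEA may differ.

```python
from collections import deque
from math import sqrt
from typing import Dict, Optional, Sequence, Set


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two expression profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def path_length(neighbours: Dict[str, Set[str]], start: str, goal: str) -> Optional[int]:
    """Shortest path length by breadth-first search; None if unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist.get(goal)


def pair_score(profiles: Dict[str, Sequence[float]],
               neighbours: Dict[str, Set[str]], a: str, b: str) -> float:
    """Similarity of two distinct, connected nodes divided by their distance."""
    distance = path_length(neighbours, a, b)
    if not distance:  # same node (0) or unreachable (None)
        raise ValueError("pair must be two distinct, connected nodes")
    return pearson(profiles[a], profiles[b]) / distance


# Toy undirected pathway x - y - z with hypothetical expression profiles.
toy_neighbours = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
toy_profiles = {"x": [1.0, 2.0, 3.0, 4.0],
                "y": [2.0, 4.0, 6.0, 8.0],
                "z": [4.0, 3.0, 2.0, 1.0]}
score_xy = pair_score(toy_profiles, toy_neighbours, "x", "y")
score_xz = pair_score(toy_profiles, toy_neighbours, "x", "z")
```

Dividing by the distance means that a correlation between direct neighbours counts in full, while the same absolute correlation between nodes two steps apart contributes only half as much.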

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions, such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum over all genes, while TBScore weights the pathway DE genes based on their log fold-change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.
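A linear aggregation of node scores into a pathway score can be sketched as below. The node scores are taken as given (they would come from the node level scoring step), and the function name, the `how` parameter, and all values are hypothetical; this is not the scoring formula of iPathwayGuide or TBScore.

```python
from typing import Dict, Iterable


def aggregate_linear(node_scores: Dict[str, float],
                     pathway_genes: Iterable[str],
                     how: str = "sum") -> float:
    """Aggregate node level scores into one pathway score by summation or averaging.

    Genes on the pathway that have no node score are skipped.
    """
    scores = [node_scores[g] for g in pathway_genes if g in node_scores]
    if not scores:
        return 0.0
    total = sum(scores)
    return total if how == "sum" else total / len(scores)


# Hypothetical node scores and pathway membership.
toy_scores = {"g1": 1.5, "g2": -0.5, "g3": 2.0, "g4": 0.7}
toy_pathway = ["g1", "g2", "g3"]
score_sum = aggregate_linear(toy_scores, toy_pathway, how="sum")
score_mean = aggregate_linear(toy_scores, toy_pathway, how="mean")
```

Summation favors large pathways, whereas averaging removes the dependence on pathway size; which behavior is preferable depends on how the significance of the score is assessed afterwards.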

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
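One way to write down the TAPPA-style score described above is the following expression. It is a reading of the verbal description rather than a formula quoted from the source: the symbols are assumptions, with x_{is} denoting the normalized log expression of node i in sample s, a_{ij} the adjacency indicator, and PCI_s a label for the resulting pathway score of sample s. The block assumes the amsmath package.

```latex
\[
  \mathrm{PCI}_s \;=\; \sum_{i}\sum_{j}
  \operatorname{sgn}\!\left(x_{is} + x_{js}\right)\,
  \lvert x_{is}\rvert^{1/2}\; a_{ij}\; \lvert x_{js}\rvert^{1/2},
  \qquad
  a_{ij} =
  \begin{cases}
    1 & \text{if nodes } i \text{ and } j \text{ share an edge},\\
    0 & \text{otherwise.}
  \end{cases}
\]
```

The square roots correspond to the node values, the factor a_{ij} sets the weight to 0 for unconnected pairs, and the sign term makes a connected pair contribute positively when it is jointly up-regulated and negatively when it is jointly down-regulated. The score is non-linear because it depends on products of node values rather than on their sum.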

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes, and meaningful changes that are unlikely to occur just by chance and that therefore are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
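The text does not prescribe a particular correction for multiple comparisons. As one common choice, the sketch below computes Benjamini-Hochberg (false discovery rate) adjusted p-values; the function name and the example p-values are hypothetical, and other corrections, such as Bonferroni, could be used instead.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four pathways.
pathway_p = [0.01, 0.04, 0.03, 0.20]
pathway_q = benjamini_hochberg(pathway_p)
significant = [q <= 0.05 for q in pathway_q]
```

With these values, only the first pathway stays below a 0.05 threshold after adjustment, although three of the four raw p-values were below it.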

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
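
As an illustration of the hypergeometric assumption described above, the following Python sketch computes the probability of observing at least a given number of DE genes on a pathway. The function name and all counts are hypothetical, and the calculation inherits the independence assumption criticized in the text.

```python
from math import comb


def hypergeom_upper_tail(n_genes, n_de, pathway_size, de_in_pathway):
    """P(X >= de_in_pathway) when pathway_size genes are drawn without
    replacement from n_genes, of which n_de are differentially expressed."""
    total = comb(n_genes, pathway_size)
    upper = min(n_de, pathway_size)
    tail = sum(
        comb(n_de, k) * comb(n_genes - n_de, pathway_size - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / total


# Hypothetical experiment: 20,000 measured genes, 1,000 of them DE,
# and a pathway of 100 genes that contains 12 DE genes.
p_value = hypergeom_upper_tail(20000, 1000, 100, 12)
print(p_value)
```

When the independence assumption is a concern, the same pathway score can instead be compared against a null distribution built by permuting sample labels or gene values, as described above.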

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable, assigned to an edge, is the probability that up- or down-regulation of the genes at both ends of an interaction are concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probability follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
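
The binomial model with a beta-distributed probability of success can be illustrated with the standard conjugate update below. This is a generic textbook sketch in Python, not the implementation of BPA or BAPA-IGGFD; the function names, the prior, and the counts are assumptions chosen only for illustration.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing binomial data under a Beta prior."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


prior = (1.0, 1.0)  # uniform prior, an arbitrary illustrative choice
# Hypothetical data: 7 "successes" (e.g., a gene called DE) out of 10 observations.
posterior = beta_binomial_update(*prior, successes=7, trials=10)
print(posterior, beta_mean(*posterior))
```

The multinomial-Dirichlet pair mentioned for BPA follows the same pattern, with one count and one prior parameter per category.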

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925).

GGEA, on the other hand, uses Petri nets (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and a Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
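
To make the p-value combination step concrete, the sketch below implements the classical, unweighted Fisher's method in Python. DRAGEN uses a weighted variant whose weights are not described here, so this is an illustration of the underlying idea only; the edge p-values are hypothetical, and the method assumes independent, strictly positive p-values.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's (unweighted) method."""
    k = len(p_values)
    # Test statistic X = -2 * sum(ln p_i); under the null it follows a
    # chi-square distribution with 2k degrees of freedom.
    half_x = -sum(log(p) for p in p_values)
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return exp(-half_x) * total


# Hypothetical p-values for the edges of one pathway.
edge_p_values = [0.04, 0.20, 0.01, 0.50]
print(fisher_combined_p(edge_p_values))
```

For methods such as DRAGEN and GGEA, the final pathway p-value is obtained by comparing the observed summary statistic against a permutation-based null distribution rather than against the chi-square reference used in this sketch.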

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000 despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations, so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis: Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012); Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002); and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
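
The random relabeling scheme described above can be sketched in a few lines of Python. The sample identifiers are placeholders for the 140 pooled control samples, the function name is hypothetical, and the number of datasets per design is reduced from the 10,000 used in the text so that the example runs quickly.

```python
import random


def simulate_null_splits(sample_ids, n_control, n_disease, n_datasets, seed=0):
    """Randomly relabel pooled control samples as 'control' or 'disease'."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_datasets):
        # Sampling without replacement keeps the two groups disjoint.
        chosen = rng.sample(sample_ids, n_control + n_disease)
        splits.append((chosen[:n_control], chosen[n_control:]))
    return splits


# Placeholder identifiers standing in for the 140 pooled control samples.
pooled = [f"sample_{i:03d}" for i in range(140)]
# (control, disease) group sizes taken from the text.
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]
n_per_design = 50  # the text uses 10,000 per design; reduced for a quick run

simulated = {
    design: simulate_null_splits(pooled, design[0], design[1], n_per_design, seed=k)
    for k, design in enumerate(designs)
}
print({design: len(splits) for design, splits in simulated.items()})
```

Each simulated split would then be passed to the pathway analysis method under study, and the resulting pathway p-values collected to form the empirical null distribution.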

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore, we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
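
One simple way to quantify how far a set of null p-values departs from uniformity is the one-sample Kolmogorov-Smirnov statistic against the Uniform(0, 1) distribution. The Python sketch below is an illustration under our own assumptions, not the procedure used to produce Figure 8.25.4; the p-values are synthetic, and 1.36/sqrt(n) is the usual large-sample approximation of the 5% critical value.

```python
import random
from math import sqrt


def ks_uniform(p_values):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # The empirical CDF jumps from (i-1)/n to i/n at x; the uniform CDF is x.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d


rng = random.Random(1)
uniform_like = [rng.random() for _ in range(1000)]
skewed = [rng.random() ** 3 for _ in range(1000)]  # artificially biased towards zero
critical = 1.36 / sqrt(1000)  # approximate large-sample 5% critical value

for label, values in [("uniform-like", uniform_like), ("skewed", skewed)]:
    d = ks_uniform(values)
    verdict = "exceeds critical value" if d > critical else "within critical value"
    print(label, round(d, 3), verdict)
```

A pathway whose null p-values behave like the skewed sample would be flagged as significant far more often than the nominal threshold suggests.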

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex bio-molecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces “process nodes” to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house “NCI-Nature curated pathways,” in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other more recent papers.
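
The target-pathway criterion can be expressed as a short Python sketch. The pathway names, p-values, and the use of the median as a summary across datasets are illustrative assumptions, not values or choices taken from the cited benchmarks.

```python
from statistics import median


def target_rank(results, target):
    """Return (rank, p_value) of the target pathway; rank 1 is the smallest p-value."""
    ordered = sorted(results, key=results.get)
    return ordered.index(target) + 1, results[target]


# Hypothetical output of one method on two datasets, each paired with its
# target pathway (all names and p-values are made up for illustration).
benchmark = [
    ({"Colorectal cancer": 0.001, "Cell cycle": 0.02, "Insulin signaling": 0.4},
     "Colorectal cancer"),
    ({"Alzheimer's disease": 0.03, "Cell cycle": 0.01, "Insulin signaling": 0.2},
     "Alzheimer's disease"),
]

ranks, p_values = zip(*(target_rank(results, target) for results, target in benchmark))
print(median(ranks), median(p_values))
```

A method with lower summary rank and p-value for the target pathways would be considered better under this criterion.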

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated with not only one target pathway but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance.

Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088

Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions - applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis. Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0 - making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1


Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods is Presented

Method | Experiment input (a) | Pathway database (b) | Graph model (c) | Pathway scoring

Signaling pathway analysis using gene expression
MetaCore | DE genes | Proprietary canonical pathway, genome-scale network | Single-type directed | Hierarchically aggregated
Pathway-Express | DE genes change or measured genes change | KEGG signaling | Single-type directed | Hierarchically aggregated
PathOlogist | Measured genes expression | KEGG | Multi-type directed | Hierarchically aggregated
iPathwayGuide | DE genes change or measured genes change | KEGG signaling, Reactome, NCI, BioCarta | Single-type directed | Hierarchically aggregated
SPIA | DE genes change | KEGG signaling | Single-type directed | Hierarchically aggregated
NetGSA | Measured genes expression | KEGG signaling | Single-type directed | Multivariate analysis
PWEA | Measured genes expression | YeastNet | Single-type undirected | Hierarchically aggregated
TopoGSA | DE genes | PPI network, KEGG | Single-type undirected | Hierarchically aggregated
TopologyGSA | Measured genes expression | NCI-PID | Single-type undirected | Multivariate analysis
DEGraph | Measured genes expression | KEGG | Single-type undirected | Multivariate analysis
GGEA | Measured genes expression | KEGG | Single-type directed | Aggregate fuzzy similarity
BPA | Measured genes expression, with cut-off | NCI-PID | Single-type DAG | Bayesian network
GANPA | DE genes change or measured genes expression | PPI network, KEGG, Reactome, NCI-PID, HumanCyc | Single-type undirected | Hierarchically aggregated
ROntoTools | DE genes change or measured gene expression | KEGG signaling | Single-type directed | Hierarchically aggregated
BAPA-IGGFD | Measured genes expression, with cut-off | Literature-based interaction database, KEGG, WikiPathways, Reactome, MSigDB, GO BP, PANTHER; constructed gene association network from PPIs, co-annotation in GO Biological Process (BP), and co-expression in microarray data | Single-type DAG | Bayesian network
CePa | DE genes expression or measured genes expression | NCI-PID | Single-type directed | Hierarchically aggregated
THINK-Back-DS | DE genes change, measured genes expression | KEGG, PANTHER, BioCarta, Reactome, GenMAPP | Single-type directed | Hierarchically aggregated
TBScore | DE genes change | KEGG signaling | Single-type directed | Hierarchically aggregated
ACST | Measured genes expression | KEGG signaling | Single-type directed | Hierarchically aggregated
EnrichNet | DE genes list | PPI network, KEGG, BioCarta, WikiPathways, Reactome, NCI-PID, InterPro, GO, with STRING 9.0 | Single-type undirected | Hierarchically aggregated
clipper | Measured genes expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type directed | Multivariate analysis
DEAP | Measured genes expression | KEGG, Reactome | Single-type directed | Hierarchically aggregated
DRAGEN | Measured genes expression | RegulonDB, M3D, HTRIdb, ENCODE, MSigDB | Single-type directed | Linear regression
ToPASeq (e) | Measured genes expression | KEGG | Single-type directed | Hierarchical & multivariate
pDis | DE genes change or measured genes change | KEGG signaling | Single-type directed | Hierarchically aggregated
SPATIAL | DE genes change | KEGG signaling | Single-type directed | Hierarchically aggregated
BLMA (f) | Measured genes expression | KEGG signaling | Single-type directed | Hierarchically aggregated

Signaling pathway analysis using multiple types of data
PARADIGM | Measured genes expression, copy number, and proteins levels | Constructed PPI networks from MIPS, DIP, BIND, HPRD, IntAct, and BioGRID | Multi-type directed | Hierarchically aggregated
microGraphite | Measured gene expression and miRNA expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type directed | Multivariate analysis
mirIntegrator | DE genes change and DE miRNA change | KEGG signaling, miRTarBase | Single-type directed | Hierarchically aggregated

Metabolic pathway analysis
ScorePAGE | Measured genes expression | KEGG metabolic | Single-type undirected | Hierarchically aggregated
TAPPA | Measured genes expression | KEGG metabolic | Single-type undirected | Hierarchically aggregated
MetPA | DE metabolites change | KEGG metabolic | Single-type directed | Hierarchically aggregated
MetaboAnalyst | DE genes change and DE metabolites change | KEGG metabolic | Single-type directed | Hierarchically aggregated

(a) Experiment input shows the experiment data input format for each method. "DE" means differentially expressed. "Change" means fold-change value or t-statistics when comparing gene/metabolite values between two phenotypes. "Measured" means the list of all the genes/metabolites measured in the experiment. "List" represents a list of gene/metabolite identifiers (e.g., symbols). "With cut-off" marks methods that take as input the list of all measured genes and mark the DE genes during the analysis.
(b) Pathway database shows the name of the knowledge source for biological interactions.
(c) Graph model is a characteristic that shows whether the graph has one or multiple types of nodes, and whether it is directed or undirected.
(d) The package ROntoTools (Bioconductor) allows for analysis using expression values of all genes.
(e) ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA.
(f) BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

independent studies or measurements often produce different sets of differentially expressed (DE) genes (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Ein-Dor, Zuk, & Domany, 2006; Tan et al., 2003). This makes approaches that use DE genes as input appear even more unreliable.

Usually, pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds and edges to represent reactions that transform one or more compound(s) into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges, rather than with nodes as in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally and arguably underdeveloped (only 128 Google Scholar citations to date for the original ScorePAGE paper; Rahnenfuhrer et al., 2004), and there are not many other methods available for metabolic pathway analysis. However, the topology-based analysis of signaling pathways has been very successful (over 1,200 citations to date for the two impact analysis papers mentioned above; Draghici et al., 2007; Tarca et al., 2009), and over 30 other methods have been developed since.

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays, rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.

Publicly available curated pathway databases used by methods listed in Table 8.25.2 are KEGG (Ogata et al., 1999), NCI-PID (Schaefer et al., 2009), BioCarta (https://www.biocarta.com), WikiPathways (Pico et al., 2008), PANTHER (Mi et al., 2005), and Reactome (Joshi-Tope et al., 2005). These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG includes only about 5,000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive regulation, negative regulation, or "involved in". Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation when the bait-prey relation is considered (top), or undirected when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database or may accept several.
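The parsing step described above can be illustrated with a minimal Python sketch that turns a directed edge list into an adjacency matrix. This is not the parser of SPIA or of any other surveyed tool. The function name, the toy pathway, and the orientation (row = source, column = target) are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def to_adjacency_matrix(
    edges: List[Tuple[str, str]],
) -> Tuple[List[str], List[List[int]]]:
    """Convert a directed edge list into (node order, adjacency matrix)."""
    # Collect nodes in order of first appearance so the row/column order is stable.
    nodes: List[str] = []
    index: Dict[str, int] = {}
    for source, target in edges:
        for name in (source, target):
            if name not in index:
                index[name] = len(nodes)
                nodes.append(name)
    # Assumed convention: matrix[i][j] == 1 means an edge from nodes[i] to nodes[j].
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return nodes, matrix


# Hypothetical toy signaling pathway: A activates B and C, which both act on D.
toy_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
toy_nodes, toy_matrix = to_adjacency_matrix(toy_edges)
```

A single 0/1 matrix of this kind discards the edge type (activation, repression, etc.); a method that distinguishes interaction types would need one matrix per type or a signed matrix.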

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than single-type models, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe "all" and "any" relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connections only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find "important" submodules, i.e., those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the research hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is re-tested on each of them.
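The moralization step defined in the bracketed note above can be sketched in a few lines of Python. This is a generic illustration of the definition, not the TopologyGSA implementation. The function name and the toy DAG are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def moralize(dag_edges: Iterable[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Return the undirected edge set of the moral graph of a DAG.

    Each input edge is a (parent, child) pair.
    """
    parents: Dict[str, Set[str]] = {}
    undirected: Set[FrozenSet[str]] = set()
    for parent, child in dag_edges:
        parents.setdefault(child, set()).add(parent)
        # Step 2 of the definition: drop the direction of every original edge.
        undirected.add(frozenset((parent, child)))
    # Step 1 of the definition: "marry" every pair of parents sharing a child.
    for shared_parents in parents.values():
        for a, b in combinations(sorted(shared_parents), 2):
            undirected.add(frozenset((a, b)))
    return undirected


# Hypothetical toy DAG: A and B are both parents of C, and C is a parent of D.
toy_dag = [("A", "C"), ("B", "C"), ("C", "D")]
toy_moral = moralize(toy_dag)
```

In this toy example the only edge added by moralization is the one between A and B, the two parents of C. The subsequent triangulation and decomposition into submodules are not covered by this sketch.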

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.
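To make the AND/OR remark concrete, the following Python sketch models interaction nodes that combine the states of their input components with an explicit gate. It is a deliberately simplified Boolean illustration, not the model used by PathOlogist or PARADIGM. The `Interaction` class, the `propagate` function, and the toy components are all hypothetical, and interactions are assumed to be listed in topological order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Interaction:
    """An interaction node linking input components to one output component."""
    inputs: List[str]
    output: str
    gate: str  # "AND": all inputs required; "OR": any single input suffices


def propagate(interactions: List[Interaction], active: Dict[str, bool]) -> Dict[str, bool]:
    """Propagate Boolean activity through interactions, in the order given."""
    state = dict(active)
    for interaction in interactions:
        values = [state.get(name, False) for name in interaction.inputs]
        if interaction.gate == "AND":
            fired = all(values)
        elif interaction.gate == "OR":
            fired = any(values)
        else:
            raise ValueError(f"unknown gate: {interaction.gate}")
        # An output stays active if any earlier interaction already activated it.
        state[interaction.output] = state.get(interaction.output, False) or fired
    return state


# Hypothetical pathway: A and B must both be present to form a complex (AND);
# the target is activated by either the complex or C (OR).
toy_interactions = [
    Interaction(inputs=["A", "B"], output="complex", gate="AND"),
    Interaction(inputs=["complex", "C"], output="target", gate="OR"),
]
toy_state = propagate(toy_interactions, {"A": True, "B": False, "C": True})
```

A single-type graph with edges A→target, B→target, and C→target could not distinguish these two situations, which is the limitation of single-type models noted in the text.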

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for the categories of scoring algorithms.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.
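The three levels can be sketched generically in Python as follows. This is not the algorithm of any particular surveyed method: the node statistic (absolute difference of group means), the linear aggregation (sum), and the sample-label permutation test are assumptions chosen for brevity, and the node statistic here ignores topology, which most surveyed methods do incorporate. All function names and the toy data are hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def node_scores(expr: Dict[str, Sequence[float]], labels: Sequence[int]) -> Dict[str, float]:
    """Level 1: one score per gene (absolute difference of the two group means)."""
    scores = {}
    for gene, values in expr.items():
        case = [v for v, label in zip(values, labels) if label == 1]
        control = [v for v, label in zip(values, labels) if label == 0]
        scores[gene] = abs(mean(case) - mean(control))
    return scores


def pathway_score(scores: Dict[str, float], pathway_genes: Sequence[str]) -> float:
    """Level 2: aggregate node scores linearly (sum over the pathway's genes)."""
    return sum(scores[gene] for gene in pathway_genes)


def permutation_p_value(
    expr: Dict[str, Sequence[float]],
    labels: Sequence[int],
    pathway_genes: Sequence[str],
    n_perm: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Level 3: significance of the pathway score under shuffled sample labels."""
    rng = random.Random(seed)
    observed = pathway_score(node_scores(expr, labels), pathway_genes)
    shuffled: List[int] = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pathway_score(node_scores(expr, shuffled), pathway_genes) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: three control samples (label 0) and three cases (label 1).
toy_expr = {
    "g1": [1, 1, 1, 5, 5, 5],
    "g2": [2, 2, 2, 2, 2, 2],
    "g3": [3, 1, 2, 1, 3, 2],
}
toy_labels = [0, 0, 0, 1, 1, 1]
observed, p_value = permutation_p_value(toy_expr, toy_labels, ["g1", "g2"], n_perm=200)
```

Permuting sample labels preserves gene-gene correlations within the pathway; permuting gene labels instead would test a different null hypothesis, and the choice between them is left open here.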

Node level scoringMost of the surveyed methods incorpo-

rate pathway topology information in the nodescores The node level scoring can be di-vided into four categories (i) graph mea-sures (centrality) (ii) similarity measures

Network-BasedApproaches forPathway Level

Analysis

82512

Supplement 61 Current Protocols in Bioinformatics

Figure 8253 A summary of the statistical models is presented for the surveyed methods Mostmethods use the hierarchically aggregated scoring strategy in which the score is computed atthe node level before being aggregated at the pathway level for significance assessment In theleft panel the rows show a node-level statistical model while the columns show the aggregatingstrategies (linear non-linear or weighted gene set) Approaches in multivariate analysis andBayesian network categories use multivariate modeling and Bayesian network respectively tofind pathways that are most likely to be impacted Both BLMA and ToPASeq provide R packagesthat run multiple algorithms DRAGEN follows a linear regression strategy while GGEA appliesfuzzy modeling to rank the pathways

(iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the network adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
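Two of the centrality measures described above can be illustrated on a toy graph. The following is a minimal sketch, assuming a hypothetical five-node directed pathway; the edge list and function names are illustrative only and do not reproduce the implementation of any surveyed method. Distances are summed over the nodes reachable along directed edges.

```python
from collections import deque

# Hypothetical toy pathway: directed edges point from regulator to target.
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]


def build_adjacency(edges):
    """Return {node: [downstream neighbours]} for a directed edge list."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, [])
    return adjacency


def degree_centrality(edges):
    """Count the directed edges that enter or leave each node."""
    degree = {}
    for source, target in edges:
        degree[source] = degree.get(source, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    return degree


def shortest_distance_sum(adjacency, start):
    """Sum the shortest directed distances from start to every reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return sum(distance.values())


adjacency = build_adjacency(EDGES)
degree = degree_centrality(EDGES)
distance_sums = {node: shortest_distance_sum(adjacency, node) for node in adjacency}
print(degree)
print(distance_sums)
```

In this toy graph, node D has the highest degree because two edges enter it and one leaves it, while node A, which sits upstream of every other node, has the largest distance sum.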

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profiles across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
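The ratio described above can be sketched as follows, assuming that Pearson correlation is the chosen similarity measure and that the shortest path distance between the two nodes has already been computed. The profiles, the distances, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def pair_score(profile_a, profile_b, path_length):
    """Similarity divided by the shortest path distance between two nodes."""
    return pearson(profile_a, profile_b) / path_length


# Hypothetical expression profiles across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # perfectly correlated with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]  # perfectly anti-correlated with gene_a

score_ab = pair_score(gene_a, gene_b, path_length=2)
score_ac = pair_score(gene_a, gene_c, path_length=1)
print(score_ab, score_ac)
```

Dividing by the path length means that a strongly correlated pair contributes less when the two genes are far apart in the pathway graph.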

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all its genes, while TBScore weights the pathway DE genes based on their log fold-change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
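A pairwise, sign-weighted aggregation of the kind described for TAPPA can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the node score is taken as the square root of the absolute node value, and the sign weight is assumed to be the sign of the sum of the two node values. All names and values are hypothetical.

```python
import math


def sign(value):
    """Return -1, 0, or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pairwise_pathway_score(node_values, edges):
    """Sum sign-weighted products of node scores over connected pairs only."""
    total = 0.0
    for a, b in edges:  # unconnected pairs have weight 0 and are skipped
        score_a = math.sqrt(abs(node_values[a]))
        score_b = math.sqrt(abs(node_values[b]))
        weight = sign(node_values[a] + node_values[b])  # assumed sign rule
        total += weight * score_a * score_b
    return total


# Hypothetical normalized log-expression values for one sample.
values = {"A": 4.0, "B": 1.0, "C": -9.0}
edges = [("A", "B"), ("B", "C")]
score = pairwise_pathway_score(values, edges)
print(score)
```

Because the score is built from products of node pairs rather than a sum of single-node terms, it is non-linear in the node values, which is what distinguishes this group from the linear aggregations.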

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes and meaningful changes, which are unlikely to occur just by chance and are therefore possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
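The final correction step can be sketched as follows. The text does not prescribe a particular procedure; the Benjamini-Hochberg false discovery rate adjustment is used here only as one common choice, and the pathway names, p-values, and threshold are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical per-pathway p-values and a user-defined threshold.
pathway_p = {"pathway_1": 0.01, "pathway_2": 0.04, "pathway_3": 0.03, "pathway_4": 0.20}
threshold = 0.05
adjusted = benjamini_hochberg(list(pathway_p.values()))
significant = [name for name, q in zip(pathway_p, adjusted) if q <= threshold]
print(adjusted, significant)
```

In this example three pathways have raw p-values below the threshold, but only one remains significant after the adjustment.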

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
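The hypergeometric assumption mentioned above can be made concrete with a short sketch. It computes the upper-tail probability of observing at least a given number of DE genes on a pathway; the counts and the function name are hypothetical, and the calculation inherits the independence assumption criticized in the text.

```python
from math import comb


def hypergeometric_p_value(total_genes, pathway_genes, de_genes, de_in_pathway):
    """P(X >= de_in_pathway) when de_genes are drawn without replacement."""
    upper = min(pathway_genes, de_genes)
    favourable = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, de_genes - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return favourable / comb(total_genes, de_genes)


# Hypothetical counts: 20 measured genes, 5 on the pathway, 4 DE genes, 2 of them on the pathway.
p_value = hypergeometric_p_value(total_genes=20, pathway_genes=5, de_genes=4, de_in_pathway=2)
print(p_value)
```

Only gene counts enter this calculation, so two pathways with the same counts receive the same p-value regardless of where the DE genes sit in the pathway graph.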

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume that the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD each random variable is assigned to an edge and is the probability that the up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of success probabilities follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
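The beta-binomial building block shared by both methods can be illustrated with the standard conjugate update. This is a generic textbook sketch, not the BPA or BAPA-IGGFD implementation; the uniform Beta(1, 1) prior and the counts are hypothetical.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical example: uniform Beta(1, 1) prior, 7 concordant observations out of 10.
posterior = beta_binomial_update(1, 1, successes=7, trials=10)
posterior_mean = beta_mean(*posterior)
print(posterior, posterior_mean)
```

Because the beta distribution is conjugate to the binomial, the posterior stays in the same family and is obtained by adding the observed counts to the prior parameters.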

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses a Petri net (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and the Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
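To illustrate how edge-level p-values can be combined, the following sketch implements the classical, unweighted Fisher's method. DRAGEN uses a weighted variant and a permutation-based null, neither of which is reproduced here; the sketch also assumes independent p-values, and the function name and inputs are hypothetical.

```python
import math


def fisher_combined_p_value(p_values):
    """Combine independent p-values with Fisher's (unweighted) method."""
    k = len(p_values)
    half_statistic = -sum(math.log(p) for p in p_values)  # equals X / 2
    # Chi-square survival function for 2k degrees of freedom (closed form).
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_statistic / i
        series += term
    return math.exp(-half_statistic) * series


# Hypothetical p-values for two edges of a pathway.
combined = fisher_combined_p_value([0.05, 0.05])
print(combined)
```

The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom under the null, and for an even number of degrees of freedom the tail probability has the closed form used above, so no statistical library is needed.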

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allow us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000 despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG U133 plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis: Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population, which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.
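The resampling scheme and the uniformity expectation described above can be sketched as follows. The sample identifiers, the seed, and the function names are hypothetical, and the pathway analysis step that would turn each random split into p-values is deliberately left out. The second function measures how far a set of p-values departs from the uniform distribution, using the Kolmogorov-Smirnov distance as one possible summary.

```python
import random


def random_split(sample_ids, n_control, n_disease, rng):
    """Randomly label pooled control samples as 'control' or 'disease'."""
    chosen = rng.sample(sample_ids, n_control + n_disease)
    return chosen[:n_control], chosen[n_control:]


def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of p_values and the Uniform(0, 1) CDF."""
    ordered = sorted(p_values)
    n = len(ordered)
    return max(
        max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered)
    )


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pooled = [f"sample_{i}" for i in range(140)]  # placeholder sample identifiers
control, disease = random_split(pooled, 70, 70, rng)

# Evenly spread p-values stay close to the uniform CDF.
even_distance = ks_distance_from_uniform([0.125, 0.375, 0.625, 0.875])
# p-values piled up near zero do not.
skewed_distance = ks_distance_from_uniform([0.01, 0.02, 0.03, 0.04])
print(len(control), len(disease), even_distance, skewed_distance)
```

Repeating the split many times and collecting the p-values of one pathway gives its empirical null distribution; a large distance from the uniform distribution indicates the kind of bias discussed in this section.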

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex biomolecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

novel pathway databases or modifying the actual pathway graphs to conform to the method.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways", in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the available pathway that describes the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other more recent papers.
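The target pathway criterion can be sketched in a few lines. The pathway names and p-values below are hypothetical, and the function is an illustration of the ranking idea rather than the benchmarking code of the cited studies.

```python
def target_pathway_rank(pathway_p_values, target):
    """1-based rank of target when pathways are sorted by ascending p-value."""
    ordered = sorted(pathway_p_values, key=pathway_p_values.get)
    return ordered.index(target) + 1


# Hypothetical output of one method on a colon cancer dataset.
results = {
    "Colorectal cancer": 0.004,
    "Insulin signaling": 0.30,
    "Apoptosis": 0.001,
    "Cell cycle": 0.02,
}
rank = target_pathway_rank(results, "Colorectal cancer")
target_p = results["Colorectal cancer"]
print(rank, target_p)
```

Collecting the rank and the p-value of the target pathway over many datasets yields two distributions per method, which can then be compared across methods.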

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated not with only one target pathway, but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer. Specifically, 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data are provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D. Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088

Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. In Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19–e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions–applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0–making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.

Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods is Presented, continued

| Method | Experiment input (a) | Pathway database (b) | Graph model (c) | Pathway scoring |
|---|---|---|---|---|
| CePa | DE genes expression or measured genes expression | NCI-PID | Single-type, directed | Hierarchically aggregated |
| THINK-Back-DS | DE genes change, measured genes expression | KEGG, PANTHER, BioCarta, Reactome, GenMAPP | Single-type, directed | Hierarchically aggregated |
| TBScore | DE genes change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| ACST | Measured genes expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| EnrichNet | DE genes list | PPI network, KEGG, BioCarta, WikiPathways, Reactome, NCI-PID, InterPro, GO with STRING 9.0 | Single-type, undirected | Hierarchically aggregated |
| clipper | Measured genes expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| DEAP | Measured genes expression | KEGG, Reactome | Single-type, directed | Hierarchically aggregated |
| DRAGEN | Measured genes expression | RegulonDB, M3D, HTRIdb, ENCODE, MSigDB | Single-type, directed | Linear regression |
| ToPASeq (e) | Measured genes expression | KEGG | Single-type, directed | Hierarchical & multivariate |
| pDis | DE genes change or measured genes changed | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| SPATIAL | DE genes change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BLMA (f) | Measured genes expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| Signaling pathway analysis using multiple types of data | | | | |
| PARADIGM | Measured genes expression, copy number, and proteins levels | Constructed PPI networks from MIPS, DIP, BIND, HPRD, IntAct, and BioGRID | Multi-type, directed | Hierarchically aggregated |
| microGraphite | Measured gene expression and miRNA expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| mirIntegrator | DE genes change and DE miRNA change | KEGG signaling, miRTarBase | Single-type, directed | Hierarchically aggregated |
| Metabolic pathway analysis | | | | |
| ScorePAGE | Measured genes expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| TAPPA | Measured genes expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| MetPA | DE metabolites change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |
| MetaboAnalyst | DE genes change and DE metabolites change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |

(a) Experiment input shows the experiment data input format for each method. “DE” means differentially expressed. “Change” means fold-change value or t-statistics when comparing gene/metabolites values between two phenotypes. “Measured” means the list of all the genes/metabolites measured in the experiment. “List” represents a list of genes/metabolites identifiers (e.g., symbols). “With cut-off” shows methods that take as input the list of all measured genes and, in the analysis, mark the DE genes.
(b) Pathway database shows the name of the knowledge source for biological interactions.
(c) Graph model is a characteristic that shows if the graph has one or multiple types of nodes, as well as if it is directed or undirected.
(d) The package ROntoTools (Bioconductor) allows for analysis using expression values of all genes.
(e) ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA.
(f) BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

independent studies or measurements often produce different sets of differentially expressed (DE) genes (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Ein-Dor, Zuk, & Domany, 2006; Tan et al., 2003). This makes approaches that use DE genes as input appear even more unreliable.

Usually, pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds and edges to represent reactions that transform one or more compound(s) into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges rather than with nodes, as they are in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally, and arguably, underdeveloped (only 128 Google Scholar citations to date for the original ScorePAGE paper; Rahnenfuhrer et al., 2004), and there are not many other methods available for metabolic pathway analysis. However, the topology-based analysis of signaling pathways has been very successful (over 1,200 citations to date for the two impact analysis papers mentioned above; Draghici et al., 2007; Tarca et al., 2009), and over 30 other methods have been developed since.
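
The contrast between the two representations can be made concrete with a small sketch. The Python fragment below is illustrative only: the gene and compound identifiers are made up, and the conversion rule (linking an enzyme to the enzymes that consume its product) is one plausible way to derive a gene-level graph from a metabolic pathway, not the procedure of any specific method surveyed here.

```python
from collections import defaultdict

# Signaling pathway: genes are nodes, regulatory signals are directed edges.
signaling_edges = [
    ("GENE_A", "GENE_B", "activation"),
    ("GENE_B", "GENE_C", "repression"),
]

# Metabolic pathway: compounds are nodes, reactions are edges, and the
# enzyme-coding genes annotate the edges rather than the nodes.
metabolic_edges = [
    ("cpd1", "cpd2", {"ENZ_X"}),
    ("cpd2", "cpd3", {"ENZ_Y", "ENZ_Z"}),
]


def gene_graph_from_metabolic(edges):
    """Link g1 -> g2 when a reaction catalyzed by g1 produces a compound
    that is consumed by a reaction catalyzed by g2 (assumed rule)."""
    consumers = defaultdict(set)  # compound -> genes whose reactions consume it
    for substrate, _product, genes in edges:
        consumers[substrate] |= genes
    gene_edges = set()
    for _substrate, product, genes in edges:
        for g1 in genes:
            for g2 in consumers[product]:
                if g1 != g2:
                    gene_edges.add((g1, g2))
    return gene_edges


gene_edges = gene_graph_from_metabolic(metabolic_edges)
```

Only after such a transformation (or an equivalent one) can a method designed for gene-as-node graphs be run on a metabolic pathway, which is one reason tools tend to support one pathway type but not the other.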

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays, rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.

Publicly available curated pathway databases used by methods listed in Table 8.25.2 are KEGG (Ogata et al., 1999), NCI-PID (Schaefer et al., 2009), BioCarta (https://www.biocarta.com), WikiPathways (Pico et al., 2008), PANTHER (Mi et al., 2005), and Reactome (Joshi-Tope et al., 2005). These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks, but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG includes only about 5,000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive or negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation when the bait-prey relation is considered (top), or undirected when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database or may accept several.
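
As a minimal illustration of the parsing step, the sketch below turns a list of directed interactions into an adjacency matrix. The gene names, the `adjacency_matrix` helper, and the row-as-source, column-as-target orientation are assumptions made for this example; a particular tool may expect the transposed layout or additional edge-type information.

```python
def adjacency_matrix(nodes, edges):
    """Build a 0/1 adjacency matrix for a directed graph.

    Assumed convention: matrix[i][j] == 1 means an interaction goes
    from nodes[i] (source) to nodes[j] (target).
    """
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


# Hypothetical three-gene signaling cascade: A -> B -> C, plus A -> C.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C"), ("A", "C")]
matrix = adjacency_matrix(nodes, edges)
```

Once a pathway is in this form, the analysis no longer depends on whether it was originally distributed as KGML, BioPAX, or another format.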

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than single-type, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe “all” and “any” relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connection only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find “important” submodules, i.e., those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the null hypothesis of no change is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is re-tested on each of them.
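
The moralization step described above can be sketched in a few lines of Python. This is a generic illustration of the graph operation, not code taken from TopologyGSA; the `moralize` function and the four-node example DAG are hypothetical.

```python
from itertools import combinations


def moralize(dag):
    """Return the moral graph of a DAG as a set of undirected edges.

    `dag` maps each node to the set of its children. Each undirected
    edge is represented as a frozenset of its two endpoints.
    """
    # Collect the parents of every child node.
    parents = {}
    for node, children in dag.items():
        for child in children:
            parents.setdefault(child, set()).add(node)

    # Step 1: drop the direction of every existing edge.
    moral = {frozenset((node, child))
             for node, children in dag.items()
             for child in children}

    # Step 2: "marry" every pair of parents that share a child.
    for shared_parents in parents.values():
        for p, q in combinations(sorted(shared_parents), 2):
            moral.add(frozenset((p, q)))
    return moral


# Hypothetical DAG: A -> C <- B, C -> D. A and B share the child C.
dag = {"A": {"C"}, "B": {"C"}, "C": {"D"}}
moral = moralize(dag)
```

In this example the only added edge is A-B, because A and B are the two parents of C; triangulation of the resulting graph, which the method applies afterwards, is not shown.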

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.
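
To see why a second node type makes AND and OR relations expressible, consider the toy sketch below. It uses plain Boolean states and a hypothetical `propagate` helper; it is not the probabilistic scoring used by PathOlogist or PARADIGM, only an illustration of the bipartite structure in which interaction nodes sit between component nodes.

```python
def interaction_active(gate, input_states):
    """Evaluate one interaction node from the states of its input components."""
    if gate == "AND":   # all inputs required, e.g. complex formation
        return all(input_states)
    if gate == "OR":    # any single input suffices, e.g. redundant activators
        return any(input_states)
    raise ValueError(f"unknown gate type: {gate}")


# Hypothetical bipartite pathway: interaction nodes connect component nodes.
# The list is assumed to be in topological order.
interactions = [
    {"gate": "AND", "inputs": ["A", "B"], "output": "complex_AB"},
    {"gate": "OR", "inputs": ["complex_AB", "C"], "output": "D"},
]


def propagate(component_states, interactions):
    """Fill in the state of each output component, one interaction at a time."""
    states = dict(component_states)
    for interaction in interactions:
        inputs = [states[name] for name in interaction["inputs"]]
        states[interaction["output"]] = interaction_active(interaction["gate"], inputs)
    return states


states = propagate({"A": True, "B": False, "C": True}, interactions)
```

A single-type graph with edges A -> D, B -> D, and C -> D could not distinguish this wiring from one in which any of the three inputs alone activates D.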

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for the categories of scoring algorithms.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.

Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian network, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

Node level scoring

Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures, (iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the network adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
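Two of the centrality measures named above can be sketched in a few lines of Python. This is an illustration only: none of the surveyed tools is claimed to compute them this way, the function names and toy edge list are hypothetical, and the distance sum ignores edge direction and assumes unit-length edges.

```python
from collections import deque


def degree_centrality(edges):
    """Count the directed edges entering and leaving each node."""
    degree = {}
    for source, target in edges:
        degree[source] = degree.get(source, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    return degree


def distance_sums(edges):
    """Sum of shortest-path distances from each node to every reachable node.

    Edge direction is ignored and every edge has length 1, so a
    breadth-first search yields the shortest distances.
    """
    neighbors = {}
    for source, target in edges:
        neighbors.setdefault(source, set()).add(target)
        neighbors.setdefault(target, set()).add(source)
    sums = {}
    for start in neighbors:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        sums[start] = sum(dist.values())
    return sums


# Toy pathway: A -> B, B -> C, B -> D, C -> E.
toy_edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E")]
degree = degree_centrality(toy_edges)
closeness_sum = distance_sums(toy_edges)
```

In this toy pathway, node B has the highest degree (3) and the smallest distance sum (5), so both measures single it out as the most central node; a smaller distance sum corresponds to a more central node.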

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profile across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
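As a minimal sketch of such a pair score, the Python snippet below divides a Pearson correlation by a shortest-path distance. The choice of Pearson correlation, the function names, and the toy profiles and distances are assumptions for illustration; ScorePAGE and PWEA each define their own similarity measure and normalization.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def pair_score(profile_a, profile_b, path_length):
    """Similarity of a gene pair divided by their shortest-path distance."""
    return pearson(profile_a, profile_b) / path_length


# Toy expression profiles across four samples (hypothetical values).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
# Hypothetical shortest-path distances between gene pairs in the pathway.
distance = {("g1", "g2"): 1, ("g1", "g3"): 2}
scores = {
    pair: pair_score(expr[pair[0]], expr[pair[1]], d)
    for pair, d in distance.items()
}
```

Dividing by the path length means that two strongly correlated genes contribute less when they are far apart in the pathway graph.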

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables, and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all its genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
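The verbal description of the TAPPA pathway score can be turned into a small Python sketch. This follows only the description given in the text; the exact form of the sign weight (here, the sign of the sum of the two node values), the use of the square root of the absolute node value as the node score, and all names and numbers are assumptions rather than the published TAPPA formula.

```python
from math import sqrt


def sign(value):
    """Return -1, 0, or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)


def pathway_score(values, edges):
    """Non-linear pathway score for one sample.

    `values` maps genes to normalized log expression values and `edges`
    lists the connected gene pairs. Pairs without an edge get weight 0,
    so they are simply not visited.
    """
    total = 0.0
    for a, b in edges:
        # Assumed sign weight: joint up- or down-regulation of the pair.
        weight = sign(values[a] + values[b])
        # Node scores: square roots of the absolute node values.
        total += weight * sqrt(abs(values[a])) * sqrt(abs(values[b]))
    return total


# Toy sample with hypothetical normalized values.
sample = {"g1": 4.0, "g2": 1.0, "g3": -9.0}
edges = [("g1", "g2"), ("g2", "g3")]
score = pathway_score(sample, edges)
```

Here the pair (g1, g2) contributes +2 and the pair (g2, g3) contributes -3, which shows how the product of node scores makes the aggregate non-linear in the expression values.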

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes and meaningful changes that are unlikely to occur just by chance and that, therefore, are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
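The text does not prescribe a particular correction for multiple comparisons. As one common choice, the Benjamini-Hochberg false discovery rate adjustment can be sketched as follows; the function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy per-pathway p-values (hypothetical).
pathway_p = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(pathway_p)
```

A pathway would then be reported as significant when its adjusted p-value falls below the user-defined threshold.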

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
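The hypergeometric assumption mentioned above corresponds to the following upper-tail probability, sketched here in Python. The function and parameter names and the toy counts are hypothetical, and the sketch inherits the independence assumption that the text criticizes.

```python
from math import comb


def hypergeom_upper_tail(total_genes, de_genes, pathway_size, de_in_pathway):
    """P(X >= de_in_pathway) under the hypergeometric null.

    X is the number of DE genes among `pathway_size` genes drawn without
    replacement from `total_genes` genes, of which `de_genes` are DE.
    """
    denominator = comb(total_genes, pathway_size)
    upper = min(de_genes, pathway_size)
    numerator = sum(
        comb(de_genes, k) * comb(total_genes - de_genes, pathway_size - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return numerator / denominator


# Toy example: 20 measured genes, 5 DE, a pathway of 4 genes with 3 DE.
p_value = hypergeom_upper_tail(
    total_genes=20, de_genes=5, pathway_size=4, de_in_pathway=3
)
```

For the toy counts, the p-value is 155/4845, roughly 0.032.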

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable, assigned to an edge, is the probability that up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probability follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
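The distributional assumptions described above can be written out using the standard conjugacy results. The symbols (n, x, \theta, \alpha, \beta, K, m) are introduced here for illustration only; the specific priors and parameter values used by BPA and BAPA-IGGFD are not stated in the text and are not implied by this sketch.

```latex
% Binomial likelihood with a beta prior on the success probability
X \mid \theta \sim \mathrm{Binomial}(n, \theta), \qquad
\theta \sim \mathrm{Beta}(\alpha, \beta)
\;\Longrightarrow\;
\theta \mid X = x \sim \mathrm{Beta}(\alpha + x,\; \beta + n - x)

% Multivariate analogue: multinomial likelihood with a Dirichlet prior
(X_1, \dots, X_K) \mid \boldsymbol{\theta} \sim \mathrm{Multinomial}(n, \boldsymbol{\theta}), \qquad
\boldsymbol{\theta} \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)
\;\Longrightarrow\;
\boldsymbol{\theta} \mid \mathbf{x} \sim \mathrm{Dirichlet}(\alpha_1 + x_1, \dots, \alpha_K + x_K)

% Independence assumption: the joint distribution factorizes over the m variables
p(x_1, \dots, x_m) = \prod_{i=1}^{m} p(x_i)
```

The last line is the factorization that fails when two edge variables share a node, which is the situation the text identifies as contradicting the independence assumption.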

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes, and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control), and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses a Petri net (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and a Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
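For orientation, the classical (unweighted) Fisher's method for combining independent p-values is sketched below. This is not DRAGEN's weighted variant, and DRAGEN itself assesses significance by permutation rather than by the chi-square reference distribution used here; the function name and inputs are hypothetical.

```python
from math import exp, log


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with the unweighted Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    All p-values must be strictly positive.
    """
    k = len(pvalues)
    statistic = -2.0 * sum(log(p) for p in pvalues)
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i  # (x/2)^i / i!, built up iteratively
        total += term
    return exp(-half) * total


# Toy edge-level p-values (hypothetical).
combined = fisher_combined_pvalue([0.05, 0.05])
```

With a single p-value, the function returns that p-value unchanged, which is a quick sanity check of the closed form.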

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000, despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays, such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations, so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis: Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population, which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore, we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.
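The random relabeling procedure and the uniformity expectation can be sketched in Python as follows. This is not the authors' simulation code: the function names are hypothetical, only 100 rather than 10,000 repetitions are drawn, and the step that runs a pathway analysis method on each split is deliberately left out. The Kolmogorov-Smirnov distance is one possible way, assumed here, to quantify how far a set of null p-values is from the uniform distribution.

```python
import random


def random_split(n_samples, n_control, n_disease, rng):
    """Randomly label pooled control samples as 'control' or 'disease'.

    Returns two disjoint lists of sample indices.
    """
    chosen = rng.sample(range(n_samples), n_control + n_disease)
    return chosen[:n_control], chosen[n_control:]


def ks_distance_from_uniform(pvalues):
    """Kolmogorov-Smirnov distance between the empirical distribution of
    the p-values and the uniform distribution on [0, 1]."""
    ordered = sorted(pvalues)
    n = len(ordered)
    return max(
        max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered)
    )


rng = random.Random(0)  # fixed seed so the splits are reproducible
# 140 pooled control samples split into 70 "control" and 70 "disease".
splits = [random_split(140, 70, 70, rng) for _ in range(100)]
# The unbalanced designs in the text (10 vs. 20, 20 vs. 10) work the same way.
small_split = random_split(140, 10, 20, rng)
```

Each split would be passed to a pathway analysis method, and the resulting p-values for a given pathway collected across splits; a distance close to zero indicates an approximately uniform null distribution, while p-values piled up near zero or one produce a distance close to one.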

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example, A1-A3, are biased towards zero, and the lower three distributions, for example, A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex biomolecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways," in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway that describes the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other, more recent papers.
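The target-pathway criterion amounts to looking up the position and p-value of one pathway in a method's ranked output. A minimal Python sketch, with a hypothetical function name and made-up p-values, is shown below.

```python
def target_rank(pathway_pvalues, target):
    """Return the 1-based rank and the p-value of the target pathway.

    Pathways are ranked by increasing p-value, so rank 1 is the most
    significant pathway reported by the method.
    """
    ordered = sorted(pathway_pvalues, key=pathway_pvalues.get)
    rank = ordered.index(target) + 1
    return rank, pathway_pvalues[target]


# Toy output of one method on a colorectal cancer dataset (made-up values).
results = {
    "Colorectal cancer": 0.002,
    "Cell cycle": 0.0005,
    "Insulin signaling": 0.4,
}
rank, p_value = target_rank(results, "Colorectal cancer")
```

In a benchmark, these two numbers would be collected for every dataset and method, and the methods compared on the resulting distributions of ranks and p-values.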

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated with not only one target pathway, but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers face the choice of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one will rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
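The effect of non-uniform null p-values can be illustrated with a small simulation. The following Python sketch is not the procedure used in this unit; it assumes simulated p-values (a uniform sample and a hypothetical sample skewed toward zero) and measures the departure from uniformity with a Kolmogorov-Smirnov distance. The function name and the simulated data are illustrative assumptions.

```python
import random


def ks_statistic_uniform(p_values):
    """Kolmogorov-Smirnov distance between the empirical CDF of the
    p-values and the Uniform(0, 1) CDF, F(x) = x."""
    xs = sorted(p_values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i + 1)/n at x.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


rng = random.Random(0)

# Hypothetical null p-values for two pathways (simulated, not real data).
uniform_null = [rng.random() for _ in range(2000)]
biased_null = [rng.random() ** 3 for _ in range(2000)]  # skewed toward zero

d_uniform = ks_statistic_uniform(uniform_null)
d_biased = ks_statistic_uniform(biased_null)

# Fraction of null p-values below 0.05: each one is a false positive.
frac_false_pos = sum(p < 0.05 for p in biased_null) / len(biased_null)

print(f"KS distance, uniform null: {d_uniform:.3f}")
print(f"KS distance, biased null:  {d_biased:.3f}")
print(f"False-positive rate at 0.05 for the biased pathway: {frac_false_pos:.2f}")
```

For a well-calibrated method, roughly 5% of null p-values fall below 0.05; for the skewed sample the fraction is far higher, which is the false-positive mechanism described above.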

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088

Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions - applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0 - making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.

Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods (continued)

Method | Experiment input (a) | Pathway database (b) | Graph model (c) | Pathway scoring
TAPPA | Measured genes, expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated
MetPA | DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated
MetaboAnalyst | DE genes, change, and DE metabolites, change | KEGG metabolic | Single-type, directed | Hierarchically aggregated

(a) Experiment input shows the experiment data input format for each method. "DE" means differentially expressed. "Change" means fold-change value or t-statistics when comparing gene/metabolite values between two phenotypes. "Measured" means the list of all the genes/metabolites measured in the experiment. "List" represents a list of gene/metabolite identifiers (e.g., symbols). "With cut-off" marks methods that take as input the list of all measured genes and, in the analysis, mark the DE genes.
(b) Pathway database shows the name of the knowledge source for biological interactions.
(c) Graph model is a characteristic that shows whether the graph has one or multiple types of nodes, as well as whether it is directed or undirected.
(d) The package ROntoTools (Bioconductor) allows for analysis using expression values of all genes.
(e) ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA.
(f) BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

independent studies or measurements often produce different sets of differentially expressed (DE) genes (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Ein-Dor, Zuk, & Domany, 2006; Tan et al., 2003). This makes approaches that use DE genes as input appear even more unreliable.

Usually, pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products, and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds, and edges to represent reactions that transform one or more compounds into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges, rather than with nodes as they are in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally, and arguably, underdeveloped (only 128 Google Scholar citations to date for the original ScorePage paper; Rahnenfuhrer et al., 2004), and there are not many other methods available for metabolic pathway analysis. However, the topology-based analysis of signaling pathways has been very successful (over 1200 citations to date for the two impact analysis papers mentioned above; Draghici et al., 2007; Tarca et al., 2009), and over 30 other methods have been developed since.

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays, rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.

Publicly available curated pathway databases used by methods listed in Table 8.25.2 are KEGG (Ogata et al., 1999), NCI-PID (Schaefer et al., 2009), BioCarta (https://www.biocarta.com), WikiPathways (Pico et al., 2008), PANTHER (Mi et al., 2005), and Reactome (Joshi-Tope et al., 2005). These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. They are more reliable than protein interaction networks, but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG includes only about 5000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive regulation, negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation when the bait-prey relation is considered (top), or undirected when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database or may accept several.
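As a rough illustration of the parsing step, the Python sketch below turns a list of directed interactions into an adjacency matrix. It is a minimal example under stated assumptions: the node names are hypothetical, and the convention used here (row = source, column = target) is chosen for illustration and is not claimed to match the matrix layout that SPIA or any other tool expects.

```python
def adjacency_matrix(nodes, edges):
    """Build a directed adjacency matrix as a list of lists.

    matrix[i][j] == 1 means there is an interaction from nodes[i]
    to nodes[j] (an illustrative convention, not a tool's format).
    """
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


# Hypothetical three-gene signaling cascade: A -> B -> C.
toy_nodes = ["A", "B", "C"]
toy_edges = [("A", "B"), ("B", "C")]
toy_matrix = adjacency_matrix(toy_nodes, toy_edges)

for name, row in zip(toy_nodes, toy_matrix):
    print(name, row)
```

A real parser would also need to carry the interaction type (e.g., activation or inhibition), for example by keeping one matrix per type or by storing signed entries.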

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than single-type models, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe "all" and "any" relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connection only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find "important" submodules, i.e., those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the research hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules and the hypothesis is re-tested on each of them.
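The moralization step described above can be sketched in a few lines of Python. This is a generic illustration of the graph operation only, not TopologyGSA's implementation; the function name and the toy pathway are assumptions made for the example.

```python
from itertools import combinations


def moral_graph(dag):
    """Return the moral graph of a DAG as a set of undirected edges.

    `dag` maps each node to the list of its children. Each undirected
    edge is represented as a frozenset of its two endpoints.
    """
    parents = {}
    undirected = set()
    for parent, children in dag.items():
        for child in children:
            parents.setdefault(child, set()).add(parent)
            # Drop the direction of every original edge.
            undirected.add(frozenset((parent, child)))
    # "Marry" every pair of parents that share a child.
    for shared_parents in parents.values():
        for a, b in combinations(sorted(shared_parents), 2):
            undirected.add(frozenset((a, b)))
    return undirected


# Hypothetical pathway: A -> C <- B, C -> D.
toy_dag = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
moral_edges = moral_graph(toy_dag)

for edge in sorted(tuple(sorted(e)) for e in moral_edges):
    print(edge)
```

In the toy pathway, A and B are both parents of C, so the moral graph gains the extra edge A-B in addition to the three original edges with their direction removed.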

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for scoring algorithm categories.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance of the pathway statistic.

Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian network, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

Node level scoring

Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures, (iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
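Two of the centrality measures described above, degree centrality and the summed shortest distance underlying closeness, can be sketched in a few lines of Python. This is a generic illustration on a toy directed graph, not the scoring used by any of the listed tools; the function names and gene labels are hypothetical, and nodes that cannot be reached from the start node are simply ignored in the distance sum.

```python
from collections import deque


def degree_centrality(edges, nodes):
    """Count the directed edges entering plus leaving each node."""
    degree = {n: 0 for n in nodes}
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return degree


def total_distance(edges, nodes, start):
    """Sum of shortest-path lengths from `start` to every reachable node."""
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
    dist = {start: 0}
    queue = deque([start])
    while queue:  # breadth-first search gives shortest paths in hop counts
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sum(dist.values())


# Toy directed pathway (gene labels are placeholders).
nodes = ["g1", "g2", "g3", "g4"]
edges = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g3", "g4")]
degree = degree_centrality(edges, nodes)
distance_from_g1 = total_distance(edges, nodes, "g1")
```

Here g3 has the highest degree (two incoming edges and one outgoing edge), while g1, which sits upstream of every other gene, reaches all of them within two steps.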

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profiles across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
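The pair score described above, a similarity measure divided by the shortest path distance, can be sketched as follows. Pearson correlation is used as the similarity measure purely as an example; the function names, the undirected adjacency format, and the toy profiles are assumptions and do not reproduce ScorePAGE or PWEA.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def shortest_path_length(adjacency, source, target):
    """Number of edges on a shortest path, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def pair_score(profiles, adjacency, a, b):
    """Similarity of a gene pair divided by their shortest-path distance."""
    d = shortest_path_length(adjacency, a, b)
    if not d:  # unreachable pair (None) or identical nodes (0)
        return None
    return pearson(profiles[a], profiles[b]) / d


# Toy chain g1 - g2 - g3 with placeholder expression profiles.
adjacency = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2"]}
profiles = {"g1": [1, 2, 3, 4], "g2": [4, 1, 3, 2], "g3": [2, 4, 6, 8]}
score_g1_g3 = pair_score(profiles, adjacency, "g1", "g3")
```

Genes g1 and g3 are perfectly correlated but two edges apart, so their pair score is half of their correlation.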

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expression (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all its genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
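As a rough illustration of the non-linear, TAPPA-style aggregation described above, the sketch below sums, over connected node pairs only, the product of the square-root node scores multiplied by a sign weight. Taking the sign of the sum of the two node values as the weight is our assumption about how joint up- or down-regulation is encoded; the function names and toy values are hypothetical, and this is not the published TAPPA code.

```python
from math import sqrt


def sign(value):
    """Return -1, 0, or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)


def pathway_score(values, edges):
    """Non-linear pathway score for one sample.

    `values` maps each gene to its normalized log expression and `edges`
    lists the connected gene pairs. Unconnected pairs contribute nothing,
    which corresponds to a weight of 0.
    """
    score = 0.0
    for a, b in edges:
        weight = sign(values[a] + values[b])  # assumed form of the sign weight
        score += weight * sqrt(abs(values[a])) * sqrt(abs(values[b]))
    return score


# Toy sample with placeholder values chosen to give exact square roots.
values = {"g1": 4.0, "g2": 1.0, "g3": -9.0}
edges = [("g1", "g2"), ("g2", "g3")]
sample_score = pathway_score(values, edges)
```

The pair (g1, g2) contributes +2 and the pair (g2, g3) contributes -3, so the score of this sample is -1.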

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes and meaningful changes that are unlikely to occur just by chance and that, therefore, are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
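The final correction step can be sketched with the Benjamini-Hochberg procedure, which is one common choice of multiple-comparison correction; the text above does not prescribe a particular method, so this choice, the function name, the pathway labels, and the 0.05 threshold are all assumptions for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Placeholder pathway p-values from some hypothetical analysis.
names = ["pathway_a", "pathway_b", "pathway_c", "pathway_d"]
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)

threshold = 0.05  # user-defined significance threshold
significant = [n for n, p in zip(names, adjusted) if p < threshold]
```

Three of the four raw p-values fall below 0.05, but after adjustment only pathway_a remains below the threshold.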

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
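For reference, the hypergeometric tail probability used when the pathway score is the number of DE genes on the pathway can be computed as in the sketch below. The function name, argument names, and toy counts are hypothetical; the sketch inherits the independence assumption criticized above and is not taken from MetaCore.

```python
from math import comb


def hypergeom_p_value(n_genes, n_de, pathway_size, de_in_pathway):
    """P(X >= de_in_pathway) for X DE genes among `pathway_size` genes
    drawn without replacement from `n_genes` genes, `n_de` of which are DE."""
    upper = min(n_de, pathway_size)
    total = comb(n_genes, pathway_size)
    tail = sum(
        comb(n_de, k) * comb(n_genes - n_de, pathway_size - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / total


# Toy numbers: 20 measured genes, 5 DE genes, a pathway of 4 genes, 3 of them DE.
p_value = hypergeom_p_value(20, 5, 4, 3)
```

With these toy counts, 155 of the 4,845 possible four-gene sets contain at least three DE genes, giving a p-value of about 0.032.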

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable, assigned to an edge, is the probability that up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probabilities follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
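The distributional relationships mentioned above can be written out explicitly. The block below is a generic textbook statement of beta-binomial and Dirichlet-multinomial conjugacy, followed by the factorization implied by independence; the symbols (n, x, \theta, \alpha, \beta, K, m) are ours and are not the notation of the BPA or BAPA-IGGFD papers.

```latex
% Binomial count with a beta-distributed success probability
X \mid \theta \sim \mathrm{Binomial}(n, \theta), \qquad
\theta \sim \mathrm{Beta}(\alpha, \beta)
\;\Longrightarrow\;
\theta \mid X = x \sim \mathrm{Beta}(\alpha + x,\; \beta + n - x)

% Multinomial counts with Dirichlet-distributed success probabilities
(X_1, \dots, X_K) \mid \boldsymbol{\theta} \sim \mathrm{Multinomial}(n, \boldsymbol{\theta}), \qquad
\boldsymbol{\theta} \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)
\;\Longrightarrow\;
\boldsymbol{\theta} \mid \mathbf{X} = \mathbf{x} \sim \mathrm{Dirichlet}(\alpha_1 + x_1, \dots, \alpha_K + x_K)

% Joint distribution under the independence assumption
p(x_1, \dots, x_m) = \prod_{i=1}^{m} p(x_i)
```

Setting K = 2 in the second relation recovers the first, which is the sense in which the Dirichlet distribution extends the beta distribution.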

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses a Petri net (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and a Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
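To make the idea of combining edge-level p-values concrete, the sketch below implements the classical, unweighted Fisher's method, which treats the p-values as independent. It is a simplification: DRAGEN uses a weighted variant and derives the pathway p-value by permutation, neither of which is reproduced here. The function name and the toy p-values are hypothetical.

```python
from math import exp, factorial, log


def fisher_combined_p(p_values):
    """Combine independent p-values with the unweighted Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null; for even degrees of freedom
    its upper tail has the closed form used below.
    """
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # equals X / 2
    return exp(-half_stat) * sum(half_stat**i / factorial(i) for i in range(k))


# Two placeholder edge-level p-values.
combined = fisher_combined_p([0.05, 0.05])
```

Two borderline p-values of 0.05 combine into a smaller p-value of roughly 0.017, and a single p-value is returned unchanged.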

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000 despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis: Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
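The random relabeling scheme described above can be sketched as follows. This is an illustration of the sampling design only, not the code used for the study: the function name, the sample identifiers, and the fixed seed are assumptions, and only 5 repeats per design are drawn here instead of 10,000.

```python
import random


def random_splits(sample_ids, n_control, n_disease, n_repeats, seed=0):
    """Draw `n_repeats` random control/disease labelings of control samples."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    splits = []
    for _ in range(n_repeats):
        # Sampling without replacement keeps the two groups disjoint.
        chosen = rng.sample(sample_ids, n_control + n_disease)
        splits.append((chosen[:n_control], chosen[n_control:]))
    return splits


# 140 pooled control samples (identifiers are placeholders).
samples = [f"control_{i}" for i in range(140)]

# The four group-size designs from the text: (controls, disease).
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]
null_datasets = {d: random_splits(samples, *d, n_repeats=5) for d in designs}
```

Each simulated dataset would then be passed to the pathway analysis method under study, and the resulting p-values collected per pathway.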

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population, which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
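One simple way to quantify how far a pathway's null p-values depart from uniformity is the Kolmogorov-Smirnov distance between their empirical distribution and Uniform(0, 1). The sketch below is our illustration of that idea, not the analysis behind Figure 8.25.4; the function name and the two toy sets of p-values are hypothetical.

```python
def ks_distance_from_uniform(p_values):
    """Kolmogorov-Smirnov distance between the empirical CDF of the
    p-values and the Uniform(0, 1) CDF (0 = perfectly uniform)."""
    ordered = sorted(p_values)
    n = len(ordered)
    # Compare the uniform CDF (equal to p) with the empirical CDF
    # just after and just before each observed p-value.
    return max(
        max((i + 1) / n - p, p - i / n)
        for i, p in enumerate(ordered)
    )


# Placeholder null p-values: evenly spread versus piled up near zero.
evenly_spread = [0.125, 0.375, 0.625, 0.875]
biased_to_zero = [0.01, 0.02, 0.03, 0.04]

d_even = ks_distance_from_uniform(evenly_spread)
d_biased = ks_distance_from_uniform(biased_to_zero)
```

The evenly spread p-values give a small distance (0.125), whereas the p-values concentrated near zero give a distance close to one (0.96), the pattern expected for a pathway prone to false positives.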

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distributions of p-values from GSA; panels C0-C6 (purple) show the distributions of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex biomolecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways," in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective and relies on more than just a couple of datasets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other, more recent papers.
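The target-pathway criterion described above reduces to looking up the rank and p-value of one pathway in a method's output. The sketch below shows that bookkeeping under simple assumptions: the function name, the pathway names, and the p-values are hypothetical, and ties between p-values are not treated specially.

```python
def target_pathway_rank(results, target):
    """Return (rank, p-value) of the target pathway.

    `results` maps pathway names to p-values; rank 1 is the most
    significant pathway. Lower rank and lower p-value are better.
    """
    ranked = sorted(results, key=results.get)
    return ranked.index(target) + 1, results[target]


# Placeholder output of some pathway analysis method on a colon cancer dataset.
results = {
    "Colorectal cancer": 0.002,
    "Cell cycle": 0.0004,
    "Insulin signaling pathway": 0.31,
}
rank, p_value = target_pathway_rank(results, "Colorectal cancer")
```

Repeating this over all benchmark datasets yields, for each method, a distribution of target-pathway ranks and p-values that can be compared across methods.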

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of datasets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated not with only one target pathway, but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to the complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
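One generic way to check whether a pathway's null p-values depart from uniformity is the Kolmogorov-Smirnov distance to the Uniform(0, 1) distribution. The sketch below is only an illustration under stated assumptions: it is not the procedure used to obtain the results summarized above, and the p-values are simulated rather than taken from control samples.

```python
import random


def ks_uniform_statistic(pvalues):
    """Kolmogorov-Smirnov distance between the empirical CDF of `pvalues` and Uniform(0, 1)."""
    xs = sorted(pvalues)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The uniform CDF is F(x) = x; compare it with the empirical CDF
        # just after the jump at x, (i + 1) / n, and just before it, i / n.
        d = max(d, abs((i + 1) / n - x), abs(x - i / n))
    return d


random.seed(0)
uniform_null = [random.random() for _ in range(1000)]
# Simulated biased pathway: cubing shifts the mass of the p-values towards zero.
biased_to_zero = [random.random() ** 3 for _ in range(1000)]

print(ks_uniform_statistic(uniform_null))    # small distance
print(ks_uniform_statistic(biased_to_zero))  # much larger distance
```

A pathway whose null p-values give a large distance, with the excess mass near zero, would be expected to produce false positives in the sense described above.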

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner M Ball C A Blake J A BotsteinD Butler H Cherry J M Sherlock G(2000) Gene Ontology Tool for the unifica-tion of biology Nature Genetics 25 25ndash29 doi10103875556

Barrett T Wilhite S E Ledoux P Evange-lista C Kim I F Tomashevsky M Soboleva A (2013) NCBI GEO Archivefor functional genomics data setsndashupdate Nu-cleic Acids Research 41(D1) D991ndashD995 doi101093nargks1193

Barton S J Crozier S R Lillycrop K AGodfrey K M amp Inskip H M (2013)Correction of unexpected distributions of Pvalues from analysis of whole genome ar-rays by rectifying violation of statistical as-sumptions BMC Genomics 14(1) 161 doi1011861471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beiẞbarth T amp Speed T P (2004) GOstatFind statistically overrepresented Gene Ontolo-gies within a group of genes Bioinformat-ics 20 1464ndash1465 doi 101093bioinformat-icsbth088

Beltrame L Calura E Popovici R R RizzettoL Guedez D R Donato M Cavalieri D(2011) The biological connection markup lan-guage A SBGN-compliant format for visual-ization filtering and analysis of biological path-ways Bioinformatics 27(15) 2127ndash2133 doi101093bioinformaticsbtr339

Ben-Shaul Y Bergman H amp Soreq H (2005)Identifying subtle interrelated changes in func-tional gene categories using continuous mea-sures of gene expression Bioinformatics21(7) 1129ndash1137 doi 101093bioinformat-icsbti149

Bland M (2013) Do baseline p-values followa uniform distribution in randomised trialsPloS One 8(10) e76010 doi 101371jour-nalpone0076010

Bokanizad B Tagett R Ansari S Helmi BH amp Draghici S (2016) SPATIAL A system-level PAThway impact AnaLysis approach Nu-cleic Acids Research 44(11) 5034ndash5044 doi101093nargkw429

Brazma A Parkinson H Sarkans U Shojata-lab M Vilo J Abeygunawardena N Sansone S-A (2003) ArrayExpressmdasha publicrepository for microarray gene expression dataat the EBI Nucleic Acids Research 31(1) 68ndash71 doi 101093nargkg091

Calura E Martini P Sales G Beltrame LChiorino G DrsquoIncalci M Romualdi C(2014) Wiring miRNAs to pathways A topo-logical approach to integrate miRNA and mRNAexpression profiles Nucleic Acids Research42(11) e96 doi 101093nargku354

Cerami E Gao J Dogrusoz U GrossB E Sumer S O Aksoy B A Larsson E (2012) The cBio cancer ge-nomics portal An open platform for ex-ploring multidimensional cancer genomicsdata Cancer Discovery 2(5) 401ndash404 doi1011582159-8290CD-12-0095

Chuang H-Y Hofree M amp Ideker T (2010) Adecade of systems biology Annual Review ofCell and Developmental Biology 26 721ndash744doi 101146annurev-cellbio-100109-104122

Croft D Mundo A F Haw R Milacic MWeiser J Wu G DrsquoEustachio P (2014)The Reactome pathway knowledgebase Nu-cleic Acids Research 42(D1) D472ndashD477 doi101093nargkt1102

Demir E Cary M P Paley S Fukuda K LemerC Vastrik I Bader G D (2010) TheBioPAX community standard for pathway datasharing Nature Biotechnology 28(9) 935ndash942doi 101038nbt1666

Diaz D Donato M Nguyen T amp Draghici S(2016) MicroRNA-augmented pathways (mi-rAP) and their applications to pathway analy-sis and disease subtyping In Pacific Symposiumon Biocomputing Pacific Symposium on Bio-computing volume 22 page 390 NIH PublicAccess

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici S Khatri P Tarca A L Amin KDone A Voichita C Romero R (2007)A systems biology approach for pathway levelanalysis Genome Research 17(10) 1537ndash1545doi 101101gr6202607

Edgar R Domrachev M amp Lash A E (2002)Gene Expression Omnibus NCBI gene expres-sion and hybridization array data repositoryNucleic Acids Research 30(1) 207ndash210 doi101093nar301207

Efron B (1979) Bootstrap methods Another lookat the jackknife The Annals of Statistics 7(1)1ndash26 doi 101214aos1176344552

Efron B amp Tibshirani R (2007) On testingthe significance of sets of genes The An-nals of Applied Statistics 1(1) 107ndash129 doi10121407-AOAS101

Efroni S Schaefer C F amp Buetow K H (2007)Identification of key processes underlying can-cer phenotypes using biologic pathway analy-sis PLoS One 2(5) e425 doi 101371jour-nalpone0000425

Ein-Dor L Kela I Getz G Givol D amp Do-many E (2005) Outcome signature genes inbreast cancer Is there a unique set Bioinfor-matics 21(2) 171ndash178 doi 101093bioinfor-maticsbth469

Ein-Dor L Zuk O amp Domany E (2006)Thousands of samples are needed to gen-erate a robust gene list for predicting out-come in cancer In Proceedings of the NationalAcademy of Sciences 103(15) 5923ndash5928 doi101073pnas0601231103

Emmert-Streib F amp Glazko G V (2011) Path-way analysis of expression data Decipheringfunctional building blocks of complex diseasesPLoS Computational Biology 7(5) e1002053doi 101371journalpcbi1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang Z Tian W amp Ji H (2011) A network-based gene-weighting approach for pathwayanalysis Cell Research 22(3) 565ndash580 doi101038cr2011149

Farfan F Ma J Sartor M A Michai-lidis G amp Jagadish H V (2012) THINKBack Knowledge-based interpretation of highthroughput data BMC Bioinformatics 13(Suppl2) S4 doi 1011861471-2105-13-S2-S4

Fisher R A (1925) Statistical methods for re-search workers Edinburgh Oliver amp Boyd

Fodor A A Tickle T L amp Richardson C (2007)Towards the uniform distribution of null P val-ues on Affymetrix microarrays Genome Biol-ogy 8(5) R69 doi 101186gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao S amp Wang X (2007) TAPPA Topolog-ical analysis of pathway phenotype associa-tion Bioinformatics 23(22) 3100ndash3102 doi101093bioinformaticsbtm460

Geistlinger L Csaba G Kuffner R Mul-der N amp Zimmer R (2011) From sets tographs Towards a realistic enrichment analy-sis of transcriptomic systems Bioinformatics27(13) i366ndashi373 doi 101093bioinformat-icsbtr228

Glaab E Baudot A Krasnogor N amp ValenciaA (2010) TopoGSA Network topological geneset analysis Bioinformatics 26(9) 1271ndash1272doi 101093bioinformaticsbtq131

Glaab E Baudot A Krasnogor N SchneiderR amp Valencia A (2012) EnrichNet Network-based gene set enrichment analysis Bioinfor-matics 28(18) i451ndashi457 doi 101093bioin-formaticsbts389

Greenblum S Efroni S Schaefer C amp BuetowK (2011) The PathOlogist An automated toolfor pathway-centric analysis BMC Bioinformat-ics 12(1) 133 doi 1011861471-2105-12-133

Gu Z Liu J Cao K Zhang J ampWang J (2012) Centrality-based pathway en-richment A systematic approach for find-ing significant pathways dominated by keygenes BMC Systems Biology 6(1) 56 doi1011861752-0509-6-56

Haynes W A Higdon R Stanberry L CollinsD amp Kolker E (2013) Differential expres-sion analysis for pathways PLoS ComputationalBiology 9(3) e1002967 doi 101371jour-nalpcbi1002967

Hung J-H Whitfield T W Yang T-H HuZ Weng Z amp DeLisi C (2010) Identifica-tion of functional modules that correlate withphenotypic difference The influence of net-work topology Genome Biology 11(2) R23doi 101186gb-2010-11-2-r23

Ibrahim M A-H Jassim S Cawthorne MA amp Langlands K (2012) A topology-based score for pathway enrichment Journalof Computational Biology 19(5) 563ndash573 doi101089cmb20110182

Ihnatova I amp Budinska E (2015) ToPASeqAn R package for topology-based path-way analysis of microarray and RNA-Seqdata BMC Bioinformatics 16(1) 350 doi101186s12859-015-0763-1

Isci S Ozturk C Jones J amp Otu H H (2011)Pathway analysis of high-throughput biolog-ical data within a Bayesian network frame-work Bioinformatics 27(12) 1667ndash1674 doi101093bioinformaticsbtr269

Jacob L Neuvial P amp Dudoit S (2010)Gains in power from structured two-sampletests of means on graphs Arxiv preprintarXiv10095173

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa M amp Goto S (2000) KEGG Ky-oto encyclopedia of genes and genomesNucleic Acids Research 28(1) 27ndash30 doi101093nar28127

Kanehisa M Furumichi M Tanabe M Sato Yamp Morishima K (2017) KEGG New perspec-tives on genomes pathways diseases and drugsNucleic Acids Research 45(D1) D353ndashD361doi 101093nargkw1092

Kelder T Conklin B R Evelo C T ampPico A R (2010) Finding the right ques-tions Exploratory pathway analysis to enhancebiological discovery in large datasets PLoSBiology 8(8) e1000472 doi 101371jour-nalpbio1000472

Khatri P Draghici S Ostermeier G C ampKrawetz S A (2002) Profiling gene expressionusing Onto-Express Genomics 79(2) 266ndash270doi 101006geno20026698

Khatri P Sirota M amp Butte A J (2012)Ten years of pathway analysis Current ap-proaches and outstanding challenges PLoSComputational Biology 8(2) e1002375 doi101371journalpcbi1002375

Khatri P Voichita C Kattan K Ansari N Kha-tri A Georgescu C Draghici S (2007)Onto-Tools New additions and improvementsin 2006 Nucleic Acids Research 35(WebServer issue) W206ndashW211 doi 101093nargkm327

Kuffner R Petri T Windhager L amp Zim-mer R (2010) Petri nets with fuzzy logic(PNFL) Reverse engineering and parametriza-tion PLoS One 5(9) e12807 doi 101371jour-nalpone0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon A Subramanian A Pinchback RThorvaldsdottir H Tamayo P amp MesirovJ P (2011) Molecular signatures database(MSigDB) 30 Bioinformatics 27(12) 1739ndash1740 doi 101093bioinformaticsbtr260

Ma S Jiang T amp Jiang R (2014) Differen-tial regulation enrichment analysis via the in-tegration of transcriptional regulatory networkand gene expression data Bioinformatics 31(4)563ndash571 doi 101093bioinformaticsbtu672

Martini P Sales G Massa M S Chiogna Mamp Romualdi C (2013) Along signal paths Anempirical gene set approach exploiting pathwaytopology Nucleic Acids Research 41(1) e19ndashe19 doi 101093nargks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi H Lazareva-Ulitsky B Loo R KejariwalA Vandergriff J Rabkin S Thomas PD (2005) The PANTHER database of proteinfamilies subfamilies functions and pathwaysNucleic Acids Research 33(Suppl 1) D284ndashD288 doi 101093nargki078

Mieczkowski J Swiatek-Machado K amp Kamin-ska B (2012) Identification of pathwayderegulationndashgene expression based analysis ofconsistent signal transduction PLoS One 7(7)e41541 doi 101371journalpone0041541

Misman M F Deris S Hashim S Z M JumaliR amp Mohamad M S (2009) Pathway-basedmicroarray analysis for defining statistical sig-nificant phenotype-related pathways A reviewof common approaches In Information Man-agement and Engineering 2009 ICIMErsquo09International Conference on pages 496ndash500Piscataway NJ IEEE

Mitrea C Taghavi Z Bokanizad B HanoudiS Tagett R Donato M DraghiciS (2013) Methods and approaches in thetopology-based analysis of biological path-ways Frontiers in Physiology 4 278 doi103389fphys201300278

Mootha V K Lindgren C M Eriksson K-F Subramanian A Sihag S Lehar J Groop L C (2003) PGC-11α-responsivegenes involved in oxidative phosphorylationare coordinately downregulated in human di-abetes Nature Genetics 34(3) 267ndash273 doi101038ng1180

Murata T (1989) Petri nets Properties analy-sis and applications Proceedings of the IEEE77(4) 541ndash580 doi 101109524143

Nam D amp Kim S-Y (2008) Gene-set ap-proach for expression pattern analysis Brief-ings in Bioinformatics 9(3) 189ndash197 doi101093bibbbn001

Nguyen T Diaz D Tagett R amp Draghici S(2016) Overcoming the matched-sample bottle-neck An orthogonal approach to integrate omicdata Nature Scientific Reports 6 29251 doi101038srep29251

Nguyen T Mitrea C Tagett R amp Draghici S(2017) DANUBE Data-driven meta-ANalysisusing UnBiased Empirical distributionsmdashapplied to biological pathway analysis Pro-ceedings of the IEEE 105(3) 496ndash515 doi101109JPROC20152507119

Nguyen T Tagett R Donato M Mitrea Camp Draghici S (2016) A novel bi-level meta-analysis approach-applied to biological pathwayanalysis Bioinformatics 32(3) 409ndash416 doi101093bioinformaticsbtv588

Ogata H Goto S Sato K Fujibuchi WBono H amp Kanehisa M (1999) KEGGKyoto Encyclopedia of Genes and GenomesNucleic Acids Research 27(1) 29ndash34 doi101093nar27129

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico A R Kelder T van Iersel M P HanspersK Conklin B R amp Evelo C (2008)WikiPathways Pathway editing for the peoplePLoS Biology 6(7) e184 doi 101371jour-nalpbio0060184

Rahnenfuhrer J Domingues F S Maydt J ampLengauer T (2004) Calculating the statisticalsignificance of changes in pathway activity fromgene expression data Statistical Applicationsin Genetics and Molecular Biology 3(1) doi1022021544-61151055

Rhee Y S Wood V Dolinski K amp DraghiciS (2008) Use and misuse of the Gene Ontol-ogy annotations Nature Reviews Genetics 9(7)509ndash515 doi 101038nrg2363

Rustici G Kolesnikov N Brandizi M Bur-dett T Dylag M Emam I Sarkans U(2013) ArrayExpress updatendashtrends in databasegrowth and links to data analysis tools Nu-cleic Acids Research 41(D1) D987ndashD990 doi101093nargks1174

Sackett D L (1979) Bias in analytic researchJournal of Chronic Diseases 32(1-2) 51ndash63doi 1010160021-9681(79)90012-2

Sandve G K Nekrutenko A Taylor J amp HovigE (2013) Ten simple rules for reproduciblecomputational research PLoS ComputationalBiology 9(10) e1003285 doi 101371jour-nalpcbi1003285

Sartor M A Leikauf G D amp Medvedovic M(2009) LRpath A logistic regression approachfor identifying enriched biological groups ingene expression data Bioinformatics 25(2)211ndash217 doi 101093bioinformaticsbtn592

Schaefer C F Anthony K Krupa S Buchoff JDay M Hannay T Buetow K H (2009)PID The pathway interaction database NucleicAcids Research 37(Suppl 1) D674ndashD679 doi101093nargkn653

Shojaie A amp Michailidis G (2009) Analysis ofgene sets based on the underlying regulatory net-work Journal of Computational Biology 16(3)407ndash426 doi 101089cmb20080081

Shojaie A amp Michailidis G (2010) Networkenrichment analysis in complex experimentsStatistical Applications in Genetics and Molec-ular Biology 9(1) doi 1022021544-61151483

Storey J D amp Tibshirani R (2003) Statisticalsignificance for genomewide studies Proceed-ings of the National Academy of Sciences of theUnited States of America 100(16) 9440ndash9445doi 101073pnas1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan P K Downey T J Spitznagel E L Jr XuP Fu D Dimitrov D S Cam M C(2003) Evaluation of gene expression measure-ments from commercial microarray platformsNucleic Acids Research 31(19) 5676ndash5684doi 101093nargkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca A L Draghici S Bhatti G ampRomero R (2012) Down-weighting over-lapping genes improves gene set analy-sis BMC Bioinformatics 13(1) 136 doihttpsdoiorg1011861471-2105-13-136

Tarca A L Draghici S Khatri P HassanS S Mittal P Kim J-S Romero R(2009) A novel signaling pathway impact anal-ysis (SPIA) Bioinformatics 25(1) 75ndash82 doi101093bioinformaticsbtn577

Tavazoie S Hughes J D Campbell M J ChoR J amp Church G M (1999) Systematic de-termination of genetic network architecture Na-ture Genetics 22 281ndash285 doi 10103810343

Varadan V Mittal P Vaske C J amp Benz SC (2012) The integration of biological path-way knowledge in cancer genomics A reviewof existing computational approaches SignalProcessing Magazine IEEE 29(1) 35ndash50 doi101109MSP2011943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita C Donato M amp Draghici S (2012) In-corporating gene significance in the impact anal-ysis of signaling pathways In Machine Learningand Applications (ICMLA) 2012 11th Interna-tional Conference on volume 1 pages 126ndash131Boca Raton FL USA 12ndash15 December 2012Piscataway NJ IEEE

Wang E T Sandberg R Luo S Khrebtukova IZhang L Mayr C Burge C B (2008)Alternative isoform regulation in human tis-sue transcriptomes Nature 456(7221) 470 doi101038nature07509

Xia J amp Wishart D S (2010) MetPAA web-based metabolomics tool for path-way analysis and visualization Bioinformatics26(18) 2342ndash2344 doi 101093bioinformat-icsbtq418

Xia J amp Wishart D S (2011) Web-based infer-ence of biological patterns functions and path-ways from metabolomic data using Metabo-Analyst Nature Protocols 6(6) 743 doi101038nprot2011319

Xia J Sinelnikov I V Han B ampWishart D S (2015) MetaboAnalyst 30mdashmaking metabolomics more meaningful Nu-cleic Acids Research 43(W1) W251ndashW257doi 101093nargkv380

Zhao Y Chen M-H Pei B Rowe D Shin D-G Xie W Kuo L (2012) A Bayesian ap-proach to pathway analysis by integrating genendashgene functional directions and microarray dataStatistics in Biosciences 4(1) 105ndash131 doi101007s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1

Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive or negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges, or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation, when the bait-prey relation is considered (top), or undirected when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others.

genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014).

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database or may accept several.
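As a minimal sketch of the parsing step, the snippet below turns a list of directed interactions into an adjacency matrix. The gene names and the convention that rows are sources and columns are targets are assumptions made for illustration; individual tools may use a different orientation or keep one matrix per interaction type.

```python
def adjacency_matrix(nodes, edges):
    """Build a directed 0/1 adjacency matrix.

    Entry [i][j] is 1 if there is an edge nodes[i] -> nodes[j]
    (row = source, column = target; an assumed convention).
    """
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


# Hypothetical three-gene signaling cascade: A -> B -> C, plus A -> C.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C"), ("A", "C")]
for row in adjacency_matrix(nodes, edges):
    print(row)
# [0, 1, 1]
# [0, 0, 1]
# [0, 0, 0]
```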

To the best of our knowledge, there have been no comprehensive approaches to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with one single database.

Graph Models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: (i) single-type and (ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Fig. 8.25.2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Fig. 8.25.2). Multi-type graph models are more complex than single-type, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe "all" and "any" relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connection only between nodes of different types, are a particular case of multi-type graph models.
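The difference between "all" and "any" relations can be sketched with an explicit interaction node, as in a bipartite model. The function and the node names below are hypothetical and are not taken from any of the surveyed tools.

```python
def interaction_active(gate, inputs, active):
    """Evaluate one interaction node of a bipartite pathway graph.

    gate   -- "all" (AND: every input component must be active) or
              "any" (OR: one active input component suffices)
    inputs -- component nodes feeding into the interaction node
    active -- set of component nodes currently considered active
    """
    states = [component in active for component in inputs]
    if gate == "all":
        return all(states)
    if gate == "any":
        return any(states)
    raise ValueError(f"unknown gate type: {gate}")


# Hypothetical interactions: complex formation needs both subunits ("all"),
# whereas either of two kinases can phosphorylate the substrate ("any").
active = {"subunit1", "kinaseB"}
print(interaction_active("all", ["subunit1", "subunit2"], active))  # False
print(interaction_active("any", ["kinaseA", "kinaseB"], active))    # True
```

In a single-type graph, both situations would appear as two edges pointing into the same target node, so the distinction would be lost.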

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find "important" submodules (those that drive the changes across the whole pathway). For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the null hypothesis of no change is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is retested on each of them.
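The moralization step defined above is simple enough to sketch directly. The code below is a generic illustration, not the TopologyGSA implementation; the DAG is hypothetical, and the subsequent triangulation step is not shown.

```python
from itertools import combinations


def moral_graph(parents):
    """Moralize a DAG given as {node: list of parent nodes}.

    Returns a set of undirected edges, each stored as a frozenset of two nodes.
    """
    edges = set()
    for child, pars in parents.items():
        # Drop directions: every parent -> child edge becomes undirected.
        for parent in pars:
            edges.add(frozenset((parent, child)))
        # "Marry" the parents: connect every pair of parents of the same child.
        for a, b in combinations(pars, 2):
            edges.add(frozenset((a, b)))
    return edges


# Hypothetical DAG: A -> C, B -> C, C -> D.
dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
for edge in sorted(sorted(e) for e in moral_graph(dag)):
    print(edge)
# ['A', 'B']
# ['A', 'C']
# ['B', 'C']
# ['C', 'D']
```

The edge between A and B is the only one added by marrying; the other three are the original edges with their direction removed.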

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for scoring algorithm categories.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.
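The three levels can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not the algorithm of any surveyed method: the node scores are made-up absolute fold changes that ignore topology, the aggregation is a plain sum, and significance is assessed by comparing against random gene sets of the same size.

```python
import random


def pathway_score(node_scores, pathway):
    """Level 2: aggregate node-level scores (here, a plain sum) into a pathway score."""
    return sum(node_scores[g] for g in pathway)


def permutation_pvalue(node_scores, pathway, n_perm=2000, seed=0):
    """Level 3: compare the observed score with scores of random gene sets of equal size."""
    rng = random.Random(seed)
    genes = list(node_scores)
    observed = pathway_score(node_scores, pathway)
    hits = sum(
        pathway_score(node_scores, rng.sample(genes, len(pathway))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids a p-value of exactly 0


# Level 1: node scores; here, made-up absolute log fold changes for 50 genes.
rng = random.Random(1)
node_scores = {f"g{i}": abs(rng.gauss(0, 1)) for i in range(50)}
for g in ("g0", "g1", "g2", "g3"):
    node_scores[g] += 3.0  # pretend these four genes are strongly perturbed

print(permutation_pvalue(node_scores, ["g0", "g1", "g2", "g3"]))
```

A topology-aware method would replace the level 1 scores with values that also depend on each node's position and neighbors in the pathway graph.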

Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in the multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian networks, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

Node level scoring

Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures, (iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
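To make the centrality definitions above concrete, the following is a minimal sketch in plain Python, not the implementation used by any of the surveyed tools. The toy pathway, the function names, and the choice to symmetrize the graph and add the identity matrix before power iteration (so that the iteration converges on small or bipartite graphs) are all assumptions made for illustration.

```python
import math
from collections import deque

def degree_centrality(nodes, edges):
    """Number of directed edges entering plus leaving each node."""
    deg = {n: 0 for n in nodes}
    for src, dst in edges:
        deg[src] += 1  # edge leaving src
        deg[dst] += 1  # edge entering dst
    return deg

def closeness_sum(nodes, edges):
    """Sum of shortest directed distances from each node to every node it reaches."""
    succ = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
    total = {}
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search
            u = queue.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total[start] = sum(dist.values())
    return total

def eigenvector_centrality(nodes, edges, iterations=200):
    """Power iteration on the symmetrized adjacency matrix plus identity (assumption)."""
    nbrs = {n: set() for n in nodes}
    for src, dst in edges:
        nbrs[src].add(dst)
        nbrs[dst].add(src)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[m] for m in nbrs[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new.values()))
        score = {n: v / norm for n, v in new.items()}
    return score

# Hypothetical three-gene cascade A -> B -> C
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C")]
print(degree_centrality(nodes, edges))
print(closeness_sum(nodes, edges))
print(eigenvector_centrality(nodes, edges))
```

In this toy cascade the middle gene B receives the highest degree and eigenvector scores, while the upstream gene A has the largest distance sum because it reaches both other nodes.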

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profiles across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
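The ratio described above can be sketched as follows. This is an illustrative example only: Pearson correlation is assumed as the similarity measure, the graph is treated as undirected, and the gene names and expression values are made up. It is not the exact scoring formula of ScorePAGE or PWEA.

```python
import math
from collections import deque
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def shortest_distance(adj, source, target):
    """Breadth-first search distance in edges, or None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

def pair_scores(expr, edges):
    """Score each connected gene pair as similarity / shortest path distance."""
    adj = {g: set() for g in expr}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    scores = {}
    for g, h in combinations(sorted(expr), 2):
        d = shortest_distance(adj, g, h)
        if d:  # skip pairs with no connecting path
            scores[(g, h)] = pearson(expr[g], expr[h]) / d
    return scores

# Hypothetical expression profiles across four samples
expr = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
edges = [("g1", "g2"), ("g2", "g3")]
scores = pair_scores(expr, edges)
print(scores)
```

Pairs that are far apart in the pathway contribute less, so the anti-correlated pair g1 and g3, two steps apart, gets a score of about -0.5 rather than -1.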

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expression (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum over all of its genes, while TBScore weights the pathway DE genes based on their log fold-change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
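The TAPPA-style aggregation described above can be sketched for a single sample as follows. This is a hedged illustration rather than the published implementation: the node score is taken as the square root of the absolute normalized value, and the sign weight is assumed to be the sign of the sum of the two node values. The function name and the example values are hypothetical.

```python
import math

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def tappa_like_score(values, edges):
    """Non-linear pathway score for one sample.

    values: gene -> normalized log expression (node value).
    edges:  connected gene pairs; unconnected pairs have weight 0 and are skipped.
    """
    total = 0.0
    for g, h in edges:
        xg, xh = values[g], values[h]
        # Weight is the sign of joint regulation (assumed: sign of the sum)
        total += sign(xg + xh) * math.sqrt(abs(xg)) * math.sqrt(abs(xh))
    return total

# Hypothetical node values for a three-gene pathway
values = {"a": 4.0, "b": 1.0, "c": -9.0}
edges = [("a", "b"), ("b", "c")]
print(tappa_like_score(values, edges))  # 2.0 - 3.0 = -1.0
```

Because only connected pairs contribute, two pathways with the same genes but different wiring can receive different scores from identical expression data.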

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristics of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes and meaningful changes that are unlikely to occur just by chance and that are therefore possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
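The text does not name a specific multiple-comparison correction. As one common choice, the sketch below applies the Benjamini-Hochberg false discovery rate adjustment to a list of pathway p-values. The procedure is an assumption made for illustration, and the input values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four pathways
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
print(adj)
```

A pathway would then be reported as significant when its adjusted p-value falls below the user-defined threshold.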

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
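The hypergeometric assumption mentioned above leads to the familiar over-representation p-value. The sketch below computes the upper tail probability directly from binomial coefficients. The function and parameter names are hypothetical, and the calculation inherits the independence assumption that the text criticizes.

```python
from math import comb

def hypergeom_pvalue(total_genes, pathway_genes, de_genes, de_in_pathway):
    """P(X >= de_in_pathway) when de_genes are drawn without replacement
    from total_genes, of which pathway_genes belong to the pathway."""
    upper = min(pathway_genes, de_genes)
    denom = comb(total_genes, de_genes)
    tail = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, de_genes - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / denom

# Toy example: 10 measured genes, 4 on the pathway, 3 DE genes, 2 of them on the pathway
p = hypergeom_pvalue(10, 4, 3, 2)
print(p)  # (6*6 + 4*1) / 120 = 1/3
```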

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and from the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable is assigned to an edge and is the probability that up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probabilities follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
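The beta-binomial and Dirichlet-multinomial pairs mentioned above are standard conjugate models, and their posterior updates take a simple closed form. The sketch below shows only this generic textbook update. It does not reproduce how BPA or BAPA-IGGFD estimate their parameters, and the prior values and counts are invented for illustration.

```python
def beta_posterior(alpha, beta, successes, trials):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior parameters."""
    return alpha + successes, beta + (trials - successes)

def dirichlet_posterior(alphas, counts):
    """Dirichlet(alphas) prior + multinomial counts -> Dirichlet posterior parameters."""
    return [a + c for a, c in zip(alphas, counts)]

def dirichlet_mean(alphas):
    """Posterior mean of the category probabilities."""
    total = sum(alphas)
    return [a / total for a in alphas]

# Hypothetical example: uniform prior, 7 concordant observations out of 10
a_post, b_post = beta_posterior(1, 1, successes=7, trials=10)
print(a_post, b_post, a_post / (a_post + b_post))  # posterior mean 8/12

# Hypothetical three-state variable with observed state counts 2, 5, 3
post = dirichlet_posterior([1, 1, 1], [2, 5, 3])
print(post, dirichlet_mean(post))
```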

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses Petri nets (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
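To illustrate how edge-level p-values can be combined into one pathway statistic, the sketch below implements the classical, unweighted Fisher's method. DRAGEN is described as using a weighted variant, which is not reproduced here. The closed-form tail probability relies on the chi-square distribution having an even number of degrees of freedom (2k for k p-values), and all inputs must be strictly positive.

```python
import math

def fisher_combine(pvalues):
    """Unweighted Fisher's method.

    Returns (statistic, combined p-value), where the statistic
    -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees
    of freedom under the null, assuming independent p-values.
    """
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    half = stat / 2.0
    # Chi-square survival function for 2k d.o.f.:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, tail = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        tail += term
    return stat, math.exp(-half) * tail

# Hypothetical p-values for the edges of one pathway
stat, p_combined = fisher_combine([0.5, 0.5])
print(stat, p_combined)
```

With a single p-value the method returns that p-value unchanged, which is a convenient sanity check.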

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allow us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000 despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods such as the t-test inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis: Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
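The random relabeling step of this simulation can be sketched as follows. The sample identifiers are placeholders, the number of simulated datasets is reduced from 10,000 to 5 per design to keep the example fast, and drawing the smaller designs without replacement from the pooled 140 controls is an assumption, since the text does not state how those subsets are selected.

```python
import random

def random_splits(sample_ids, n_control, n_disease, n_datasets, seed=0):
    """Randomly assign pooled control samples to mock 'control' and 'disease' groups."""
    rng = random.Random(seed)  # seeded for reproducibility
    splits = []
    for _ in range(n_datasets):
        chosen = rng.sample(sample_ids, n_control + n_disease)  # without replacement
        splits.append((chosen[:n_control], chosen[n_control:]))
    return splits

# Placeholder identifiers for the 140 pooled control samples
pooled = [f"ctrl_{i:03d}" for i in range(140)]

# The four designs described in the text: (control, disease) group sizes
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]
simulated = {d: random_splits(pooled, *d, n_datasets=5) for d in designs}

for design, splits in simulated.items():
    print(design, len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each mock dataset would then be passed to a pathway analysis method. Because all samples are controls, any p-value it produces is a draw from that method's empirical null distribution.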

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
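One simple way to quantify how far a set of null p-values departs from uniformity is the one-sample Kolmogorov-Smirnov statistic against the Uniform(0, 1) distribution. The text does not prescribe this particular measure, so the sketch below is an assumed illustration with synthetic p-values rather than a reproduction of the analysis in the figure.

```python
def ks_uniform_statistic(pvalues):
    """Largest gap between the empirical CDF of the p-values and the Uniform(0, 1) CDF."""
    xs = sorted(pvalues)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # Compare the uniform CDF (x itself) with the ECDF just after and just before x
        d = max(d, i / n - x, x - (i - 1) / n)
    return d

# Synthetic examples: evenly spread p-values versus p-values piled up near zero
uniform_like = [(i + 0.5) / 100 for i in range(100)]
biased_to_zero = [0.001] * 100

print(ks_uniform_statistic(uniform_like))    # small: close to uniform
print(ks_uniform_statistic(biased_to_zero))  # near 1: strongly nonuniform
```

As a rough guide, for large n the statistic exceeds about 1.36 / sqrt(n) with probability 0.05 when the p-values are truly uniform, so values far above that bound indicate a biased null distribution.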

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distributions of p-values from GSA; panels C0-C6 (purple) show the distributions of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex biomolecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation, and aiming to describe the same phenomena, may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways", in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of datasets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other more recent papers.
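The target-pathway criterion can be sketched as a small scoring routine: for each dataset, find the rank and p-value of the target pathway in a method's output, then summarize across datasets. The use of the median as the summary, the pathway names, and the p-values below are illustrative assumptions, not the benchmark data or exact metrics of Tarca et al. (2013).

```python
from statistics import median

def target_rank(pathway_pvalues, target):
    """1-based rank of the target pathway when pathways are sorted by p-value."""
    ordered = sorted(pathway_pvalues, key=pathway_pvalues.get)
    return ordered.index(target) + 1

def summarize(benchmark):
    """benchmark: list of (pathway -> p-value, target pathway) pairs, one per dataset.

    Returns the median rank and median p-value of the target pathways;
    lower values indicate a better-performing method."""
    ranks = [target_rank(pvals, target) for pvals, target in benchmark]
    target_pvals = [pvals[target] for pvals, target in benchmark]
    return median(ranks), median(target_pvals)

# Hypothetical output of one method on three datasets
benchmark = [
    ({"Colorectal cancer": 0.001, "Insulin signaling": 0.20, "Apoptosis": 0.03},
     "Colorectal cancer"),
    ({"Colorectal cancer": 0.40, "Insulin signaling": 0.02, "Apoptosis": 0.01},
     "Insulin signaling"),
    ({"Colorectal cancer": 0.30, "Insulin signaling": 0.25, "Apoptosis": 0.10},
     "Colorectal cancer"),
]
print(summarize(benchmark))  # (2, 0.02)
```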

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of datasets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated not only with one target pathway but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer. Specifically, 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data are provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.
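The non-uniformity of null p-values discussed in this summary can be checked empirically. The following is a minimal sketch, not the procedure used in this unit: it assumes a list of p-values collected for one pathway from control-versus-control comparisons and measures how far their empirical distribution lies from Uniform(0, 1) with a Kolmogorov-Smirnov distance. The function name and the toy inputs are hypothetical.

```python
def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of p_values and the Uniform(0, 1) CDF."""
    xs = sorted(p_values)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x; the uniform CDF equals x.
        gap = max(gap, (i + 1) / n - x, x - i / n)
    return gap


# Toy inputs (illustrative only): an evenly spread set of p-values and a set
# that is pushed towards zero, mimicking a pathway prone to false positives.
n = 100
uniform_like = [(i + 0.5) / n for i in range(n)]
biased_to_zero = [p ** 3 for p in uniform_like]

print(ks_distance_from_uniform(uniform_like))
print(ks_distance_from_uniform(biased_to_zero))
```

A large distance for a pathway under the null suggests that its p-values are biased towards zero or one, so its nominal significance should be interpreted with caution.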

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. In Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19–e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions–applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131. Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0–making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.


single-type, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe "all" and "any" relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connection only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules (Lauritzen, 1996). In this method, decomposable graphs are used to find "important" submodules, those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph from the underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. [The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.] The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the research hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules, and the hypothesis is retested on each of them.
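The moralization step described above (marry all parents of a common child, then drop edge directions) can be sketched in a few lines. This is an illustrative sketch only, not code from TopologyGSA; the dictionary-based DAG representation, the function name `moralize`, and the toy pathway are assumptions made for the example.

```python
from itertools import combinations


def moralize(dag):
    """Return the undirected edge set of the moral graph of a DAG.

    dag maps each node to the list of its children (directed edges node -> child).
    Each undirected edge is represented as a frozenset of its two endpoints.
    """
    # Collect the parents of every node.
    parents = {}
    for node, children in dag.items():
        parents.setdefault(node, set())
        for child in children:
            parents.setdefault(child, set()).add(node)

    edges = set()
    for child, node_parents in parents.items():
        # Keep every original edge, without its direction.
        for parent in node_parents:
            edges.add(frozenset((parent, child)))
        # "Marry" every pair of parents that share this child.
        for a, b in combinations(sorted(node_parents), 2):
            edges.add(frozenset((a, b)))
    return edges


# Hypothetical toy pathway: A -> C, B -> C, C -> D.
toy_dag = {"A": ["C"], "B": ["C"], "C": ["D"]}
moral_edges = moralize(toy_dag)
print(sorted(tuple(sorted(e)) for e in moral_edges))
```

In the toy pathway, A and B share the child C, so the moral graph gains the edge A-B in addition to the three original edges.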

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could, in principle, contain gene-expression and copy-number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes.

Pathway Scoring Strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and the graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where, on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node, and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model, and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 8.25.3 for scoring algorithm categories.

Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and the evaluation of the significance for the pathway statistic.

Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian network, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

Node level scoring

Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: (i) graph measures (centrality), (ii) similarity measures, (iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the network adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
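Two of the centrality measures named above, degree centrality and the sum of shortest distances used by closeness, can be illustrated on a toy pathway. This is a minimal sketch and does not reproduce the scoring of any surveyed tool; the edge-list representation, the function names, and the choice to ignore edge direction when computing distances (so that every node is reachable) are assumptions made for the example.

```python
from collections import deque


def degree_centrality(edges):
    """Count the directed edges that enter or leave each node."""
    degree = {}
    for source, target in edges:
        degree[source] = degree.get(source, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    return degree


def shortest_distance_sums(edges):
    """Sum of shortest-path lengths from each node to all other nodes.

    Assumption: edge direction is ignored and the graph is connected.
    A smaller sum means the node is closer to the rest of the network.
    """
    neighbors = {}
    for source, target in edges:
        neighbors.setdefault(source, set()).add(target)
        neighbors.setdefault(target, set()).add(source)

    sums = {}
    for start in neighbors:
        # Breadth-first search gives shortest path lengths in unweighted graphs.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        sums[start] = sum(distance.values())
    return sums


# Hypothetical toy pathway: A -> B, B -> C, B -> D.
toy_edges = [("A", "B"), ("B", "C"), ("B", "D")]
degrees = degree_centrality(toy_edges)
distance_sums = shortest_distance_sums(toy_edges)
print(degrees)
print(distance_sums)
```

In this toy pathway, B touches every edge and is one step from every other node, so it has the highest degree and the smallest distance sum.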

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profile across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
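The pair score described above, a similarity measure divided by the shortest path distance, can be sketched as follows. The sketch assumes Pearson correlation as the similarity measure and takes the path length as a precomputed input; it is not the exact formula of ScorePAGE or PWEA, and the function names and toy profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def pair_score(profile_a, profile_b, path_length):
    """Similarity of two genes, down-weighted by their distance in the pathway."""
    return pearson(profile_a, profile_b) / path_length


# Hypothetical profiles of two genes across four samples, two edges apart.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(pair_score(gene_a, gene_b, path_length=2))
```

Dividing by the path length means that two perfectly correlated genes contribute less when they are far apart in the pathway than when they are direct neighbors.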

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions, such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
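The TAPPA-style non-linear aggregation described in this paragraph can be illustrated with a small sketch. It assumes that the node score is the square root of the absolute normalized expression value and that the sign weight of a connected pair is the sign of the sum of the two values; both choices, along with the function names and toy data, are assumptions made for illustration rather than a reproduction of the published method.

```python
import math


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def tappa_like_score(node_values, edges):
    """Weighted sum of products of node pair scores for one sample.

    node_values maps each gene to its normalized log expression.
    Pairs without an edge get weight 0, so only listed edges contribute.
    """
    total = 0.0
    for gene_u, gene_v in edges:
        value_u = node_values[gene_u]
        value_v = node_values[gene_v]
        # Assumed sign weight: joint up- or down-regulation of the pair.
        weight = sign(value_u + value_v)
        total += weight * math.sqrt(abs(value_u)) * math.sqrt(abs(value_v))
    return total


# Hypothetical sample: genes a and b are up-regulated, gene c is strongly down.
values = {"a": 4.0, "b": 1.0, "c": -9.0}
pathway_edges = [("a", "b"), ("b", "c")]
print(tappa_like_score(values, pathway_edges))
```

Because the product of node scores is taken per edge, a pathway whose connected genes change together scores higher in magnitude than one whose changed genes are scattered and unconnected.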

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes and meaningful changes that are unlikely to occur just by chance and that therefore are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
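As an illustration of the final correction step, the sketch below applies the Benjamini-Hochberg false discovery rate adjustment to a set of pathway p-values and then applies a user-defined threshold. The text does not prescribe a particular correction, so the choice of Benjamini-Hochberg, the pathway names, and the p-values are assumptions of this example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-pathway p-values produced by some analysis method.
pathway_p = {"pathway_1": 0.01, "pathway_2": 0.04, "pathway_3": 0.03, "pathway_4": 0.50}
adjusted = benjamini_hochberg(list(pathway_p.values()))
threshold = 0.05  # user-defined significance threshold
significant = [name for name, q in zip(pathway_p, adjusted) if q <= threshold]
print(significant)  # ['pathway_1']
```

Three of the four raw p-values fall below 0.05, but only one survives the adjustment, which shows why the correction matters when many pathways are tested at once.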

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
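The two ways of obtaining a null distribution described above can be sketched in a few lines of Python. The first function computes the hypergeometric tail probability for the number of DE genes on a pathway. The second builds an empirical null by shuffling sample labels. The function names, the mean-difference statistic, and the add-one correction are assumptions of this sketch and are not taken from any of the surveyed tools.

```python
import random
from math import comb


def hypergeom_pvalue(n_genes, n_de, pathway_size, de_in_pathway):
    """P(X >= de_in_pathway) under the hypergeometric model.

    This treats genes as independent draws, which is the assumption the
    text criticizes for pathway data.
    """
    upper = min(n_de, pathway_size)
    tail = sum(
        comb(n_de, k) * comb(n_genes - n_de, pathway_size - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / comb(n_genes, pathway_size)


def label_permutation_pvalue(values, labels, n_perm=1000, seed=0):
    """Empirical p-value for |mean(group 1) - mean(group 0)| under label shuffling.

    `values` holds one pathway-level value per sample; `labels` holds 0 or 1.
    """
    def statistic(lab):
        g1 = [v for v, l in zip(values, lab) if l == 1]
        g0 = [v for v, l in zip(values, lab) if l == 0]
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# 10 measured genes, 3 of them DE, a pathway of 4 genes containing 2 DE genes.
print(hypergeom_pvalue(10, 3, 4, 2))
```

The permutation approach makes no distributional assumption about the score, at the cost of needing enough samples for the shuffled labels to produce a reasonably fine-grained null distribution.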

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume that the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable assigned to an edge is the probability that up- or down-regulation of the genes at both ends of an interaction are concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probability follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
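The beta-binomial and Dirichlet-multinomial pairings mentioned above are conjugate, so the posterior is obtained by adding observed counts to the prior parameters. The following sketch shows only this generic update. It is not the BPA or BAPA-IGGFD implementation, and the uniform priors and counts are made-up values for illustration.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing `successes` in `trials`."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def dirichlet_multinomial_update(alphas, counts):
    """Posterior Dirichlet parameters: prior pseudo-counts plus observed counts."""
    return [a + c for a, c in zip(alphas, counts)]


# Hypothetical example: a uniform Beta(1, 1) prior on the probability that a
# gene is DE, updated with 7 "DE" observations out of 10.
post_alpha, post_beta = beta_binomial_update(1, 1, successes=7, trials=10)
print(post_alpha, post_beta, beta_mean(post_alpha, post_beta))

# Multivariate analogue: three states with a uniform Dirichlet prior.
print(dirichlet_multinomial_update([1, 1, 1], [5, 3, 2]))
```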

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes, and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses Petri Net (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
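To make the idea of combining edge-level p-values concrete, here is a minimal sketch of the classical, unweighted Fisher's method. DRAGEN uses a weighted variant whose weights are not described here, so this example deliberately shows only the unweighted case. The function names are illustrative, and the closed-form chi-square tail is valid only for an even number of degrees of freedom, which always holds for Fisher's statistic.

```python
from math import exp, log


def chi2_sf_even_df(x, df):
    """Survival function of a chi-square variable with an even number of df."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Closed form: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total


def fisher_combined_pvalue(pvalues):
    """Unweighted Fisher's method: X = -2 * sum(ln p), chi-square with 2k df."""
    statistic = -2.0 * sum(log(p) for p in pvalues)
    return chi2_sf_even_df(statistic, 2 * len(pvalues))


# Hypothetical p-values for the edges of one pathway.
edge_pvalues = [0.05, 0.05]
print(fisher_combined_pvalue(edge_pvalues))
```

Two edges that are each borderline at 0.05 combine to roughly 0.017, which illustrates how consistent weak evidence across interactions can add up at the pathway level.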

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000 despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations, so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis. Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.
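The resampling scheme described above can be sketched as follows. This is a toy illustration under loudly stated assumptions: the pool is synthetic Gaussian data with one value per sample rather than the 140 real control samples, a simple two-sided z-test stands in for PADOG, GSA, or SPIA, and the Kolmogorov-Smirnov distance is just one possible way to quantify departure from uniformity. All function names are hypothetical.

```python
import random
from math import erfc, sqrt
from statistics import stdev


def random_split(pool, n_control, n_disease, rng):
    """Randomly relabel pooled control samples as 'control' and 'disease'."""
    chosen = rng.sample(pool, n_control + n_disease)
    return chosen[:n_control], chosen[n_control:]


def z_test_pvalue(control, disease, sd):
    """Two-sided z-test; a simple stand-in for a pathway analysis method."""
    diff = sum(disease) / len(disease) - sum(control) / len(control)
    z = diff / (sd * sqrt(1 / len(control) + 1 / len(disease)))
    return erfc(abs(z) / sqrt(2))


def null_pvalues(pool, n_control, n_disease, n_sim, seed=0):
    """P-values from repeated random splits of a pool with no true effect."""
    rng = random.Random(seed)
    sd = stdev(pool)
    return [
        z_test_pvalue(*random_split(pool, n_control, n_disease, rng), sd)
        for _ in range(n_sim)
    ]


def ks_distance_to_uniform(pvalues):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    ordered = sorted(pvalues)
    n = len(ordered)
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered))


# Synthetic pool standing in for 140 pooled control samples.
pool_rng = random.Random(42)
pool = [pool_rng.gauss(0.0, 1.0) for _ in range(140)]
for n_c, n_d in [(70, 70), (10, 10), (10, 20), (20, 10)]:
    pvals = null_pvalues(pool, n_c, n_d, n_sim=2000)
    print(n_c, n_d, round(ks_distance_to_uniform(pvals), 3))
```

For a well-calibrated test the distance stays close to zero for every group size. Running the same loop with a real pathway analysis method in place of `z_test_pvalue` is the kind of check that reveals the nonuniform null distributions discussed in this section.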

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0–A6) show p-value distributions from PADOG, while blue (B0–B6) and purple (C0–C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1–A3) and three distributions severely biased towards one (e.g., A4–A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex bio-molecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0–A6 (green) show the distributions of p-values from PADOG; panels B0–B6 (blue) show the distribution of p-values from GSA; panels C0–C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1–A3, are biased towards zero, and the lower three distributions, for example A4–A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways", in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.
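As a small illustration of what format conversion involves, the sketch below reads a toy KGML-style document and flattens it into a generic list of directed, typed edges that a topology-based method could consume. The embedded snippet is hand-written for this example: the pathway identifier and gene identifiers are made up, and only the `entry`, `relation`, and `subtype` elements are handled. A real converter would also need to deal with entry groups, compounds, reactions, and entries that map to several genes.

```python
import xml.etree.ElementTree as ET

# Hand-written toy document in the style of KGML; all identifiers are hypothetical.
KGML_SNIPPET = """<pathway name="path:hsa00000" org="hsa" number="00000" title="Toy pathway">
  <entry id="1" name="hsa:1111" type="gene"/>
  <entry id="2" name="hsa:2222" type="gene"/>
  <entry id="3" name="hsa:3333" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="inhibition" value="--|"/>
  </relation>
</pathway>"""


def kgml_to_edges(kgml_text):
    """Return (source gene, target gene, [interaction types]) for each relation."""
    root = ET.fromstring(kgml_text)
    # Map the pathway-local entry ids to gene identifiers.
    names = {entry.get("id"): entry.get("name") for entry in root.findall("entry")}
    edges = []
    for rel in root.findall("relation"):
        kinds = [sub.get("name") for sub in rel.findall("subtype")]
        edges.append((names[rel.get("entry1")], names[rel.get("entry2")], kinds))
    return edges


for edge in kgml_to_edges(KGML_SNIPPET):
    print(edge)
```

An edge list of this kind cannot directly express the "process nodes" used by NCI-PID, which is one reason a lossless translation between pathway formats is harder than it first appears.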

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective and, many times, they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the available pathway that describes the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015) using a different set of 36 real datasets, as well as by other more recent papers.
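The target-pathway evaluation can be sketched as a short scoring routine: for each dataset, rank all pathways by p-value, record the rank and p-value of the target pathway, and summarize across datasets. The use of medians as the summary, the function names, and the example pathway names and p-values are assumptions of this sketch rather than the exact protocol of the cited benchmarks.

```python
from statistics import median


def target_rank_and_pvalue(pathway_pvalues, target):
    """Rank (1 = most significant) and p-value of the target pathway.

    Ties are broken by insertion order, since Python's sort is stable.
    """
    ordered = sorted(pathway_pvalues, key=pathway_pvalues.get)
    return ordered.index(target) + 1, pathway_pvalues[target]


def summarize_method(results):
    """`results` is a list of (pathway -> p-value dict, target name), one per dataset."""
    ranks, pvals = zip(*(target_rank_and_pvalue(p, t) for p, t in results))
    return median(ranks), median(pvals)


# Hypothetical output of one method on two datasets.
results = [
    ({"Colorectal cancer": 0.001, "Apoptosis": 0.2, "Insulin signaling": 0.03},
     "Colorectal cancer"),
    ({"Alzheimer disease": 0.04, "Apoptosis": 0.01, "Insulin signaling": 0.5},
     "Alzheimer disease"),
]
print(summarize_method(results))
```

Lower medians indicate a method that places the target pathways nearer the top, although the summary inherits every limitation of the target-pathway assumption itself.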

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated with not only one target pathway, but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer. As such, 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance.


Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, MD Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., … Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., … Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., … Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., … Sansone, S.-A. (2003). ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., … Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., … Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., … D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., … Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., … Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., … Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao J Aksoy B A Dogrusoz U Dresdner GGross B Sumer Sur Schultz N (2013)Integrative analysis of complex cancer genomics

AnalyzingMolecularInteractions

82521

Current Protocols in Bioinformatics Supplement 61

and clinical profiles using the cBioPortal Sci-ence Signaling 6(269) pl1 doi 101126scisig-nal2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., … Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., … Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., … Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., … Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., … Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions – applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., … Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., … Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., … Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., … Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., … Romero, R. (2009). A novel signaling pathway impact analysis. Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., … Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., … Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0 – making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., … Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.


Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model, while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in the multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian networks, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy, while GGEA applies fuzzy modeling to rank the pathways.

(iii) probabilistic graphical models, and (iv) normalized node value (NNV). Approaches in the first category use centrality measures, or a variation of these measures, to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. There are several centrality measures that can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
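The centrality measures described above can be illustrated on a toy graph. The following is a minimal sketch, not the implementation of any surveyed method: the five-gene pathway, the function names, and the choice to treat edges as undirected for the distance sum and the eigenvector score are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy pathway: directed edges (source gene -> target gene).
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
NODES = sorted({n for edge in EDGES for n in edge})


def undirected_neighbors(nodes, edges):
    """Adjacency sets that ignore edge direction (a simplifying assumption)."""
    neighbors = {n: set() for n in nodes}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    return neighbors


def degree_centrality(nodes, edges):
    """Number of directed edges that enter or leave each node."""
    degree = {n: 0 for n in nodes}
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return degree


def distance_sum(nodes, edges):
    """Sum of shortest-path lengths from each node to all other nodes (BFS)."""
    neighbors = undirected_neighbors(nodes, edges)
    totals = {}
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in neighbors[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        totals[start] = sum(dist.values())
    return totals


def eigenvector_centrality(nodes, edges, iterations=200):
    """Dominant eigenvector of the adjacency matrix by power iteration."""
    neighbors = undirected_neighbors(nodes, edges)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Multiply by (A + I): the identity shift keeps the eigenvectors of A
        # but avoids oscillation on bipartite graphs such as this one.
        nxt = {n: score[n] + sum(score[m] for m in neighbors[n]) for n in nodes}
        norm = sum(v * v for v in nxt.values()) ** 0.5
        score = {n: v / norm for n, v in nxt.items()}
    return score


degree = degree_centrality(NODES, EDGES)
dist_sum = distance_sum(NODES, EDGES)
eig = eigenvector_centrality(NODES, EDGES)
```

In this toy graph, node D has the most edges, the smallest distance sum, and the largest eigenvector score, so all three measures agree that it is the most central gene. Real pathways are directed and contain several interaction types, which the surveyed methods handle in method-specific ways.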

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profiles across time or samples. In these methods, the pathways with clusters of highly correlated genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, which is the ratio of one similarity measure over the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
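The pair score described above, a similarity measure divided by the shortest path distance, can be sketched as follows. This is an illustration under stated assumptions: Pearson correlation is used as the similarity measure, the graph is undirected, and the gene names, expression values, and function names are hypothetical. ScorePAGE and PWEA differ in their details.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def shortest_path_length(neighbors, start, goal):
    """Breadth-first search distance; returns None if goal is unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            return dist[cur]
        for nxt in neighbors[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return None


def pair_score(expression, neighbors, gene_a, gene_b):
    """Similarity of two genes divided by their shortest path distance."""
    distance = shortest_path_length(neighbors, gene_a, gene_b)
    if not distance:  # same node or no connecting path
        return 0.0
    return pearson(expression[gene_a], expression[gene_b]) / distance


# Hypothetical chain g1 - g2 - g3 with expression measured in four samples.
neighbors = {"g1": {"g2"}, "g2": {"g1", "g3"}, "g3": {"g2"}}
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
score_adjacent = pair_score(expression, neighbors, "g1", "g2")
score_distant = pair_score(expression, neighbors, "g1", "g3")
```

Here the perfectly correlated neighbors g1 and g2 keep their full correlation, while the perfectly anti-correlated pair g1 and g3 is halved because the two genes are two steps apart.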

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expression (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all its genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the products of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference between the node score distribution for a pathway and that of a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
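The TAPPA-style non-linear score described above can be sketched in a few lines. This is an illustration, not the published TAPPA code: it assumes that each node value is a normalized log expression for one sample, that the node score is the square root of its absolute value, and that the sign weight of a connected pair is the sign of the sum of the two node values. The gene names and values are hypothetical.

```python
from math import sqrt


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def nonlinear_pathway_score(node_values, edges):
    """Weighted sum of products of node-pair scores for one sample.

    Pairs without an edge contribute nothing (weight 0). For connected
    pairs, the weight is a sign term (assumed here to be the sign of the
    summed node values) that reflects joint up- or down-regulation.
    """
    total = 0.0
    for gene_a, gene_b in edges:
        x_a, x_b = node_values[gene_a], node_values[gene_b]
        total += sign(x_a + x_b) * sqrt(abs(x_a)) * sqrt(abs(x_b))
    return total


# Hypothetical normalized log expression values for a single sample.
node_values = {"A": 4.0, "B": 1.0, "C": -9.0}
edges = [("A", "B"), ("B", "C")]  # A and C are not connected
sample_score = nonlinear_pathway_score(node_values, edges)
```

The pair A-B contributes +2 (both genes are up) and the pair B-C contributes -3 (the pair is dominated by the strongly down-regulated C), so the sample score is -1. The unconnected pair A-C is ignored regardless of its expression values.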

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristics of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes and meaningful changes that are unlikely to occur just by chance and that are therefore possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
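The text above does not prescribe a particular correction for multiple comparisons. As one common choice, the Benjamini-Hochberg false discovery rate adjustment can be sketched as below; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]
```

With these values, three pathways pass an unadjusted 0.05 threshold, but only the first one remains significant after adjustment.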

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
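When the pathway score is the number of DE genes on the pathway, the hypergeometric p-value mentioned above can be computed directly. The sketch below uses only the Python standard library; the function name, argument names, and gene counts are hypothetical, and the calculation inherits the independence assumption criticized in the text.

```python
from math import comb


def hypergeometric_pvalue(total_genes, de_genes, pathway_size, de_in_pathway):
    """Probability of observing at least de_in_pathway DE genes on a pathway.

    Assumes the pathway is a random draw of pathway_size genes, without
    replacement, from total_genes genes of which de_genes are DE.
    """
    upper = min(de_genes, pathway_size)
    favorable = sum(
        comb(de_genes, k) * comb(total_genes - de_genes, pathway_size - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return favorable / comb(total_genes, pathway_size)


# Hypothetical experiment: 10 measured genes, 4 of them DE,
# and a pathway of 3 genes that contains 2 DE genes.
p_value = hypergeometric_pvalue(10, 4, 3, 2)
```

For this small example, 40 of the 120 possible three-gene pathways contain at least two DE genes, so the p-value is 1/3.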

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume that the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques used to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and from the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable, assigned to an edge, is the probability that the up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of success probabilities follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes that the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
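The beta-binomial and Dirichlet-multinomial pairs mentioned above are standard conjugate models. The following block states the textbook relationships for reference; the symbols (theta, alpha, beta, n, x, K) are notation chosen here and are not taken from the BPA or BAPA-IGGFD papers.

```latex
% Binomial likelihood with a beta prior on the success probability
\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad
x \mid \theta \sim \mathrm{Binomial}(n, \theta)
\;\Longrightarrow\;
\theta \mid x \sim \mathrm{Beta}(\alpha + x,\; \beta + n - x)

% Multinomial likelihood with a Dirichlet prior (K categories)
\boldsymbol{\theta} \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K), \qquad
\mathbf{x} \mid \boldsymbol{\theta} \sim \mathrm{Multinomial}(n, \boldsymbol{\theta})
\;\Longrightarrow\;
\boldsymbol{\theta} \mid \mathbf{x} \sim \mathrm{Dirichlet}(\alpha_1 + x_1, \dots, \alpha_K + x_K)

% Under an independence assumption, the joint density factorizes
p(\theta_1, \dots, \theta_m) = \prod_{i=1}^{m} p(\theta_i)
```

The last line corresponds to the independence assumption attributed to BAPA-IGGFD above, where the multivariate distribution is obtained by multiplying the individual distributions.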

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway with two linear models (one for case and one for control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses Petri nets (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and Petri nets with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
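To illustrate how edge-level p-values can be combined into one pathway-level statistic, the sketch below implements the classical, unweighted Fisher's method. It is not the weighted variant that DRAGEN uses, whose weights are not described here, and it relies on the chi-square reference distribution, which assumes independent p-values, rather than on a permutation null. The function name and the input values are hypothetical.

```python
from math import exp, log


def fisher_combined_pvalue(pvalues):
    """Combine p-values (each > 0) with the unweighted Fisher's method.

    The statistic -2 * sum(log(p)) follows a chi-square distribution with
    2k degrees of freedom when the k p-values are independent and uniform.
    For even degrees of freedom, the survival function has the closed form
    exp(-x/2) * sum_{i<k} (x/2)**i / i!, which is used here.
    """
    k = len(pvalues)
    statistic = -2.0 * sum(log(p) for p in pvalues)
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, exp(-half) * total


# Hypothetical p-values for the two edges of a small pathway.
statistic, combined = fisher_combined_pvalue([0.05, 0.05])
```

Two edges that are each borderline at 0.05 combine to a p-value of about 0.017, which shows how consistent weak evidence across edges can accumulate at the pathway level.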

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allow us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000, even though KEGG has been updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from a computational perspective. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods such as the t-test inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).
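The uniformity of null p-values can be checked with a small simulation. The sketch below is a simplified stand-in for the t-test example: because the Python standard library has no t-distribution, it uses a two-sided z-test with a known standard deviation, applied to two groups drawn from the same normal distribution. The function names, sample sizes, and seed are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, mean


def z_test_pvalue(group_a, group_b, sigma=1.0):
    """Two-sided z-test for a difference in means with known sigma."""
    se = sigma * (1.0 / len(group_a) + 1.0 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_null_pvalues(n_sim=2000, n_per_group=10, seed=0):
    """Draw both groups from the same N(0, 1) population, so the null holds."""
    rng = random.Random(seed)
    pvalues = []
    for _ in range(n_sim):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        pvalues.append(z_test_pvalue(a, b))
    return pvalues


null_pvalues = simulate_null_pvalues()
# Under uniformity, roughly 5% of p-values fall below 0.05
# and the mean p-value is close to 0.5.
fraction_below_005 = sum(p < 0.05 for p in null_pvalues) / len(null_pvalues)
mean_pvalue = mean(null_pvalues)
```

When the test assumptions hold, as they do here by construction, the fraction of p-values below any threshold is close to that threshold. A pathway whose null p-values pile up near zero or near one violates this property, which is the bias discussed in this section.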

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis. Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
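The random relabeling procedure described above can be sketched as a small generator. Only the group sizes and the pool of 140 control samples come from the text; the sample identifiers, the function name, and the seed are hypothetical, and the number of repetitions is reduced from 10,000 to keep the example fast.

```python
import random


def random_label_splits(sample_ids, n_control, n_disease, n_datasets, seed=0):
    """Yield (control, disease) groups drawn at random from pooled controls."""
    rng = random.Random(seed)
    for _ in range(n_datasets):
        # Sampling without replacement keeps the two groups disjoint.
        chosen = rng.sample(sample_ids, n_control + n_disease)
        yield chosen[:n_control], chosen[n_control:]


# Hypothetical identifiers for the 140 pooled control samples.
samples = [f"control_{i:03d}" for i in range(140)]

# The four designs from the text: (number of control, number of disease).
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]

simulated = {
    design: list(random_label_splits(samples, *design, n_datasets=5))
    for design in designs
}
first_control, first_disease = simulated[(70, 70)][0]
```

Each simulated dataset would then be passed to the pathway analysis method under study, and the resulting pathway p-values collected to form the empirical null distribution.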

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population, which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, lifestyle, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one will rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
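The practical consequence of a skewed null distribution can be quantified as the fraction of null p-values that fall below the significance threshold. The sketch below uses made-up p-value sets (an evenly spread set and two deliberately skewed transformations of it); they are illustrative stand-ins, not values from the study, and the function name is an assumption.

```python
def empirical_false_positive_rate(null_pvalues, alpha=0.05):
    """Fraction of null p-values that would be called significant at level alpha."""
    if not null_pvalues:
        raise ValueError("need at least one p-value")
    return sum(p <= alpha for p in null_pvalues) / len(null_pvalues)


# Made-up null p-values for three hypothetical pathways (not results from the study).
uniform_like = [(i + 0.5) / 1000 for i in range(1000)]    # evenly spread on (0, 1)
biased_to_zero = [p ** 3 for p in uniform_like]           # mass pushed towards 0
biased_to_one = [1 - (1 - p) ** 3 for p in uniform_like]  # mass pushed towards 1

for name, pvals in [("uniform", uniform_like),
                    ("biased to zero", biased_to_zero),
                    ("biased to one", biased_to_one)]:
    print(name, empirical_false_positive_rate(pvals))
```

For a pathway whose null p-values are uniform, the rate matches the nominal threshold. For the set skewed towards zero, the rate is several times higher, so that pathway would be reported as significant far more often than the threshold suggests; for the set skewed towards one, it is much lower, so the pathway would seldom be reported at all.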

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex biomolecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces “process nodes” to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.
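To make the difference between the two graph models concrete, the sketch below converts a toy bipartite pathway, in which interactions are carried by process nodes, into a graph in which the interaction is carried by the directed edge itself. The data structure, the node names, and the conversion rule (every input of a process is connected to every output) are assumptions chosen for illustration; they are not the actual NCI-PID schema or any tool's converter.

```python
from itertools import product


def flatten_process_nodes(processes):
    """Collapse each process node into direct input -> output edges.

    `processes` maps a process id to (inputs, outputs, effect). The AND
    semantics of multi-input processes is lost in this conversion.
    """
    edges = set()
    for inputs, outputs, effect in processes.values():
        for source, target in product(inputs, outputs):
            edges.add((source, target, effect))
    return sorted(edges)


# Toy bipartite pathway with made-up names; not taken from NCI-PID.
toy_processes = {
    "proc1": (["A", "B"], ["C"], "activation"),  # A AND B are both required
    "proc2": (["C"], ["D", "E"], "inhibition"),
}

for edge in flatten_process_nodes(toy_processes):
    print(edge)
```

The conversion is lossy: after flattening, the edges A -> C and B -> C no longer record that both inputs were required together, which is one reason why forcing a method designed for one graph model onto another database can change its results.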

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house “NCI-Nature curated pathways”, in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.
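As an illustration of what working with one of these formats involves, the following sketch reads a small hand-written fragment in the style of KGML and extracts a directed edge list. The fragment, its identifiers, and the function name are placeholders invented for this example; real KGML files contain additional elements (such as graphics and reaction entries) that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of KGML; ids and gene names are placeholders.
TOY_KGML = """
<pathway name="path:toy00001" org="toy" number="00001" title="Toy pathway">
  <entry id="1" name="toy:geneA" type="gene"/>
  <entry id="2" name="toy:geneB" type="gene"/>
  <entry id="3" name="toy:geneC" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="inhibition" value="--|"/>
  </relation>
</pathway>
"""


def kgml_to_edges(kgml_text):
    """Return (source_name, target_name, interaction) tuples from KGML text."""
    root = ET.fromstring(kgml_text)
    # Map the numeric entry ids used by relations to the entry names.
    names = {e.get("id"): e.get("name") for e in root.findall("entry")}
    edges = []
    for rel in root.findall("relation"):
        subtypes = [s.get("name") for s in rel.findall("subtype")] or ["unknown"]
        for interaction in subtypes:
            edges.append((names[rel.get("entry1")], names[rel.get("entry2")], interaction))
    return edges


print(kgml_to_edges(TOY_KGML))
```

A reader for BioPAX or SBML would need an entirely different parser and a different mapping onto nodes and edges, which is the practical cost of the format diversity described above.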

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study, when one is available. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015) using a different set of 36 real datasets, as well as by other more recent papers.
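The target-pathway criterion can be expressed in a few lines. The sketch below is an illustration under assumptions: the pathway names and p-values are made up, ties are given the mean of the ranks they span, and summarizing across datasets by the median is a choice made here for simplicity rather than a description of the exact statistics used in the cited benchmarks.

```python
from statistics import median


def target_pathway_rank(pvalues, target):
    """Return (rank, p-value) of the target pathway; rank 1 is the most significant.

    Ties receive the mean of the ranks they span.
    """
    p_target = pvalues[target]
    smaller = sum(p < p_target for p in pvalues.values())
    tied = sum(p == p_target for p in pvalues.values())  # includes the target itself
    return smaller + (tied + 1) / 2, p_target


def summarize_benchmark(results):
    """Median target rank and p-value over (pvalues, target) pairs, one per dataset."""
    ranks, ps = zip(*(target_pathway_rank(pv, t) for pv, t in results))
    return median(ranks), median(ps)


# Made-up method output for two hypothetical datasets.
toy_results = [
    ({"Colorectal cancer": 0.01, "Insulin signaling": 0.20, "Apoptosis": 0.03},
     "Colorectal cancer"),
    ({"Alzheimer's disease": 0.30, "Insulin signaling": 0.04, "Apoptosis": 0.50},
     "Alzheimer's disease"),
]
print(summarize_benchmark(toy_results))
```

Two methods can then be compared on the same collection of datasets by comparing these summaries, with lower values indicating better performance under this criterion.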

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated with not only one target pathway, but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data are provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one will rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19–e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions–applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis. Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0–making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1

model, nodes are random variables and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experimental data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample, such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.

Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expressions (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring

There are three different ways to compute the pathway score: (i) linear, (ii) non-linear, and (iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all its genes, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair, the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference of the node score distribution for a pathway and a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
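
The TAPPA-style pairwise score described in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TAPPA's published implementation: the function name `pairwise_pathway_score`, the dictionary and edge-list inputs, and the use of sign(x_i + x_j) as the sign function for a connected pair are all assumptions made here. Node scores are taken as the square root of the absolute node value, following the node-level description of TAPPA given earlier.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_pathway_score(node_values, edges):
    """Sum of products of node-pair scores over connected pairs only.

    node_values: dict mapping gene -> normalized log expression (assumed input).
    edges: iterable of (gene_a, gene_b) pairs; unconnected pairs get weight 0
           simply by not appearing here.
    The weight sign(x_a + x_b) is an assumed form of the "sign function".
    """
    total = 0.0
    for a, b in edges:
        xa, xb = node_values[a], node_values[b]
        total += sign(xa + xb) * math.sqrt(abs(xa)) * math.sqrt(abs(xb))
    return total


# Toy pathway: A-B are jointly up, B-C is dominated by the down-regulated C.
values = {"A": 4.0, "B": 1.0, "C": -9.0}
score = pairwise_pathway_score(values, [("A", "B"), ("B", "C")])
```

Because the pair A-C has no edge, it contributes nothing to the score, which is how the topology enters the statistic.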

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA (Subramanian et al., 2005), GSA (Efron & Tibshirani, 2007), or LRPath (Sartor, Leikauf, & Medvedovic, 2009). Pathway-level scores can be calculated using node scores that represent the topology characteristic of the pathway as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach, and we refer to them as weighted gene set analysis methods.

Pathway significance assessment

Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself, since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because users do not have any instrument to distinguish between changes due to noise or random causes and meaningful changes that are unlikely to occur just by chance and that, therefore, are possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
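
As an illustration of the final correction step, the sketch below adjusts a list of pathway p-values with the Benjamini-Hochberg false discovery rate procedure. The text does not prescribe a particular correction, so the choice of Benjamini-Hochberg, the function name `benjamini_hochberg`, and the example p-values are assumptions made here for illustration only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four pathways.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q <= 0.05 for q in adjusted]
```

The user-defined threshold mentioned above is then applied to the adjusted values rather than to the raw p-values.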

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed score. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
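
When the pathway score is the number of DE genes that fall on the pathway, the hypergeometric p-value is the probability of seeing at least that many DE genes in a random gene set of the same size. The sketch below computes this upper tail with the Python standard library. The function name and argument names are hypothetical, and the calculation inherits the independence assumption criticized above.

```python
from math import comb


def hypergeom_upper_tail(n_total, n_de, n_pathway, k_observed):
    """P(X >= k_observed) for X ~ Hypergeometric.

    n_total:    number of genes measured (the background).
    n_de:       number of DE genes in the background.
    n_pathway:  number of background genes that belong to the pathway.
    k_observed: number of DE genes observed on the pathway.
    """
    upper = min(n_de, n_pathway)
    denominator = comb(n_total, n_pathway)
    numerator = sum(
        comb(n_de, k) * comb(n_total - n_de, n_pathway - k)
        for k in range(k_observed, upper + 1)
    )
    return numerator / denominator


# Toy example: 10 genes, 4 of them DE, a 3-gene pathway containing 2 DE genes.
p_value = hypergeom_upper_tail(10, 4, 3, 2)
```

In this toy example the tail is (36 + 4) / 120, i.e., one third, so two DE genes out of three would not be considered surprising.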

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable assigned to an edge is the probability that up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probabilities follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
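
The binomial/beta and multinomial/Dirichlet pairings mentioned above are standard conjugate pairs, which is what makes the posterior distributions available in closed form. The block below states the textbook updates only; the symbols (n trials, x successes, prior parameters alpha and beta) are notation introduced here and are not taken from the BPA or BAPA-IGGFD papers.

```latex
\begin{align*}
x \mid \theta &\sim \mathrm{Binomial}(n, \theta), \qquad
  \theta \sim \mathrm{Beta}(\alpha, \beta) \\
\theta \mid x &\sim \mathrm{Beta}(\alpha + x,\ \beta + n - x) \\[4pt]
\mathbf{x} \mid \boldsymbol{\theta} &\sim \mathrm{Multinomial}(n, \boldsymbol{\theta}), \qquad
  \boldsymbol{\theta} \sim \mathrm{Dirichlet}(\boldsymbol{\alpha}) \\
\boldsymbol{\theta} \mid \mathbf{x} &\sim \mathrm{Dirichlet}(\boldsymbol{\alpha} + \mathbf{x})
\end{align*}
```

Under an independence assumption of the kind attributed to BAPA-IGGFD, the joint distribution of several such variables is simply the product of the individual beta-binomial terms.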

Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses a Petri net (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and a Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
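
To make the combination of edge-level p-values concrete, the sketch below implements the classical, unweighted form of Fisher's method. It is not DRAGEN's weighted variant, whose weights are not described here, and DRAGEN itself obtains the pathway p-value by permutation rather than from the chi-square reference distribution used below. The function name is hypothetical, and all input p-values are assumed to be strictly positive.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with the unweighted Fisher's method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p) is
    referred to a chi-square distribution with 2k degrees of freedom.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Chi-square survival function for even degrees of freedom (2k):
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical p-values for the three edges of a small pathway.
statistic, combined = fisher_combined_p([0.04, 0.20, 0.01])
```

With a single p-value the function returns that p-value unchanged, which is a convenient sanity check.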

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000 despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations, so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis. Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012), Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population, which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.
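
The random relabeling of pooled control samples can be sketched as follows. This is only an outline of the resampling step under stated assumptions: the function names, the sample identifiers, and the seed are hypothetical, and the pathway analysis that would be run on each simulated dataset (PADOG, GSA, or SPIA) is not included.

```python
import random


def random_null_split(sample_ids, n_control, n_disease, rng):
    """Draw disjoint pseudo-control and pseudo-disease groups from the pool."""
    chosen = rng.sample(sample_ids, n_control + n_disease)
    return chosen[:n_control], chosen[n_control:]


def simulate_null_designs(sample_ids, designs, n_repeats, seed=0):
    """Yield (control, disease) groups for every design and repetition."""
    rng = random.Random(seed)
    for n_control, n_disease in designs:
        for _ in range(n_repeats):
            yield random_null_split(sample_ids, n_control, n_disease, rng)


# 140 pooled control samples and the four group-size designs from the text.
pooled = [f"control_{i}" for i in range(140)]
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]
# The text uses 10,000 repetitions per design; 3 keeps the example fast.
splits = list(simulate_null_designs(pooled, designs, n_repeats=3))
```

Because every sample is a control, any pathway reported as significant on such a split is a false positive by construction.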

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
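
One simple way to quantify how far a pathway's null p-values depart from uniformity is the Kolmogorov-Smirnov distance between their empirical distribution and the uniform distribution on [0, 1]. The sketch below is offered as an illustration only; it is not necessarily the diagnostic used to produce the figure, and the function name is hypothetical.

```python
def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of p_values and the U(0, 1) CDF."""
    ordered = sorted(p_values)
    n = len(ordered)
    return max(
        max((i + 1) / n - p, p - i / n)
        for i, p in enumerate(ordered)
    )


# Evenly spread p-values stay close to uniform; p-values piled up near
# zero (a pathway biased towards false positives) give a distance near 1.
even = [(i + 0.5) / 10 for i in range(10)]
skewed = [0.01] * 10
d_even = ks_distance_from_uniform(even)
d_skewed = ks_distance_from_uniform(skewed)
```

A distance near zero is what a well-calibrated method should produce for every pathway under the null.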

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex bio-molecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways," in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study, when such a pathway is available. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015) using a different set of 36 real datasets, as well as by other more recent papers.
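
The target-pathway criterion reduces to looking up where the target pathway lands in a method's ranked output. The sketch below shows that lookup for a single dataset; the function name, the dictionary input, and the example pathway names and p-values are hypothetical and are not taken from the cited benchmarks.

```python
def target_pathway_rank(p_values_by_pathway, target):
    """Return (rank, p_value) of the target pathway; rank 1 is the best.

    Pathways are ordered by increasing p-value. Ties keep the order in
    which the pathways were supplied, which is an arbitrary choice here.
    """
    ordered = sorted(p_values_by_pathway, key=p_values_by_pathway.get)
    return ordered.index(target) + 1, p_values_by_pathway[target]


# Hypothetical output of one method on a colon cancer dataset.
result = {
    "Insulin signaling pathway": 0.20,
    "Colorectal cancer": 0.002,
    "Apoptosis": 0.01,
}
rank, p_value = target_pathway_rank(result, "Colorectal cancer")
```

Repeating this lookup over all benchmark datasets gives, for each method, a distribution of target ranks and p-values that can be compared across methods.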

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated with not only one target pathway but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088

Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. In Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19–e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions–applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0–making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.

calculate this distribution, so it is assumed that the distribution is known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap (Efron, 1979). Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene-set level, by permuting the values assigned to the genes in the set.
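The hypergeometric assumption described above can be illustrated with a minimal Python sketch. The function name `hypergeom_upper_tail` and the example counts are hypothetical and chosen only for illustration; this is not the code of MetaCore or of any tool discussed here. The sketch computes the probability of observing at least k DE genes on a pathway when the DE genes are treated as independent draws without replacement, which is exactly the independence assumption criticized in the text.

```python
from math import comb


def hypergeom_upper_tail(k, n_de, n_pathway, n_total):
    """Return P(X >= k) for a hypergeometric variable X.

    X counts how many of the n_de DE genes fall on a pathway with
    n_pathway genes, out of n_total measured genes, assuming genes
    are drawn independently without replacement.
    """
    max_k = min(n_de, n_pathway)
    denom = comb(n_total, n_de)
    tail = sum(
        comb(n_pathway, i) * comb(n_total - n_pathway, n_de - i)
        for i in range(k, max_k + 1)
    )
    return tail / denom


# Hypothetical counts: 8 of 200 DE genes fall on a 50-gene pathway,
# with 20,000 genes measured in total.
p_value = hypergeom_upper_tail(k=8, n_de=200, n_pathway=50, n_total=20000)
print(p_value)
```

Because the calculation ignores the pathway graph entirely, any two pathways with the same gene counts receive the same p-value regardless of how their genes interact.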

Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics, and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences between these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD, each random variable is assigned to an edge and is the probability that up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probability follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent; therefore, the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.
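The distributional pairings named above can be written out as standard conjugate-prior identities. The symbols below (counts x out of n trials, success probability θ, hyperparameters α and β, and m edge variables) are notation assumed here for illustration and are not taken from the BPA or BAPA-IGGFD papers; the block only restates the textbook relationships, not the exact models of either method.

```latex
\begin{align*}
% Binomial likelihood with a beta prior on the success probability
x \mid \theta &\sim \mathrm{Binomial}(n, \theta), &
\theta &\sim \mathrm{Beta}(\alpha, \beta), &
\theta \mid x &\sim \mathrm{Beta}(\alpha + x,\ \beta + n - x), \\
% Multinomial likelihood with a Dirichlet prior (the multivariate analogue)
\mathbf{x} \mid \boldsymbol{\theta} &\sim \mathrm{Multinomial}(n, \boldsymbol{\theta}), &
\boldsymbol{\theta} &\sim \mathrm{Dirichlet}(\boldsymbol{\alpha}), &
\boldsymbol{\theta} \mid \mathbf{x} &\sim \mathrm{Dirichlet}(\boldsymbol{\alpha} + \mathbf{x}).
\end{align*}
% Under an independence assumption, the joint density factorizes:
\[
p(\theta_1, \ldots, \theta_m) = \prod_{j=1}^{m} p(\theta_j).
\]
```

The last line is the factorization that an independence assumption implies; it fails whenever two edge variables share a node and are therefore correlated.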

Other approaches

The four methods, DRAGEN, GGEA, ToPASeq, and BLMA, follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach that can be applied in conjunction with any of the four statistical approaches: SPIA (Tarca et al., 2009), GSA (Efron & Tibshirani, 2007), ORA (Tavazoie et al., 1999), or PADOG (Tarca, Draghici, Bhatti, & Romero, 2012). The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the seven different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method (Fisher, 1925). GGEA, on the other hand, uses Petri nets (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
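As a rough illustration of combining edge-level p-values, the sketch below implements the classical, unweighted Fisher's method in Python. This is a simplification and an assumption on our part: DRAGEN uses a weighted variant and derives the pathway p-value by permutation, neither of which is reproduced here. The function name `fisher_combined_pvalue` is hypothetical. The sketch relies on the fact that, for independent p-values, the statistic -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom, whose survival function has a closed form for even degrees of freedom.

```python
from math import exp, log


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with the unweighted Fisher's method."""
    if not pvalues or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # half_stat is X / 2, where X = -2 * sum(ln p_i) is Fisher's statistic.
    half_stat = -sum(log(p) for p in pvalues)
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


# Hypothetical edge-level p-values for one pathway.
edge_pvalues = [0.04, 0.20, 0.01, 0.65]
print(fisher_combined_pvalue(edge_pvalues))
```

The closed form assumes the combined p-values are independent, which edges sharing a gene are not; this is one reason a permutation-based null distribution is used in practice.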

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000, even though KEGG has been updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays, such as Affymetrix HG U133 plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from computational perspectives. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This leads to the unreliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).
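The uniformity property can be checked with a small simulation, sketched below in Python. To keep the example dependency-free, it uses a two-sided z-test with known unit variance instead of the t-test mentioned in the text; this substitution, the function names, and the simulation sizes are all assumptions made for illustration. When both groups are drawn from the same normal distribution, about 5% of the p-values should fall below 0.05 and the empirical distribution should stay close to the uniform one.

```python
import random
from math import erf, sqrt
from statistics import mean


def two_sided_z_pvalue(group_a, group_b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means, known sigma."""
    se = sigma * sqrt(1.0 / len(group_a) + 1.0 / len(group_b))
    z = (mean(group_a) - mean(group_b)) / se
    phi = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))  # standard normal CDF at |z|
    return 2.0 * (1.0 - phi)


def simulate_null_pvalues(n_sim=2000, n_per_group=10, seed=0):
    """Draw both groups from N(0, 1), so the null hypothesis is true."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(n_sim):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        pvals.append(two_sided_z_pvalue(a, b))
    return pvals


def ks_distance_from_uniform(pvals):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    xs = sorted(pvals)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


pvals = simulate_null_pvalues()
frac_below_005 = sum(p < 0.05 for p in pvals) / len(pvals)
print(frac_below_005, ks_distance_from_uniform(pvals))
```

Replacing the normal draws with a skewed or heavy-tailed generator is a simple way to see how violated assumptions distort the null p-value distribution.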

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis: Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012); Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002); and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
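The relabeling scheme described above can be sketched in a few lines of Python. The sample identifiers, the function name `random_splits`, and the small number of repeats are placeholders assumed for illustration; the text uses 10,000 repeats per design and real expression profiles from the nine datasets, and the pathway analysis step itself is not shown.

```python
import random


def random_splits(sample_ids, n_control, n_disease, n_repeats, seed=0):
    """Yield (control, disease) groups drawn at random from pooled controls."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        # Sampling without replacement keeps the two groups disjoint.
        chosen = rng.sample(sample_ids, n_control + n_disease)
        yield chosen[:n_control], chosen[n_control:]


# Placeholder identifiers standing in for the 140 pooled control samples.
samples = [f"control_{i:03d}" for i in range(140)]

# (control size, disease size) designs from the text.
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]

n_repeats = 3  # the text uses 10,000 per design
simulated = {
    design: list(random_splits(samples, design[0], design[1], n_repeats))
    for design in designs
}
print({design: len(groups) for design, groups in simulated.items()})
```

Each simulated pair of groups would then be passed to the pathway analysis method under study, and the resulting p-values collected per pathway to form the empirical null distribution.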

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, lifestyle, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex biomolecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation, and aiming to describe the same phenomena, have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distributions of p-values from GSA; panels C0-C6 (purple) show the distributions of p-values from SPIA. The large panels on the left (A0, B0, and C0) display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions (for example, A1-A3) are biased towards zero, and the lower three distributions (for example, A4-A6) are biased towards one. Since none of these p-value distributions is uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among the false negative results, even if they may be implicated in the given phenotype.
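The difference between the two graph models can be made concrete with a toy data structure. The sketch below is a simplification under stated assumptions: the dictionaries are not the actual KEGG or NCI-PID schemas, the three-gene chain is an invented fragment, and `collapse_process_nodes` is a hypothetical helper showing what "modifying the pathway graph to conform to the method" might involve.

```python
# KEGG-style (simplified): the interaction type is stored on the directed edge itself.
kegg_style_edges = [
    ("INS", "INSR", "activation"),
    ("INSR", "IRS1", "phosphorylation"),
]

# NCI-PID-style (simplified): each interaction is a separate "process node"
# that links its input molecules to its output molecules.
pid_style_processes = [
    {"id": "proc1", "type": "activation", "inputs": ["INS"], "outputs": ["INSR"]},
    {"id": "proc2", "type": "phosphorylation", "inputs": ["INSR"], "outputs": ["IRS1"]},
]

def collapse_process_nodes(processes):
    """Turn process nodes into direct edges (input -> output, labelled by process type)."""
    edges = []
    for proc in processes:
        for source in proc["inputs"]:
            for target in proc["outputs"]:
                edges.append((source, target, proc["type"]))
    return edges

print(collapse_process_nodes(pid_style_processes))
```

Note that collapsing is lossy: a process with two inputs becomes two independent edges, so the information that both inputs were required together is no longer represented.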

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways," in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, so either the pathway input must be modified or the underlying algorithm must be altered to accommodate the differences. As an example, a study by Vaske et al. (2010) attempted to compare SPIA (Tarca et al., 2009) with their tool PARADIGM by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.
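As a rough idea of what a format converter involves, the sketch below reads a small KGML-style document and emits a format-neutral edge list. The XML fragment is hand-written for illustration (the pathway identifier and gene identifiers are placeholders, not real KEGG content), it covers only `entry`, `relation`, and `subtype` elements, and `kgml_to_edge_list` is a hypothetical helper rather than part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of KGML; not copied from a real KEGG file.
KGML_SNIPPET = """
<pathway name="path:hsa00000" org="hsa" number="00000" title="Toy pathway">
  <entry id="1" name="hsa:1111" type="gene"/>
  <entry id="2" name="hsa:2222" type="gene"/>
  <entry id="3" name="hsa:3333" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="inhibition" value="--|"/>
  </relation>
</pathway>
"""

def kgml_to_edge_list(kgml_text):
    """Return (source, target, [interaction subtypes]) tuples from KGML-style XML."""
    root = ET.fromstring(kgml_text.strip())
    # Relations refer to entries by numeric id, so map ids to gene names first.
    id_to_name = {entry.get("id"): entry.get("name") for entry in root.findall("entry")}
    edges = []
    for relation in root.findall("relation"):
        subtypes = [sub.get("name") for sub in relation.findall("subtype")]
        source = id_to_name[relation.get("entry1")]
        target = id_to_name[relation.get("entry2")]
        edges.append((source, target, subtypes))
    return edges

for edge in kgml_to_edge_list(KGML_SNIPPET):
    print(edge)
```

A real converter would also have to handle grouped entries, compounds, and reactions, which is where most of the representational differences between databases show up.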

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study, when such a pathway is available. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and that it relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other, more recent papers.
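The target-pathway criterion reduces to a simple computation once a method has produced one p-value per pathway for each benchmark dataset. The sketch below assumes invented pathway names and p-values, ignores ties, and summarizes with medians; it is meant only to illustrate the idea, not to reproduce the exact scoring used in the cited benchmarks.

```python
from statistics import median

def target_rank_and_p(results, target):
    """Return (rank, p-value) of the target pathway; rank 1 is the most significant."""
    ordered = sorted(results.items(), key=lambda item: item[1])
    for rank, (pathway, p_value) in enumerate(ordered, start=1):
        if pathway == target:
            return rank, p_value
    raise KeyError(target)

# Hypothetical output of one method on two datasets: (target pathway, {pathway: p-value}).
benchmark = [
    ("Colorectal cancer",
     {"Colorectal cancer": 0.001, "Insulin signaling": 0.20, "Apoptosis": 0.03}),
    ("Alzheimer's disease",
     {"Alzheimer's disease": 0.04, "Apoptosis": 0.01, "Insulin signaling": 0.60}),
]

ranks, p_values = zip(*(target_rank_and_p(res, target) for target, res in benchmark))
print("target ranks:", ranks)
print("median rank:", median(ranks), "median target p-value:", median(p_values))
```

Comparing two methods then amounts to comparing these summaries on the same datasets; a lower median rank and a lower median target p-value indicate better prioritization of the target pathway.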

However, in spite of its advantages and great superiority compared to the usual practice of analyzing only a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated not with a single target pathway but with many biological processes. By its nature, this assessment approach ignores other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better, because these knowledge bases contribute complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data are provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such benchmarks are not yet fully reliable.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D. Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing (Vol. 22, p. 390). NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl. 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. arXiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl. 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on (pp. 496–500). Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions – applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl. 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis. Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on (Vol. 1, pp. 126–131), Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0 – making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.


GGEA, on the other hand, uses a Petri net (Murata, 1989) to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and the Petri net with fuzzy logic (PNFL; Kuffner, Petri, Windhager, & Zimmer, 2010). For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
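The permutation step shared by these methods can be sketched generically. The example below does not implement the GGEA consistency score or the DRAGEN statistic: `pathway_score` is a stand-in summary statistic (mean absolute difference in group means), the expression values are simulated, and the gene names are invented. Only the construction of the null distribution by permuting sample labels reflects the text.

```python
import random
from statistics import mean

def pathway_score(expression, labels, pathway_genes):
    """Toy summary statistic: mean absolute difference in group means over pathway genes."""
    diffs = []
    for gene in pathway_genes:
        case = [v for v, lab in zip(expression[gene], labels) if lab == 1]
        ctrl = [v for v, lab in zip(expression[gene], labels) if lab == 0]
        diffs.append(abs(mean(case) - mean(ctrl)))
    return mean(diffs)

def permutation_p_value(expression, labels, pathway_genes, n_perm, rng):
    """Compare the observed score against scores obtained with shuffled sample labels."""
    observed = pathway_score(expression, labels, pathway_genes)
    exceed = 0
    for _ in range(n_perm):
        permuted = rng.sample(labels, len(labels))  # random relabelling of the samples
        if pathway_score(expression, permuted, pathway_genes) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (exceed + 1) / (n_perm + 1)

rng = random.Random(1)
labels = [1] * 10 + [0] * 10  # 10 cases followed by 10 controls
# Hypothetical data: genes g1-g3 are shifted in cases, g4-g6 are not.
expression = {
    f"g{i}": [rng.gauss(2.0 if (lab == 1 and i <= 3) else 0.0, 1.0) for lab in labels]
    for i in range(1, 7)
}
p_shifted = permutation_p_value(expression, labels, ["g1", "g2", "g3"], 999, rng)
p_null = permutation_p_value(expression, labels, ["g4", "g5", "g6"], 999, rng)
print("shifted pathway p =", p_shifted, "| unshifted pathway p =", p_null)
```

Permuting sample labels keeps the gene-gene correlation structure within each sample intact, which is one reason this kind of null distribution is commonly preferred over permuting genes.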

CHALLENGES IN PATHWAY ANALYSIS

Pathway analysis has become the first choice for gaining insights into the underlying biology of a phenotype due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed (Kelder, Conklin, Evelo, & Pico, 2010; Khatri et al., 2012). There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allow us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions (Wang et al., 2008). However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e., information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate (Khatri et al., 2012; Rhee, Wood, Dolinski, & Draghici, 2008). For example, the number of genes in KEGG has remained around 5,000 despite being updated continuously in the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 (Ezkurdia et al., 2014), most of which are included in DNA microarray assays such as Affymetrix HG-U133 Plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also differ from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. (2012) for a more detailed discussion of annotation limitations of existing knowledge bases.

Here, we focus on challenges of pathway analysis from a computational perspective. We demonstrate that there is a systematic bias in pathway analysis (Nguyen, Mitrea, Tagett, & Draghici, 2017). This bias undermines the reliability of most, if not all, pathway analysis approaches. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.

Systematic Bias of Pathway Analysis Methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of different diseases. Null distributions are used to model populations so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis (Barton, Crozier, Lillycrop, Godfrey, & Inskip, 2013; Bland, 2013; Fodor, Tickle, & Richardson, 2007; Storey & Tibshirani, 2003). For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed (Bland, 2013). When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate, since gene expression values do not necessarily follow their assumptions. Here, we also show that the problem extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable (Nguyen et al., 2017).

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis. Gene Set Analysis (GSA; Efron & Tibshirani, 2007) is a Functional Class Scoring method (Efron & Tibshirani, 2007; Mootha et al., 2003; Subramanian et al., 2005; Tarca et al., 2012). Down-weighting of Overlapping Genes (PADOG; Tarca et al., 2012) is an enrichment method (Beißbarth & Speed, 2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
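The resampling scheme just described can be sketched as follows. This is only an outline under stated assumptions: the sample identifiers are placeholders rather than real GEO accessions, the function names are hypothetical, and the number of repetitions is reduced from 10,000 to 5 so that the example runs instantly. Each simulated dataset would then be passed to a pathway analysis method to obtain null p-values, a step not shown here.

```python
import random

def random_null_split(sample_ids, n_control, n_disease, rng):
    """Draw disjoint pseudo-control and pseudo-disease groups from pooled controls."""
    chosen = rng.sample(sample_ids, n_control + n_disease)
    return chosen[:n_control], chosen[n_control:]

def simulate_null_designs(sample_ids, designs, n_rep, seed=0):
    """Repeat the random labeling n_rep times for every (n_control, n_disease) design."""
    rng = random.Random(seed)
    datasets = []
    for n_control, n_disease in designs:
        for _ in range(n_rep):
            datasets.append(random_null_split(sample_ids, n_control, n_disease, rng))
    return datasets

# Placeholder identifiers standing in for the 140 pooled control samples.
pooled_controls = [f"S{i:03d}" for i in range(140)]
# The four group-size designs named in the text.
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]
null_datasets = simulate_null_designs(pooled_controls, designs, n_rep=5)
```

Because both groups are drawn from the same pool of healthy samples, any pathway reported as significant on these datasets is a false positive by construction.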

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compared groups of control samples by experiment, there could be true differences due to batch effects. By pooling them together, we form a population that is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, lifestyle, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be uniformly distributed between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are unlikely to meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
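To make the consequence of nonuniform null p-values concrete, the sketch below compares three synthetic null distributions: one uniform, one skewed towards zero, and one skewed towards one. The skewed distributions are simple power transforms chosen for illustration. They are assumptions of this example and are not the empirical distributions of PADOG, GSA, or SPIA. The helper names are hypothetical.

```python
import random

def ks_distance_from_uniform(p_values):
    """Kolmogorov-Smirnov distance between the empirical CDF and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d = max(d, (i + 1) / n - x, x - i / n)
    return d

def fraction_below(p_values, alpha=0.05):
    """Share of null p-values called significant at level alpha (false positives)."""
    return sum(p <= alpha for p in p_values) / len(p_values)

rng = random.Random(1)
u = [rng.random() for _ in range(10000)]
uniform_p = u                               # what a sound test should produce
toward_zero = [x ** 3 for x in u]           # synthetic bias towards zero
toward_one = [1 - (1 - x) ** 3 for x in u]  # synthetic bias towards one

for name, ps in [("uniform", uniform_p), ("towards zero", toward_zero),
                 ("towards one", toward_one)]:
    print(name, round(ks_distance_from_uniform(ps), 3), round(fraction_below(ps), 3))
```

Under the uniform null, roughly 5% of p-values fall below 0.05, as intended. For the distribution skewed towards zero, a far larger share of purely null results crosses the threshold, while for the one skewed towards one almost nothing does, which mirrors the false-positive and false-negative behavior described above.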

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex biomolecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example, A1-A3, are biased towards zero, and the lower three distributions, for example, A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.
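The difference between the two graph models discussed in this section (interactions stored on directed edges versus interactions stored in separate process nodes) can be illustrated with a toy conversion. The data structure below is a deliberately simplified, hypothetical stand-in: it is not the NCI-PID XML or KGML schema, and the node names A to D are invented.

```python
def flatten_process_nodes(processes):
    """Collapse process nodes into direct (source, target, type) edges."""
    edges = set()
    for proc in processes:
        for src in proc["inputs"]:
            for dst in proc["outputs"]:
                edges.add((src, dst, proc["type"]))
    return sorted(edges)

# Toy bipartite model: each process node lists its inputs and outputs.
toy_processes = [
    {"id": "p1", "type": "activation", "inputs": ["A", "B"], "outputs": ["C"]},
    {"id": "p2", "type": "inhibition", "inputs": ["C"], "outputs": ["D"]},
]

edge_list = flatten_process_nodes(toy_processes)
# edge_list == [("A", "C", "activation"), ("B", "C", "activation"),
#               ("C", "D", "inhibition")]
```

The conversion is lossy. Process p1 states that A and B act together on C, but after flattening the two edges into C look like independent interactions. This is one reason why forcing a method designed for one graph model onto a database built with another can change the results.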

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways," in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, i.e., the pathway that describes the condition under study. For instance, in an experiment comparing colon cancer with healthy tissue, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages: it is reproducible and completely objective, and it relies on more than just a couple of datasets assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015) with a different set of 36 real datasets, as well as by other, more recent papers.
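The target-pathway criterion above lends itself to a short sketch. The dataset names, pathway names, and p-values below are invented for illustration, and summarizing across datasets with the median is one reasonable choice assumed here rather than a prescribed part of the benchmark.

```python
from statistics import median

def target_pathway_rank(p_values, target):
    """Return (rank, p-value) of the target pathway; rank 1 is the smallest p-value."""
    ordered = sorted(p_values, key=p_values.get)  # ties keep insertion order
    return ordered.index(target) + 1, p_values[target]

def summarize_benchmark(results, targets):
    """Median rank and median p-value of the target pathway across datasets."""
    ranks, pvals = [], []
    for dataset, p_values in results.items():
        rank, p = target_pathway_rank(p_values, targets[dataset])
        ranks.append(rank)
        pvals.append(p)
    return median(ranks), median(pvals)

# Hypothetical output of one method on two datasets (pathway -> p-value).
toy_results = {
    "colon_dataset": {"Colorectal cancer": 0.001, "Insulin signaling": 0.2,
                      "Apoptosis": 0.04},
    "alzheimer_dataset": {"Alzheimer's disease": 0.03, "Apoptosis": 0.01,
                          "Insulin signaling": 0.5},
}
toy_targets = {"colon_dataset": "Colorectal cancer",
               "alzheimer_dataset": "Alzheimer's disease"}

median_rank, median_p = summarize_benchmark(toy_results, toy_targets)
```

In this toy case the target pathway ranks first in one dataset and second in the other, so the median rank is 1.5. A method with a lower median rank and a lower median p-value would be judged better under this criterion.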

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of datasets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated not with a single target pathway but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets (about 67%) used in Tarca et al. (2013) and 26 out of 36 datasets (about 72%) used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug-target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better, because these knowledge bases provide complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data are provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are unlikely to meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. In Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19–e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions–applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve G K Nekrutenko A Taylor J amp HovigE (2013) Ten simple rules for reproduciblecomputational research PLoS ComputationalBiology 9(10) e1003285 doi 101371jour-nalpcbi1003285

Sartor M A Leikauf G D amp Medvedovic M(2009) LRpath A logistic regression approachfor identifying enriched biological groups ingene expression data Bioinformatics 25(2)211ndash217 doi 101093bioinformaticsbtn592

Schaefer C F Anthony K Krupa S Buchoff JDay M Hannay T Buetow K H (2009)PID The pathway interaction database NucleicAcids Research 37(Suppl 1) D674ndashD679 doi101093nargkn653

Shojaie A amp Michailidis G (2009) Analysis ofgene sets based on the underlying regulatory net-work Journal of Computational Biology 16(3)407ndash426 doi 101089cmb20080081

Shojaie A amp Michailidis G (2010) Networkenrichment analysis in complex experimentsStatistical Applications in Genetics and Molec-ular Biology 9(1) doi 1022021544-61151483

Storey J D amp Tibshirani R (2003) Statisticalsignificance for genomewide studies Proceed-ings of the National Academy of Sciences of theUnited States of America 100(16) 9440ndash9445doi 101073pnas1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0 - making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources
https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.

2004; Draghici et al., 2003; Khatri, Draghici, Ostermeier, & Krawetz, 2002), and Signaling Pathway Impact Analysis (SPIA; Tarca et al., 2009) is a topology-aware method (Draghici et al., 2007; Tarca et al., 2009). To simulate the null distribution, we download and process the data from nine public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the nine datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
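The resampling design described above can be sketched in a few lines of Python. This is only an illustration of the labeling scheme, not the authors' code: the sample identifiers, the function name `random_splits`, and the seed are assumptions introduced here, and the step that runs a pathway analysis method on each simulated dataset is omitted.

```python
import random

def random_splits(sample_ids, n_control, n_disease, n_repeats, rng):
    """Return n_repeats random (control, disease) label assignments."""
    splits = []
    for _ in range(n_repeats):
        # Sampling without replacement keeps the two groups disjoint.
        picked = rng.sample(sample_ids, n_control + n_disease)
        splits.append((picked[:n_control], picked[n_control:]))
    return splits

# Placeholder identifiers standing in for the 140 pooled control samples.
pooled_controls = [f"control_{i:03d}" for i in range(140)]

# (n_control, n_disease) group sizes listed in the text.
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]

rng = random.Random(2018)  # arbitrary fixed seed, for reproducibility only
null_datasets = {
    design: random_splits(pooled_controls, design[0], design[1], 10_000, rng)
    for design in designs
}

# Four designs with 10,000 random labelings each.
total = sum(len(splits) for splits in null_datasets.values())
```

Because every sample is a true control, any pathway reported as significant on one of these labelings is, by construction, a false positive.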

The effect of combining control (i.e., healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population, which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g., different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g., ethnic subgroups) to be represented equally in both random groups, and therefore we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 8.25.4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values, while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, six per method, display extreme examples of nonuniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g., A1-A3) and three distributions severely biased towards one (e.g., A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly distributed for the three methods considered. Therefore, one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to nonuniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
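To make the link between nonuniform null p-values and error rates concrete, the following Python sketch measures how far a set of null p-values departs from Uniform(0, 1) and how often they fall below a 0.05 cutoff. The p-values here are synthetic: the cubing and cube-root transforms are arbitrary assumptions used only to mimic distributions skewed towards zero or one, and the function names are introduced for this illustration.

```python
import random

def ks_uniform_statistic(p_values):
    """Kolmogorov-Smirnov distance between the empirical CDF and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs):
        # Compare the uniform CDF (x itself) with the empirical CDF
        # just before and just after the i-th ordered p-value.
        distance = max(distance, (i + 1) / n - x, x - i / n)
    return distance

def fraction_significant(p_values, alpha=0.05):
    """Share of null p-values that would be called significant at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

rng = random.Random(1)
uniform_p = [rng.random() for _ in range(10_000)]      # well-behaved null
low_biased_p = [u ** 3 for u in uniform_p]             # skewed towards zero
high_biased_p = [u ** (1 / 3) for u in uniform_p]      # skewed towards one

for name, values in [("uniform", uniform_p),
                     ("biased towards zero", low_biased_p),
                     ("biased towards one", high_biased_p)]:
    print(name, round(ks_uniform_statistic(values), 3),
          round(fraction_significant(values), 4))
```

Under a uniform null, roughly 5% of p-values fall below 0.05. A pathway whose null p-values are skewed towards zero exceeds that rate (false positives), while one skewed towards one almost never reaches the cutoff (false negatives).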

Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero, and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.

Querying Data from Knowledge Bases

Independent research groups have tried different strategies to model complex bio-molecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces "process nodes" to model interactions (see Fig. 8.25.2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.
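The difference between the two graph models can be illustrated with a small Python sketch. It rewrites a list of directed, labeled edges (the KEGG-style view) as a bipartite graph in which every interaction becomes its own process node (the NCI-PID-style view). The three-gene cascade, the node and role labels, and the function name are illustrative assumptions, not data or schema taken from either database.

```python
def to_process_node_graph(edges):
    """Rewrite (source, target, interaction) edges as a bipartite graph.

    Each interaction becomes a separate process node linked to its
    input molecule and its output molecule.
    """
    nodes = {}   # node name -> node kind
    arcs = []    # (from_node, to_node, role)
    for index, (source, target, interaction) in enumerate(edges):
        process = f"process_{index}"
        nodes[source] = "molecule"
        nodes[target] = "molecule"
        nodes[process] = f"process:{interaction}"
        arcs.append((source, process, "input"))
        arcs.append((process, target, "output"))
    return nodes, arcs

# Toy edge list in the edge-centric style (hypothetical example).
edge_model = [
    ("INS", "INSR", "activation"),
    ("INSR", "IRS1", "phosphorylation"),
]

nodes, arcs = to_process_node_graph(edge_model)
for arc in arcs:
    print(arc)
```

A method that propagates scores along gene-to-gene edges has no direct counterpart for the process nodes, which is one reason a tool written for one model cannot simply be pointed at pathways from the other.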

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways," in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.
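As a rough idea of what a format converter involves, the sketch below reads a KGML-like fragment with Python's standard XML parser and flattens it into a plain edge list. The fragment is hand-written for illustration: it follows the `entry`/`relation`/`subtype` element layout of KGML, but the pathway, gene identifiers, and relations are placeholders, and real KGML files carry more detail (for example, several gene identifiers in one entry name, graphics elements, and reactions) that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment in the style of KGML; all identifiers are placeholders.
KGML_FRAGMENT = """<pathway name="path:toy" org="hsa" number="00000" title="Toy pathway">
  <entry id="1" name="hsa:0001" type="gene"/>
  <entry id="2" name="hsa:0002" type="gene"/>
  <entry id="3" name="hsa:0003" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="inhibition" value="--|"/>
  </relation>
</pathway>"""

def kgml_to_edges(kgml_text):
    """Flatten entry/relation elements into (source, target, subtypes) tuples."""
    root = ET.fromstring(kgml_text)
    # Map the file-local entry ids to the names they stand for.
    names = {entry.get("id"): entry.get("name") for entry in root.iter("entry")}
    edges = []
    for relation in root.iter("relation"):
        subtypes = [s.get("name") for s in relation.findall("subtype")]
        edges.append((names[relation.get("entry1")],
                      names[relation.get("entry2")],
                      subtypes))
    return edges

edges = kgml_to_edges(KGML_FRAGMENT)
for edge in edges:
    print(edge)
```

A neutral edge list like this is lossy by design; a converter aiming at BioPAX or NCI-PID would also have to decide how to map each relation subtype onto the target format's interaction vocabulary.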

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, that is, the pathway describing the condition under study, when one is available. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other, more recent papers.
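The target-pathway criterion can be expressed as a short Python sketch: for each dataset, find where the target pathway lands in a method's ranked output, then summarize across datasets. The pathway names and p-values below are invented for illustration, summarizing by the median is an assumption made here rather than a description of the cited benchmarks, and ties between p-values are not handled.

```python
from statistics import median

def target_rank(pathway_p_values, target):
    """Return the 1-based rank and the p-value of the target pathway.

    Pathways are ordered by increasing p-value, so rank 1 is the most
    significant pathway in the method output.
    """
    ordered = sorted(pathway_p_values, key=pathway_p_values.get)
    return ordered.index(target) + 1, pathway_p_values[target]

def summarize(results):
    """Median rank and median p-value of the target pathway across datasets."""
    ranks, p_values = [], []
    for pathway_p_values, target in results:
        rank, p_value = target_rank(pathway_p_values, target)
        ranks.append(rank)
        p_values.append(p_value)
    return median(ranks), median(p_values)

# Hypothetical output of one method on two datasets: (p-values, target pathway).
method_output = [
    ({"Colorectal cancer": 0.001, "Cell cycle": 0.02, "Insulin signaling": 0.40},
     "Colorectal cancer"),
    ({"Alzheimer's disease": 0.03, "Cell cycle": 0.01, "Insulin signaling": 0.70},
     "Alzheimer's disease"),
]

median_rank, median_p = summarize(method_output)
print(median_rank, median_p)
```

Running the same summary for several methods on the same public datasets gives a comparison that does not depend on the authors' reading of the literature, although it still scores only one pathway per dataset.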

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated not with a single target pathway but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer. Specifically, 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, MD Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088

Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PLoS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl. 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19–e19. doi: 10.1093/nar/gks866

Massa M S Chiogna M amp Romualdi C (2010)Gene set analysis exploiting the topology of a

Network-BasedApproaches forPathway Level

Analysis

82522

Supplement 61 Current Protocols in Bioinformatics

pathway BMC Systems Biology 4(1) 121 doihttpsdoiorg1011861752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl. 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions–applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl. 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0–making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1


Figure 8.25.4 The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG, top), Gene Set Analysis (GSA, middle), and Signaling Pathway Impact Analysis (SPIA, bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distributions of p-values from GSA; panels C0-C6 (purple) show the distributions of p-values from SPIA. The large panels on the left (A0, B0, and C0) display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions (for example, A1-A3) are biased towards zero, and the lower three distributions (for example, A4-A6) are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results, even if they may be implicated in the given phenotype.
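The resampling idea behind this figure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used to produce the figure: the helper names `null_pvalues`, `ks_distance_from_uniform`, and `false_positive_rate` are hypothetical, `method` stands for any pathway analysis tool wrapped as a function that returns one p-value, and the group size, iteration count, and synthetic p-values are arbitrary choices made only for the example.

```python
import random


def null_pvalues(control_samples, method, n_iter=200, group_size=15, seed=0):
    """Collect p-values for one pathway when both groups are controls.

    `method(group_a, group_b)` is a hypothetical wrapper around a pathway
    analysis tool; it must return a single p-value in [0, 1].
    """
    rng = random.Random(seed)
    pvalues = []
    for _ in range(n_iter):
        # Draw two disjoint pseudo-groups from the pool of control samples.
        drawn = rng.sample(control_samples, 2 * group_size)
        pvalues.append(method(drawn[:group_size], drawn[group_size:]))
    return pvalues


def ks_distance_from_uniform(pvalues):
    """Kolmogorov-Smirnov distance between the p-values and Uniform(0, 1)."""
    xs = sorted(pvalues)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def false_positive_rate(pvalues, alpha=0.05):
    """Fraction of null p-values that fall below the significance cutoff."""
    return sum(p < alpha for p in pvalues) / len(pvalues)


# Synthetic stand-ins for the three shapes seen in the figure (not real data).
grid = [(i + 0.5) / 1000 for i in range(1000)]   # approximately uniform
toward_zero = [u * u for u in grid]              # biased towards zero
toward_one = [1 - (1 - u) ** 2 for u in grid]    # biased towards one

for label, ps in [("uniform", grid), ("towards zero", toward_zero),
                  ("towards one", toward_one)]:
    print(label, round(ks_distance_from_uniform(ps), 3),
          round(false_positive_rate(ps), 3))
```

With these synthetic inputs, the distribution biased towards zero crosses the 0.05 cutoff in roughly 22% of the null draws, while the one biased towards one does so in roughly 2.5%. A uniform null would give 5%, which is the behavior the significance threshold assumes.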

novel pathway databases or modifying the actual pathway graphs to conform to the method.

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (Chuang, Hofree, & Ideker, 2010). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML; Beltrame et al., 2011). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house "NCI-Nature curated pathways", in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, requiring either modification of pathway input or alteration of the underlying algorithm in order to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. A solution to overcome this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.
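As a rough illustration of what a conversion tool involves, the sketch below reads a small KGML-style document and emits a format-neutral edge list. It is only a sketch under assumptions: the XML snippet is invented for the example (the gene identifiers are made up), it uses only the `entry`, `relation`, and `subtype` elements of KGML, and the function name `kgml_to_edges` and the `(source, target, interaction)` tuple layout are hypothetical choices rather than an established standard.

```python
import xml.etree.ElementTree as ET

# Invented KGML-style snippet; identifiers are placeholders, not real genes.
KGML_SNIPPET = """
<pathway name="path:demo" org="hsa">
  <entry id="1" name="hsa:1111" type="gene"/>
  <entry id="2" name="hsa:2222" type="gene"/>
  <entry id="3" name="hsa:3333" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="inhibition" value="--|"/>
  </relation>
</pathway>
"""


def kgml_to_edges(xml_text):
    """Convert KGML-style relations into (source, target, interaction) tuples."""
    root = ET.fromstring(xml_text.strip())
    # Map the pathway-local entry ids to the gene names they stand for.
    names = {entry.get("id"): entry.get("name") for entry in root.iter("entry")}
    edges = []
    for rel in root.iter("relation"):
        kinds = [s.get("name") for s in rel.iter("subtype")] or ["unknown"]
        for kind in kinds:
            edges.append((names[rel.get("entry1")], names[rel.get("entry2")], kind))
    return edges


edges = kgml_to_edges(KGML_SNIPPET)
```

A reader for another format, such as BioPAX, would need its own parser but could emit the same tuples, so that an analysis method is written once against the neutral representation instead of being re-implemented for each database.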

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and that it relies on more than just a couple of datasets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015) with a different set of 36 real datasets, as well as by other more recent papers.
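The target-pathway criterion can be made concrete with a short sketch. The code below is an illustration only: the function names, the use of the relative rank (rank divided by the number of pathways tested), the choice of medians as the summary, and the p-values in the toy benchmark are all assumptions made for the example, not values or procedures taken from the cited studies.

```python
import statistics


def target_rank(results, target):
    """Return (rank, p_value) of the target pathway in one method output.

    `results` maps pathway name -> p-value; rank 1 is the most significant.
    Tied pathways share the best rank among them.
    """
    p_target = results[target]
    rank = 1 + sum(p < p_target for p in results.values())
    return rank, p_target


def summarize(benchmark):
    """Median relative rank (%) and median p-value of the target pathways.

    `benchmark` is a list of (results, target) pairs, one per dataset.
    """
    rel_ranks, pvals = [], []
    for results, target in benchmark:
        rank, p = target_rank(results, target)
        rel_ranks.append(100.0 * rank / len(results))
        pvals.append(p)
    return statistics.median(rel_ranks), statistics.median(pvals)


# Made-up outputs of one method on two datasets (illustrative values only).
benchmark = [
    ({"Colorectal cancer": 0.001, "Apoptosis": 0.02, "Cell cycle": 0.30},
     "Colorectal cancer"),
    ({"Alzheimer disease": 0.20, "Apoptosis": 0.01, "Cell cycle": 0.05,
      "Wnt signaling": 0.50}, "Alzheimer disease"),
]
median_rank, median_p = summarize(benchmark)
```

Two methods can then be compared on the same public datasets by their median relative rank and median target p-value, with lower values indicating better performance under this criterion.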

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of datasets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated not with a single target pathway but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer. Specifically, 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both "AND" and "OR" gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.
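To show why a second node type helps, the toy sketch below models interactions as their own nodes, each carrying an "AND" or "OR" gate that connects input genes to an output gene. This is a deliberately simplified Boolean illustration: the function name `evaluate`, the triple layout, and the gene labels are hypothetical, and it does not reproduce the scoring of any of the surveyed methods.

```python
def evaluate(interactions, active):
    """Propagate activity through AND/OR interaction nodes until stable.

    `interactions` is a list of (gate, inputs, output) triples, where gate is
    "AND" or "OR"; `active` is the set of initially active genes.
    """
    state = set(active)
    changed = True
    while changed:
        changed = False
        for gate, inputs, output in interactions:
            # AND needs every input active; OR needs at least one.
            combine = all if gate == "AND" else any
            if output not in state and combine(g in state for g in inputs):
                state.add(output)
                changed = True
    return state


# Hypothetical pathway: C requires both A and B; E requires C or D.
interactions = [("AND", ["A", "B"], "C"), ("OR", ["C", "D"], "E")]
reached_from_a = evaluate(interactions, {"A"})
reached_from_ab = evaluate(interactions, {"A", "B"})
```

A plain gene-gene graph with edges A to C and B to C cannot distinguish the case where both inputs are required from the case where either suffices. Placing the gate on a separate interaction node keeps that distinction explicit.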

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PLoS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl. 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092


Shojaie A amp Michailidis G (2010) Networkenrichment analysis in complex experimentsStatistical Applications in Genetics and Molec-ular Biology 9(1) doi 1022021544-61151483

Storey J D amp Tibshirani R (2003) Statisticalsignificance for genomewide studies Proceed-ings of the National Academy of Sciences of theUnited States of America 100(16) 9440ndash9445doi 101073pnas1530509100

Subramanian A Tamayo P Mootha V KMukherjee S Ebert B L Gillette M A Mesirov J P (2005) Gene set enrichmentanalysis A knowledge-based approach for inter-preting genome-wide expression profiles Pro-ceeding of The National Academy of Sciences of

AnalyzingMolecularInteractions

82523

Current Protocols in Bioinformatics Supplement 61

the Unites States of America 102(43) 15545ndash15550 doi 101073pnas0506580102

Tan P K Downey T J Spitznagel E L Jr XuP Fu D Dimitrov D S Cam M C(2003) Evaluation of gene expression measure-ments from commercial microarray platformsNucleic Acids Research 31(19) 5676ndash5684doi 101093nargkg763

Tarca A L Bhatti G amp Romero R (2013)A comparison of gene set analysis meth-ods in terms of sensitivity prioritization andspecificity PloS One 8(11) e79217 doi101371journalpone0079217

Tarca A L Draghici S Bhatti G ampRomero R (2012) Down-weighting over-lapping genes improves gene set analy-sis BMC Bioinformatics 13(1) 136 doihttpsdoiorg1011861471-2105-13-136

Tarca A L Draghici S Khatri P HassanS S Mittal P Kim J-S Romero R(2009) A novel signaling pathway impact anal-ysis (SPIA) Bioinformatics 25(1) 75ndash82 doi101093bioinformaticsbtn577

Tarca A L Draghici S Khatri P Hassan S SMittal P Kim J-S Romero R (2009) Anovel signaling pathway impact analysis Bioin-formatics 25(1) 75ndash82 doi 101093bioinfor-maticsbtn577

Tavazoie S Hughes J D Campbell M J ChoR J amp Church G M (1999) Systematic de-termination of genetic network architecture Na-ture Genetics 22 281ndash285 doi 10103810343

Varadan V Mittal P Vaske C J amp Benz SC (2012) The integration of biological path-way knowledge in cancer genomics A reviewof existing computational approaches SignalProcessing Magazine IEEE 29(1) 35ndash50 doi101109MSP2011943037

Vaske C J Benz S C Sanborn J Z Earl DSzeto C Zhu J Stuart J M (2010) Infer-ence of patient-specific pathway activities frommulti-dimensional cancer genomics data us-

ing PARADIGM Bioinformatics 26(12) i237ndashi245 doi 101093bioinformaticsbtq182

Voichita C Donato M amp Draghici S (2012) In-corporating gene significance in the impact anal-ysis of signaling pathways In Machine Learningand Applications (ICMLA) 2012 11th Interna-tional Conference on volume 1 pages 126ndash131Boca Raton FL USA 12ndash15 December 2012Piscataway NJ IEEE

Wang E T Sandberg R Luo S Khrebtukova IZhang L Mayr C Burge C B (2008)Alternative isoform regulation in human tis-sue transcriptomes Nature 456(7221) 470 doi101038nature07509

Xia J amp Wishart D S (2010) MetPAA web-based metabolomics tool for path-way analysis and visualization Bioinformatics26(18) 2342ndash2344 doi 101093bioinformat-icsbtq418

Xia J amp Wishart D S (2011) Web-based infer-ence of biological patterns functions and path-ways from metabolomic data using Metabo-Analyst Nature Protocols 6(6) 743 doi101038nprot2011319

Xia J Sinelnikov I V Han B ampWishart D S (2015) MetaboAnalyst 30mdashmaking metabolomics more meaningful Nu-cleic Acids Research 43(W1) W251ndashW257doi 101093nargkv380

Zhao Y Chen M-H Pei B Rowe D Shin D-G Xie W Kuo L (2012) A Bayesian ap-proach to pathway analysis by integrating genendashgene functional directions and microarray dataStatistics in Biosciences 4(1) 105ndash131 doi101007s12561-011-9046-1

Internet ResourceshttpswwwbiocartacomBioCarta Charting Pathways of Life

httpsbioconductororgpackagesreleasebiocmanualsBLMAmanBLMApdf

BLMA A package for bi-level meta-analysis Rpackage version 1

Network-BasedApproaches forPathway Level

Analysis

82524

Supplement 61 Current Protocols in Bioinformatics


pathways” in NCI-PID format (Schaefer et al., 2009). In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. None of the tools is compatible with all database formats, so either the pathway input must be modified or the underlying algorithm must be altered to accommodate the differences. As an example, a study by Vaske et al. (2010) attempts to compare SPIA (Tarca et al., 2009) with their tool, PARADIGM, by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. One solution to this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX (Demir et al., 2010) as the lingua franca for this domain.

Missing Benchmark Datasets and Comparison

Newly developed approaches are typically assessed by simulated data or by well-studied biological datasets (Bayerlova et al., 2015; Varadan, Mittal, Vaske, & Benz, 2012). The advantage of using simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect (Sackett, 1979). Thus, such results are not objective and, many times, they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. (2013) using 42 real datasets. This approach uses a target pathway, which is the available pathway that describes the condition under study. For instance, in an experiment with colon cancer versus healthy, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of the datasets. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages, including the fact that it is reproducible and completely objective, and that it relies on more than just a couple of data sets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. (2015), using a different set of 36 real datasets, as well as by other more recent papers.
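The target-pathway criterion described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the dataset names, pathway names, and p-values are invented, the function name target_pathway_summary is hypothetical, and summarizing by the median is an assumption rather than the procedure of Tarca et al. (2013).

```python
from statistics import median


def target_pathway_summary(results, targets):
    """Rank and p-value of the target pathway in each dataset.

    results: {dataset: {pathway: p_value}} as produced by some method.
    targets: {dataset: name of the target pathway for that dataset}.
    Returns ({dataset: (rank, p_value)}, median rank, median p-value).
    """
    per_dataset = {}
    for dataset, pvalues in results.items():
        target = targets[dataset]
        # Order pathways from most to least significant; rank 1 is best.
        ordered = sorted(pvalues, key=pvalues.get)
        rank = ordered.index(target) + 1
        per_dataset[dataset] = (rank, pvalues[target])
    median_rank = median(r for r, _ in per_dataset.values())
    median_p = median(p for _, p in per_dataset.values())
    return per_dataset, median_rank, median_p


# Hypothetical toy output of one pathway analysis method on two datasets.
results = {
    "colon_ds": {"Colorectal cancer": 0.001, "Cell cycle": 0.02,
                 "Alzheimer disease": 0.40},
    "alz_ds": {"Colorectal cancer": 0.03, "Cell cycle": 0.20,
               "Alzheimer disease": 0.04},
}
targets = {"colon_ds": "Colorectal cancer", "alz_ds": "Alzheimer disease"}

per_dataset, median_rank, median_p = target_pathway_summary(results, targets)
print(per_dataset, median_rank, median_p)
```

Running the same function on the output of two methods over the same public datasets gives directly comparable summaries; lower median rank and lower median p-value indicate better performance under this criterion.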

However, in spite of its advantages and great superiority compared to the usual method of only analyzing a couple of data sets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or described in the literature. Second, complex diseases are often associated with not only one target pathway, but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets used in Tarca et al. (2013) and 26 out of 36 datasets used in Bayerlova et al. (2015) are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and not reliable for assessing the performance of existing approaches.

CONCLUSIONS

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks, to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This unit discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both “AND” and “OR” gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.
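The sensitivity to cutoff parameters can be illustrated with a simple over-representation analysis based on the hypergeometric distribution. The Python sketch below is a toy example, not the implementation of any tool reviewed in this unit: the gene names, gene-level p-values, and pathway membership are invented, and the helper names ora_pvalue and ora_at_cutoff are hypothetical.

```python
from math import comb


def ora_pvalue(n_genes, n_pathway, n_de, n_de_in_pathway):
    """Upper-tail hypergeometric probability of observing at least
    n_de_in_pathway DE genes in a pathway of size n_pathway."""
    upper = min(n_pathway, n_de)
    total = comb(n_genes, n_de)
    favorable = sum(
        comb(n_pathway, k) * comb(n_genes - n_pathway, n_de - k)
        for k in range(n_de_in_pathway, upper + 1)
    )
    return favorable / total


def ora_at_cutoff(gene_pvalues, pathway, cutoff):
    """Select DE genes with a hard cutoff, then test the pathway."""
    de = {gene for gene, p in gene_pvalues.items() if p < cutoff}
    return ora_pvalue(len(gene_pvalues), len(pathway),
                      len(de), len(de & pathway))


# Hypothetical gene-level p-values for 20 measured genes.
gene_pvalues = {"g1": 0.001, "g2": 0.004, "g3": 0.03, "g4": 0.04,
                "g5": 0.6, "g6": 0.002, "g7": 0.008, "g8": 0.2}
gene_pvalues.update({f"g{i}": 0.5 for i in range(9, 21)})  # g9..g20
pathway = {"g1", "g2", "g3", "g4", "g5"}  # hypothetical pathway

p_at_005 = ora_at_cutoff(gene_pvalues, pathway, cutoff=0.05)
p_at_001 = ora_at_cutoff(gene_pvalues, pathway, cutoff=0.01)
print(round(p_at_005, 4), round(p_at_001, 4))
```

With these toy values, the same pathway obtains a p-value of about 0.014 when DE genes are selected at 0.05, but about 0.25 when they are selected at 0.01, so the choice of cutoff alone decides whether the pathway is reported as significant.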

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
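Whether the null p-values of a pathway are uniformly distributed can be checked with a one-sample Kolmogorov-Smirnov test against Uniform(0, 1). The Python sketch below uses only the standard library and is an illustration under stated assumptions, not the procedure used by the authors: both input lists are synthetic, the function name ks_uniform is hypothetical, and 1.358 is the usual large-sample critical coefficient at the 5% level.

```python
import math


def ks_uniform(pvalues, coefficient=1.358):
    """One-sample Kolmogorov-Smirnov check against Uniform(0, 1).

    Returns (D, reject), where D is the largest distance between the
    empirical CDF of the p-values and the line F(x) = x, and reject is
    True when D exceeds the asymptotic 5% critical value.
    """
    xs = sorted(pvalues)
    n = len(xs)
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    critical = coefficient / math.sqrt(n)
    return d, d > critical


# Synthetic null p-values for two hypothetical pathways.
uniform_like = [(i + 0.5) / 100 for i in range(100)]  # evenly spread
skewed = [u ** 3 for u in uniform_like]               # piled up near zero

d_uniform, reject_uniform = ks_uniform(uniform_like)
d_skewed, reject_skewed = ks_uniform(skewed)
print(d_uniform, reject_uniform, d_skewed, reject_skewed)
```

A pathway whose null p-values behave like the second list would be called significant far more often than the nominal threshold implies, which is the false-positive bias described above.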

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

LITERATURE CITED

Ansari, S., Voichita, C., Donato, M., Tagett, R., & Draghici, S. (2017). A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3), 482–495. doi: 10.1109/JPROC.2016.2531000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... Sherlock, G. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29. doi: 10.1038/75556

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., ... Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1), D991–D995. doi: 10.1093/nar/gks1193

Barton, S. J., Crozier, S. R., Lillycrop, K. A., Godfrey, K. M., & Inskip, H. M. (2013). Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 14(1), 161. doi: 10.1186/1471-2164-14-161

Bayerlova, M., Jung, K., Kramer, F., Klemm, F., Bleckmann, A., & Beißbarth, T. (2015). Comparative study on gene set and pathway topology-based enrichment methods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5

Beißbarth, T., & Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465. doi: 10.1093/bioinformatics/bth088


Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., ... Cavalieri, D. (2011). The biological connection markup language: A SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics, 27(15), 2127–2133. doi: 10.1093/bioinformatics/btr339

Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7), 1129–1137. doi: 10.1093/bioinformatics/bti149

Bland, M. (2013). Do baseline p-values follow a uniform distribution in randomised trials? PloS One, 8(10), e76010. doi: 10.1371/journal.pone.0076010

Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. H., & Draghici, S. (2016). SPATIAL: A system-level PAThway impact AnaLysis approach. Nucleic Acids Research, 44(11), 5034–5044. doi: 10.1093/nar/gkw429

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S.-A. (2003). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71. doi: 10.1093/nar/gkg091

Calura, E., Martini, P., Sales, G., Beltrame, L., Chiorino, G., D'Incalci, M., ... Romualdi, C. (2014). Wiring miRNAs to pathways: A topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11), e96. doi: 10.1093/nar/gku354

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., ... Larsson, E. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5), 401–404. doi: 10.1158/2159-8290.CD-12-0095

Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A decade of systems biology. Annual Review of Cell and Developmental Biology, 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., ... D'Eustachio, P. (2014). The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1), D472–D477. doi: 10.1093/nar/gkt1102

Demir, E., Cary, M. P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., ... Bader, G. D. (2010). The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9), 935–942. doi: 10.1038/nbt.1666

Diaz, D., Donato, M., Nguyen, T., & Draghici, S. (2016). MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access.

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., & Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2), 98–104. doi: 10.1016/S0888-7543(02)00021-6

Draghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichita, C., ... Romero, R. (2007). A systems biology approach for pathway level analysis. Genome Research, 17(10), 1537–1545. doi: 10.1101/gr.6202607

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210. doi: 10.1093/nar/30.1.207

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552

Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1), 107–129. doi: 10.1214/07-AOAS101

Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5), e425. doi: 10.1371/journal.pone.0000425

Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21(2), 171–178. doi: 10.1093/bioinformatics/bth469

Ein-Dor, L., Zuk, O., & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. In Proceedings of the National Academy of Sciences, 103(15), 5923–5928. doi: 10.1073/pnas.0601231103

Emmert-Streib, F., & Glazko, G. V. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Computational Biology, 7(5), e1002053. doi: 10.1371/journal.pcbi.1002053

Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, A., Diekhans, M., Harrow, J., ... Tress, M. L. (2014). Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics, 23(22), 5866–5878. doi: 10.1093/hmg/ddu309

Fang, Z., Tian, W., & Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565–580. doi: 10.1038/cr.2011.149

Farfan, F., Ma, J., Sartor, M. A., Michailidis, G., & Jagadish, H. V. (2012). THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.

Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biology, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi: 10.1126/scisignal.2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19–e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions–applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0–making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.

Network-BasedApproaches forPathway Level

Analysis

82524

Supplement 61 Current Protocols in Bioinformatics

Page 20: Network‐Based Approaches for ... - Bioinformatics LabNetwork-Based Approaches for Pathway UNIT 8.25 Level Analysis Tin Nguyen,1 Cristina Mitrea,2 and Sorin Draghici2,3 1Department

Despite being widely used, employing DE genes as input makes the software sensitive to cutoff parameters. Regarding the graph models, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to their complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.
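As a minimal sketch of why a second node type helps, the Python fragment below represents a pathway as a bipartite structure, with gene nodes on one side and gate nodes on the other. With gene-to-gene edges alone, two edges into the same target cannot say whether both regulators are required (AND) or either one suffices (OR); a gate node carries that information explicitly. The `Gate` class, the `propagate` function, and the gene names A through E are hypothetical and chosen only for illustration; they are not taken from any of the surveyed tools.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Gate:
    """Hypothetical interaction node linking input genes to one output gene."""
    kind: str                 # "AND": all inputs required; "OR": any input suffices
    inputs: Tuple[str, ...]   # names of the upstream gene nodes
    output: str               # name of the downstream gene node


def propagate(active: Dict[str, bool], gates: List[Gate]) -> Dict[str, bool]:
    """Return gene states after evaluating the gates.

    Assumes the gates are listed in topological order, so every input of a
    gate is already known when that gate is evaluated. Genes that are not
    mentioned in `active` are treated as inactive.
    """
    state = dict(active)
    for gate in gates:
        values = [state.get(name, False) for name in gate.inputs]
        if gate.kind == "AND":
            state[gate.output] = all(values)
        elif gate.kind == "OR":
            state[gate.output] = any(values)
        else:
            raise ValueError(f"unknown gate kind: {gate.kind!r}")
    return state


# Toy pathway: C needs both A and B (e.g., a complex); E needs C or D.
example_gates = [
    Gate("AND", ("A", "B"), "C"),
    Gate("OR", ("C", "D"), "E"),
]

only_a = propagate({"A": True, "B": False, "D": False}, example_gates)
both = propagate({"A": True, "B": True, "D": False}, example_gates)
print(only_a["C"], only_a["E"])  # False False: the AND gate blocks the signal
print(both["C"], both["E"])      # True True: both subunits are present
```

A gene-only graph would draw the same two edges into C in both scenarios, so the two outcomes could not be distinguished without the gate node.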

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data is provided pose a real challenge for implementation. Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform with their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from nine mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias due to non-uniformity of p-value distributions results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.
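The link between non-uniform null p-values and false positives can be checked numerically. The sketch below is illustrative only: it assumes a list of p-values that one pathway received across comparisons in which no real difference exists, and it uses a hand-written Kolmogorov-Smirnov distance to Uniform(0, 1) with the standard asymptotic tail formula. The function names and the two deterministic input grids are hypothetical; they stand in for real control data and are not the nine mRNA datasets mentioned above.

```python
import math
from typing import Sequence


def false_positive_rate(null_pvalues: Sequence[float], alpha: float = 0.05) -> float:
    """Fraction of null p-values below alpha; near alpha when they are uniform."""
    return sum(p < alpha for p in null_pvalues) / len(null_pvalues)


def ks_uniform_statistic(pvalues: Sequence[float]) -> float:
    """Largest gap between the empirical CDF of the p-values and Uniform(0, 1)."""
    xs = sorted(pvalues)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # Compare the uniform CDF (x) with the empirical CDF just after and just before x.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d


def ks_asymptotic_pvalue(d: float, n: int, terms: int = 100) -> float:
    """Asymptotic Kolmogorov tail probability; an approximation for moderately large n."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the tail probability is numerically indistinguishable from 1 here
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


n = 100
# Evenly spaced values: what well-behaved null p-values look like.
uniform_like = [(i - 0.5) / n for i in range(1, n + 1)]
# Cubing pushes the values towards zero, mimicking a pathway biased towards significance.
biased_to_zero = [u ** 3 for u in uniform_like]

for name, pvals in [("uniform-like", uniform_like), ("biased to zero", biased_to_zero)]:
    d = ks_uniform_statistic(pvals)
    print(name, false_positive_rate(pvals), round(d, 3), ks_asymptotic_pvalue(d, n))
```

In this toy setting, the uniform-like grid flags 5% of null comparisons at a 0.05 threshold, while the grid biased towards zero flags 37%, even though no true effect is present in either case.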

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

CONFLICT-OF-INTEREST STATEMENT

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

ACKNOWLEDGMENTS

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, M.D., Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.


Page 21: Network‐Based Approaches for ... - Bioinformatics LabNetwork-Based Approaches for Pathway UNIT 8.25 Level Analysis Tin Nguyen,1 Cristina Mitrea,2 and Sorin Draghici2,3 1Department

Beltrame L Calura E Popovici R R RizzettoL Guedez D R Donato M Cavalieri D(2011) The biological connection markup lan-guage A SBGN-compliant format for visual-ization filtering and analysis of biological path-ways Bioinformatics 27(15) 2127ndash2133 doi101093bioinformaticsbtr339

Ben-Shaul Y Bergman H amp Soreq H (2005)Identifying subtle interrelated changes in func-tional gene categories using continuous mea-sures of gene expression Bioinformatics21(7) 1129ndash1137 doi 101093bioinformat-icsbti149

Bland M (2013) Do baseline p-values followa uniform distribution in randomised trialsPloS One 8(10) e76010 doi 101371jour-nalpone0076010

Bokanizad B Tagett R Ansari S Helmi BH amp Draghici S (2016) SPATIAL A system-level PAThway impact AnaLysis approach Nu-cleic Acids Research 44(11) 5034ndash5044 doi101093nargkw429

Brazma A Parkinson H Sarkans U Shojata-lab M Vilo J Abeygunawardena N Sansone S-A (2003) ArrayExpressmdasha publicrepository for microarray gene expression dataat the EBI Nucleic Acids Research 31(1) 68ndash71 doi 101093nargkg091

Calura E Martini P Sales G Beltrame LChiorino G DrsquoIncalci M Romualdi C(2014) Wiring miRNAs to pathways A topo-logical approach to integrate miRNA and mRNAexpression profiles Nucleic Acids Research42(11) e96 doi 101093nargku354

Cerami E Gao J Dogrusoz U GrossB E Sumer S O Aksoy B A Larsson E (2012) The cBio cancer ge-nomics portal An open platform for ex-ploring multidimensional cancer genomicsdata Cancer Discovery 2(5) 401ndash404 doi1011582159-8290CD-12-0095

Chuang H-Y Hofree M amp Ideker T (2010) Adecade of systems biology Annual Review ofCell and Developmental Biology 26 721ndash744doi 101146annurev-cellbio-100109-104122

Croft D Mundo A F Haw R Milacic MWeiser J Wu G DrsquoEustachio P (2014)The Reactome pathway knowledgebase Nu-cleic Acids Research 42(D1) D472ndashD477 doi101093nargkt1102

Demir E Cary M P Paley S Fukuda K LemerC Vastrik I Bader G D (2010) TheBioPAX community standard for pathway datasharing Nature Biotechnology 28(9) 935ndash942doi 101038nbt1666

Diaz D Donato M Nguyen T amp Draghici S(2016) MicroRNA-augmented pathways (mi-rAP) and their applications to pathway analy-sis and disease subtyping In Pacific Symposiumon Biocomputing Pacific Symposium on Bio-computing volume 22 page 390 NIH PublicAccess

Draghici S Khatri P Martins R P Os-termeier G C amp Krawetz S A (2003)Global functional profiling of gene ex-

pression Genomics 81(2) 98ndash104 doi101016S0888-7543(02)00021-6

Draghici S Khatri P Tarca A L Amin KDone A Voichita C Romero R (2007)A systems biology approach for pathway levelanalysis Genome Research 17(10) 1537ndash1545doi 101101gr6202607

Edgar R Domrachev M amp Lash A E (2002)Gene Expression Omnibus NCBI gene expres-sion and hybridization array data repositoryNucleic Acids Research 30(1) 207ndash210 doi101093nar301207

Efron B (1979) Bootstrap methods Another lookat the jackknife The Annals of Statistics 7(1)1ndash26 doi 101214aos1176344552

Efron B amp Tibshirani R (2007) On testingthe significance of sets of genes The An-nals of Applied Statistics 1(1) 107ndash129 doi10121407-AOAS101

Efroni S Schaefer C F amp Buetow K H (2007)Identification of key processes underlying can-cer phenotypes using biologic pathway analy-sis PLoS One 2(5) e425 doi 101371jour-nalpone0000425

Ein-Dor L Kela I Getz G Givol D amp Do-many E (2005) Outcome signature genes inbreast cancer Is there a unique set Bioinfor-matics 21(2) 171ndash178 doi 101093bioinfor-maticsbth469

Ein-Dor L Zuk O amp Domany E (2006)Thousands of samples are needed to gen-erate a robust gene list for predicting out-come in cancer In Proceedings of the NationalAcademy of Sciences 103(15) 5923ndash5928 doi101073pnas0601231103

Emmert-Streib F amp Glazko G V (2011) Path-way analysis of expression data Decipheringfunctional building blocks of complex diseasesPLoS Computational Biology 7(5) e1002053doi 101371journalpcbi1002053

Ezkurdia I Juan D Rodriguez J M FrankishA Diekhans M Harrow J Tress M L(2014) Multiple evidence strands suggest thatthere may be as few as 19 000 human protein-coding genes Human Molecular Genetics23(22) 5866ndash5878 doi 101093hmgddu309

Fang Z Tian W amp Ji H (2011) A network-based gene-weighting approach for pathwayanalysis Cell Research 22(3) 565ndash580 doi101038cr2011149

Farfan F Ma J Sartor M A Michai-lidis G amp Jagadish H V (2012) THINKBack Knowledge-based interpretation of highthroughput data BMC Bioinformatics 13(Suppl2) S4 doi 1011861471-2105-13-S2-S4

Fisher R A (1925) Statistical methods for re-search workers Edinburgh Oliver amp Boyd

Fodor A A Tickle T L amp Richardson C (2007)Towards the uniform distribution of null P val-ues on Affymetrix microarrays Genome Biol-ogy 8(5) R69 doi 101186gb-2007-8-5-r69

Gao J Aksoy B A Dogrusoz U Dresdner GGross B Sumer Sur Schultz N (2013)Integrative analysis of complex cancer genomics

AnalyzingMolecularInteractions

82521

Current Protocols in Bioinformatics Supplement 61

and clinical profiles using the cBioPortal Sci-ence Signaling 6(269) pl1 doi 101126scisig-nal2004088

Gao, S., & Wang, X. (2007). TAPPA: Topological analysis of pathway phenotype association. Bioinformatics, 23(22), 3100–3102. doi: 10.1093/bioinformatics/btm460

Geistlinger, L., Csaba, G., Kuffner, R., Mulder, N., & Zimmer, R. (2011). From sets to graphs: Towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13), i366–i373. doi: 10.1093/bioinformatics/btr228

Glaab, E., Baudot, A., Krasnogor, N., & Valencia, A. (2010). TopoGSA: Network topological gene set analysis. Bioinformatics, 26(9), 1271–1272. doi: 10.1093/bioinformatics/btq131

Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., & Valencia, A. (2012). EnrichNet: Network-based gene set enrichment analysis. Bioinformatics, 28(18), i451–i457. doi: 10.1093/bioinformatics/bts389

Greenblum, S., Efroni, S., Schaefer, C., & Buetow, K. (2011). The PathOlogist: An automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1), 133. doi: 10.1186/1471-2105-12-133

Gu, Z., Liu, J., Cao, K., Zhang, J., & Wang, J. (2012). Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1), 56. doi: 10.1186/1752-0509-6-56

Haynes, W. A., Higdon, R., Stanberry, L., Collins, D., & Kolker, E. (2013). Differential expression analysis for pathways. PLoS Computational Biology, 9(3), e1002967. doi: 10.1371/journal.pcbi.1002967

Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., & DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology, 11(2), R23. doi: 10.1186/gb-2010-11-2-r23

Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. A., & Langlands, K. (2012). A topology-based score for pathway enrichment. Journal of Computational Biology, 19(5), 563–573. doi: 10.1089/cmb.2011.0182

Ihnatova, I., & Budinska, E. (2015). ToPASeq: An R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1), 350. doi: 10.1186/s12859-015-0763-1

Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12), 1667–1674. doi: 10.1093/bioinformatics/btr269

Jacob, L., Neuvial, P., & Dudoit, S. (2010). Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., ... Stein, L. (2005). REACTOME: A knowledgebase of biological pathways. Nucleic Acids Research, 33(Database issue), D428–D432. doi: 10.1093/nar/gki072

Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30. doi: 10.1093/nar/28.1.27

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361. doi: 10.1093/nar/gkw1092

Kelder, T., Conklin, B. R., Evelo, C. T., & Pico, A. R. (2010). Finding the right questions: Exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8), e1000472. doi: 10.1371/journal.pbio.1000472

Khatri, P., Draghici, S., Ostermeier, G. C., & Krawetz, S. A. (2002). Profiling gene expression using Onto-Express. Genomics, 79(2), 266–270. doi: 10.1006/geno.2002.6698

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2), e1002375. doi: 10.1371/journal.pcbi.1002375

Khatri, P., Voichita, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., ... Draghici, S. (2007). Onto-Tools: New additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue), W206–W211. doi: 10.1093/nar/gkm327

Kuffner, R., Petri, T., Windhager, L., & Zimmer, R. (2010). Petri nets with fuzzy logic (PNFL): Reverse engineering and parametrization. PLoS One, 5(9), e12807. doi: 10.1371/journal.pone.0012807

Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford University Press.

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdottir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi: 10.1093/bioinformatics/btr260

Ma, S., Jiang, T., & Jiang, R. (2014). Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4), 563–571. doi: 10.1093/bioinformatics/btu672

Martini, P., Sales, G., Massa, M. S., Chiogna, M., & Romualdi, C. (2013). Along signal paths: An empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1), e19. doi: 10.1093/nar/gks866

Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1), 121. doi: 10.1186/1752-0509-4-121

Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., ... Thomas, P. D. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl. 1), D284–D288. doi: 10.1093/nar/gki078

Mieczkowski, J., Swiatek-Machado, K., & Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS One, 7(7), e41541. doi: 10.1371/journal.pone.0041541

Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., & Mohamad, M. S. (2009). Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: A review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496–500. Piscataway, NJ: IEEE.

Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., ... Draghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4, 278. doi: 10.3389/fphys.2013.00278

Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., ... Groop, L. C. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. doi: 10.1038/ng1180

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541–580. doi: 10.1109/5.24143

Nam, D., & Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3), 189–197. doi: 10.1093/bib/bbn001

Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. (2016). Overcoming the matched-sample bottleneck: An orthogonal approach to integrate omic data. Nature Scientific Reports, 6, 29251. doi: 10.1038/srep29251

Nguyen, T., Mitrea, C., Tagett, R., & Draghici, S. (2017). DANUBE: Data-driven meta-ANalysis using UnBiased Empirical distributions–applied to biological pathway analysis. Proceedings of the IEEE, 105(3), 496–515. doi: 10.1109/JPROC.2015.2507119

Nguyen, T., Tagett, R., Donato, M., Mitrea, C., & Draghici, S. (2016). A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3), 409–416. doi: 10.1093/bioinformatics/btv588

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1), 29–34. doi: 10.1093/nar/27.1.29

Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25), 8961–8965. doi: 10.1073/pnas.0502674102

Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., & Evelo, C. (2008). WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), e184. doi: 10.1371/journal.pbio.0060184

Rahnenfuhrer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1055

Rhee, Y. S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7), 509–515. doi: 10.1038/nrg2363

Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., ... Sarkans, U. (2013). ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1), D987–D990. doi: 10.1093/nar/gks1174

Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases, 32(1-2), 51–63. doi: 10.1016/0021-9681(79)90012-2

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2), 211–217. doi: 10.1093/bioinformatics/btn592

Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., ... Buetow, K. H. (2009). PID: The pathway interaction database. Nucleic Acids Research, 37(Suppl. 1), D674–D679. doi: 10.1093/nar/gkn653

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407–426. doi: 10.1089/cmb.2008.0081

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1). doi: 10.2202/1544-6115.1483

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440–9445. doi: 10.1073/pnas.1530509100

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. doi: 10.1073/pnas.0506580102

Tan, P. K., Downey, T. J., Spitznagel, E. L., Jr., Xu, P., Fu, D., Dimitrov, D. S., ... Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19), 5676–5684. doi: 10.1093/nar/gkg763

Tarca, A. L., Bhatti, G., & Romero, R. (2013). A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS One, 8(11), e79217. doi: 10.1371/journal.pone.0079217

Tarca, A. L., Draghici, S., Bhatti, G., & Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1), 136. doi: 10.1186/1471-2105-13-136

Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., ... Romero, R. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1), 75–82. doi: 10.1093/bioinformatics/btn577

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., & Church, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285. doi: 10.1038/10343

Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: A review of existing computational approaches. Signal Processing Magazine, IEEE, 29(1), 35–50. doi: 10.1109/MSP.2011.943037

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12), i237–i245. doi: 10.1093/bioinformatics/btq182

Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131. Boca Raton, FL, USA, 12–15 December 2012. Piscataway, NJ: IEEE.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., ... Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470. doi: 10.1038/nature07509

Xia, J., & Wishart, D. S. (2010). MetPA: A web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18), 2342–2344. doi: 10.1093/bioinformatics/btq418

Xia, J., & Wishart, D. S. (2011). Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6), 743. doi: 10.1038/nprot.2011.319

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0–making metabolomics more meaningful. Nucleic Acids Research, 43(W1), W251–W257. doi: 10.1093/nar/gkv380

Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D.-G., Xie, W., ... Kuo, L. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Statistics in Biosciences, 4(1), 105–131. doi: 10.1007/s12561-011-9046-1

Internet Resources

https://www.biocarta.com

BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf

BLMA: A package for bi-level meta-analysis. R package version 1



Page 23: Network‐Based Approaches for ... - Bioinformatics LabNetwork-Based Approaches for Pathway UNIT 8.25 Level Analysis Tin Nguyen,1 Cristina Mitrea,2 and Sorin Draghici2,3 1Department

pathway BMC Systems Biology 4(1) 121 doihttpsdoiorg1011861752-0509-4-121

Mi H Lazareva-Ulitsky B Loo R KejariwalA Vandergriff J Rabkin S Thomas PD (2005) The PANTHER database of proteinfamilies subfamilies functions and pathwaysNucleic Acids Research 33(Suppl 1) D284ndashD288 doi 101093nargki078

Mieczkowski J Swiatek-Machado K amp Kamin-ska B (2012) Identification of pathwayderegulationndashgene expression based analysis ofconsistent signal transduction PLoS One 7(7)e41541 doi 101371journalpone0041541

Misman M F Deris S Hashim S Z M JumaliR amp Mohamad M S (2009) Pathway-basedmicroarray analysis for defining statistical sig-nificant phenotype-related pathways A reviewof common approaches In Information Man-agement and Engineering 2009 ICIMErsquo09International Conference on pages 496ndash500Piscataway NJ IEEE

Mitrea C Taghavi Z Bokanizad B HanoudiS Tagett R Donato M DraghiciS (2013) Methods and approaches in thetopology-based analysis of biological path-ways Frontiers in Physiology 4 278 doi103389fphys201300278

Mootha V K Lindgren C M Eriksson K-F Subramanian A Sihag S Lehar J Groop L C (2003) PGC-11α-responsivegenes involved in oxidative phosphorylationare coordinately downregulated in human di-abetes Nature Genetics 34(3) 267ndash273 doi101038ng1180

Murata T (1989) Petri nets Properties analy-sis and applications Proceedings of the IEEE77(4) 541ndash580 doi 101109524143

Nam D amp Kim S-Y (2008) Gene-set ap-proach for expression pattern analysis Brief-ings in Bioinformatics 9(3) 189ndash197 doi101093bibbbn001

Nguyen T Diaz D Tagett R amp Draghici S(2016) Overcoming the matched-sample bottle-neck An orthogonal approach to integrate omicdata Nature Scientific Reports 6 29251 doi101038srep29251

Nguyen T Mitrea C Tagett R amp Draghici S(2017) DANUBE Data-driven meta-ANalysisusing UnBiased Empirical distributionsmdashapplied to biological pathway analysis Pro-ceedings of the IEEE 105(3) 496ndash515 doi101109JPROC20152507119

Nguyen T Tagett R Donato M Mitrea Camp Draghici S (2016) A novel bi-level meta-analysis approach-applied to biological pathwayanalysis Bioinformatics 32(3) 409ndash416 doi101093bioinformaticsbtv588

Ogata H Goto S Sato K Fujibuchi WBono H amp Kanehisa M (1999) KEGGKyoto Encyclopedia of Genes and GenomesNucleic Acids Research 27(1) 29ndash34 doi101093nar27129

Pan K-H Lih C-J amp Cohen S N (2005) Ef-fects of threshold choice on biological conclu-

sions reached during analysis of gene expres-sion by DNA microarrays Proceedings of theNational Academy of Sciences of the UnitedStates of America 102(25) 8961ndash8965 doi101073pnas0502674102

Pico A R Kelder T van Iersel M P HanspersK Conklin B R amp Evelo C (2008)WikiPathways Pathway editing for the peoplePLoS Biology 6(7) e184 doi 101371jour-nalpbio0060184

Rahnenfuhrer J Domingues F S Maydt J ampLengauer T (2004) Calculating the statisticalsignificance of changes in pathway activity fromgene expression data Statistical Applicationsin Genetics and Molecular Biology 3(1) doi1022021544-61151055

Rhee Y S Wood V Dolinski K amp DraghiciS (2008) Use and misuse of the Gene Ontol-ogy annotations Nature Reviews Genetics 9(7)509ndash515 doi 101038nrg2363

Rustici G Kolesnikov N Brandizi M Bur-dett T Dylag M Emam I Sarkans U(2013) ArrayExpress updatendashtrends in databasegrowth and links to data analysis tools Nu-cleic Acids Research 41(D1) D987ndashD990 doi101093nargks1174

Sackett D L (1979) Bias in analytic researchJournal of Chronic Diseases 32(1-2) 51ndash63doi 1010160021-9681(79)90012-2

Sandve G K Nekrutenko A Taylor J amp HovigE (2013) Ten simple rules for reproduciblecomputational research PLoS ComputationalBiology 9(10) e1003285 doi 101371jour-nalpcbi1003285

Sartor M A Leikauf G D amp Medvedovic M(2009) LRpath A logistic regression approachfor identifying enriched biological groups ingene expression data Bioinformatics 25(2)211ndash217 doi 101093bioinformaticsbtn592

Schaefer C F Anthony K Krupa S Buchoff JDay M Hannay T Buetow K H (2009)PID The pathway interaction database NucleicAcids Research 37(Suppl 1) D674ndashD679 doi101093nargkn653

Shojaie A amp Michailidis G (2009) Analysis ofgene sets based on the underlying regulatory net-work Journal of Computational Biology 16(3)407ndash426 doi 101089cmb20080081

Shojaie A amp Michailidis G (2010) Networkenrichment analysis in complex experimentsStatistical Applications in Genetics and Molec-ular Biology 9(1) doi 1022021544-61151483

Storey J D amp Tibshirani R (2003) Statisticalsignificance for genomewide studies Proceed-ings of the National Academy of Sciences of theUnited States of America 100(16) 9440ndash9445doi 101073pnas1530509100

Subramanian A Tamayo P Mootha V KMukherjee S Ebert B L Gillette M A Mesirov J P (2005) Gene set enrichmentanalysis A knowledge-based approach for inter-preting genome-wide expression profiles Pro-ceeding of The National Academy of Sciences of

AnalyzingMolecularInteractions

82523

Current Protocols in Bioinformatics Supplement 61

the Unites States of America 102(43) 15545ndash15550 doi 101073pnas0506580102

Tan P K Downey T J Spitznagel E L Jr XuP Fu D Dimitrov D S Cam M C(2003) Evaluation of gene expression measure-ments from commercial microarray platformsNucleic Acids Research 31(19) 5676ndash5684doi 101093nargkg763

Tarca A L Bhatti G amp Romero R (2013)A comparison of gene set analysis meth-ods in terms of sensitivity prioritization andspecificity PloS One 8(11) e79217 doi101371journalpone0079217

Tarca A L Draghici S Bhatti G ampRomero R (2012) Down-weighting over-lapping genes improves gene set analy-sis BMC Bioinformatics 13(1) 136 doihttpsdoiorg1011861471-2105-13-136

Tarca A L Draghici S Khatri P HassanS S Mittal P Kim J-S Romero R(2009) A novel signaling pathway impact anal-ysis (SPIA) Bioinformatics 25(1) 75ndash82 doi101093bioinformaticsbtn577

Tarca A L Draghici S Khatri P Hassan S SMittal P Kim J-S Romero R (2009) Anovel signaling pathway impact analysis Bioin-formatics 25(1) 75ndash82 doi 101093bioinfor-maticsbtn577

Tavazoie S Hughes J D Campbell M J ChoR J amp Church G M (1999) Systematic de-termination of genetic network architecture Na-ture Genetics 22 281ndash285 doi 10103810343

Varadan V Mittal P Vaske C J amp Benz SC (2012) The integration of biological path-way knowledge in cancer genomics A reviewof existing computational approaches SignalProcessing Magazine IEEE 29(1) 35ndash50 doi101109MSP2011943037

Vaske C J Benz S C Sanborn J Z Earl DSzeto C Zhu J Stuart J M (2010) Infer-ence of patient-specific pathway activities frommulti-dimensional cancer genomics data us-

ing PARADIGM Bioinformatics 26(12) i237ndashi245 doi 101093bioinformaticsbtq182

Voichita C Donato M amp Draghici S (2012) In-corporating gene significance in the impact anal-ysis of signaling pathways In Machine Learningand Applications (ICMLA) 2012 11th Interna-tional Conference on volume 1 pages 126ndash131Boca Raton FL USA 12ndash15 December 2012Piscataway NJ IEEE

Wang E T Sandberg R Luo S Khrebtukova IZhang L Mayr C Burge C B (2008)Alternative isoform regulation in human tis-sue transcriptomes Nature 456(7221) 470 doi101038nature07509

Xia J amp Wishart D S (2010) MetPAA web-based metabolomics tool for path-way analysis and visualization Bioinformatics26(18) 2342ndash2344 doi 101093bioinformat-icsbtq418

Xia J amp Wishart D S (2011) Web-based infer-ence of biological patterns functions and path-ways from metabolomic data using Metabo-Analyst Nature Protocols 6(6) 743 doi101038nprot2011319

Xia J Sinelnikov I V Han B ampWishart D S (2015) MetaboAnalyst 30mdashmaking metabolomics more meaningful Nu-cleic Acids Research 43(W1) W251ndashW257doi 101093nargkv380

Zhao Y Chen M-H Pei B Rowe D Shin D-G Xie W Kuo L (2012) A Bayesian ap-proach to pathway analysis by integrating genendashgene functional directions and microarray dataStatistics in Biosciences 4(1) 105ndash131 doi101007s12561-011-9046-1

Internet Resources

https://www.biocarta.com
BioCarta: Charting Pathways of Life.

https://bioconductor.org/packages/release/bioc/manuals/BLMA/man/BLMA.pdf
BLMA: A package for bi-level meta-analysis. R package version 1.

