Bioinformatics approaches to biomarker and drug discovery in aging and disease by Kristen Fortney A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Medical Biophysics University of Toronto © Copyright by Kristen Fortney 2012

Bioinformatics approaches to biomarker and drug

discovery in aging and disease

by

Kristen Fortney

A thesis submitted in conformity with the requirements

for the degree of Doctor of Philosophy

Graduate Department of Medical Biophysics

University of Toronto

© Copyright by Kristen Fortney 2012

Bioinformatics approaches to biomarker and drug discovery in aging and disease

Kristen Fortney

Doctor of Philosophy

Graduate Department of Medical Biophysics

University of Toronto

2012

Abstract

Over the past two decades, high-throughput (HTP) technologies such as microarrays and mass

spectrometry have fundamentally changed the landscape of aging and disease biology. They have

revealed novel molecular markers of aging, disease state, and drug response. Some have been

translated into the clinic as tools for early disease diagnosis, prognosis, and individualized

treatment and response monitoring. Despite these successes, many challenges remain: HTP

platforms are often noisy and suffer from false positives and false negatives; optimal analysis

and successful validation require complex workflows; and the underlying biology of aging and

disease is heterogeneous and complex. Integrative computational biology can help

address these challenges by creating new analytical methods and software tools that leverage

the large and diverse body of publicly available HTP data.

In this thesis I report on four projects that develop and apply strategies from integrative

computational biology to identify improved biomarkers and therapeutics for aging and disease.

In Chapter 2, I proposed a new network analysis method to identify gene expression biomarkers

of aging, and applied it to study the pathway-level effects of aging and infer the functions of

poorly-characterized longevity genes. In Chapter 4, I adapted gene-level HTP chemogenomic

data to study drug response at the systems level; I connected drugs to pathways, phenotypes and

networks, and built the NetwoRx web portal to make these data publicly available. And in

Chapters 3 and 5, I developed a novel meta-analysis pipeline to identify new drugs that mimic

the beneficial gene expression changes seen with calorie restriction (Chapter 3), or that reverse

the pathological gene changes associated with lung cancer (Chapter 5).

The projects described in this thesis will help provide a systems-level understanding of the

causes and consequences of aging and disease, as well as new tools for diagnosis (biomarkers)

and treatment (therapeutics).

Acknowledgments

It is a pleasure to thank the many people who helped make this work possible.

Chief among these is my supervisor, Dr. Igor Jurisica. Throughout my degree, he has provided

exceptional guidance and support. The exciting and collaborative environment of the Jurisica lab

has been a great place to learn and develop as a scientist.

I would like to thank my Supervisory Committee members Drs. Elisabeth Tillier and Thomas

Kislinger for their advice and encouragement over the course of my graduate program.

I owe much to the help and support of my talented colleagues in the Jurisica lab and at the

University. Special thanks to Dr. Max Kotlyar for several years of fun and productive scientific

collaborations. I would also like to thank my other wonderful collaborators on the projects in this

thesis: Josh Griesman, Wing Xie, Dr. Eric Morgen, and Yulia Kotseruba. In addition, I’m

grateful to Abraham, Fiona, Kevin, and Marc for reading parts of this thesis and offering

valuable comments, Christian for keeping the cluster running, and Daniela and Sara for keeping

MaRS occupied late into the night. I thank the entire Jurisica lab for their great advice and

companionship over the years.

Finally, I would like to thank my family and friends for their support, optimism, and

encouragement throughout my studies.

Table of Contents

Acknowledgments

Table of Contents

Table of Figures

1 Introduction

1.1 High-throughput technologies for aging and disease

1.2 The challenges facing high-throughput biology

1.2.1 NOISE AND HETEROGENEITY

1.2.2 ANALYSIS

1.3 How integrative computational biology can address these challenges

1.3.1 DATA INTEGRATION

1.3.2 NETWORK ANALYSIS

1.4 Research contributions of this thesis

2 Inferring the functions of longevity genes with modular subnetwork biomarkers of Caenorhabditis elegans aging

2.1 Abstract

2.2 Introduction

2.2.1 Methods for extracting active subnetworks by integrating gene expression data, network connectivity, and supervised class labels

2.3 Results and Discussion

2.3.1 Identifying active subnetworks in aging by trading off network modularity and class relevance

2.3.2 Identifying modular subnetworks

2.3.3 Class relevance R

2.3.4 Network modularity M

2.3.5 Comparing regular and modular subnetworks

2.3.6 Modular subnetworks are more robust across studies than regular subnetworks

2.3.7 Modular subnetworks trained on aging gene expression data from wild-type worms successfully predict age in fer-15 worms

2.3.8 Subnetworks vs. genes

2.3.9 Modular vs. regular subnetworks

2.3.10 The role of the modularity coefficient in machine learning

2.3.11 Modular subnetworks predict wild-type worm age with low mean-squared error

2.3.12 Longevity genes play crucial roles in significant subnetworks

2.3.13 Significant subnetworks are enriched for known longevity genes

2.3.14 Examples of significant subnetworks containing known longevity genes

2.3.15 Modular subnetworks participate in many different age-related biological processes

2.3.16 Modular subnetworks can be used to annotate longevity genes with novel functions

2.4 Conclusions

2.5 Materials and Methods

2.5.1 Code

2.5.2 Data sets

2.5.3 Subnetwork analyses

2.5.4 Machine learning comparisons

2.5.5 GO and KEGG enrichment analyses

2.6 Abbreviations

2.7 Supplementary Materials

2.8 Acknowledgments

3 In silico drug screen in mouse liver identifies candidate calorie restriction mimetics

3.1 Abstract

3.2 Introduction

3.3 Materials and Methods

3.3.1 Code

3.3.2 Drug-drug interaction network

3.3.3 Acquiring transcriptional signatures of calorie restriction

3.3.4 Connectivity map analysis of CR signatures

3.3.5 Meta-analysis of drug-response data

3.4 Results

3.4.1 Transcriptional signatures of CR

3.4.2 Meta-analysis identified fourteen candidate CR mimetics

3.5 Discussion

3.6 Acknowledgments

4 NetwoRx: Connecting drugs to networks and phenotypes in S. cerevisiae

4.1 Abstract

4.2 Introduction

4.3 Materials and Methods

4.3.1 NetwoRx methods

4.3.2 Use-case methods

4.4 NetwoRx content and functionality

4.4.1 Database contents

4.4.2 Accessing data

4.5 NetwoRx use case examples

4.5.1 Retrieving drugs that perturb phenotypes: oxidative stress

4.5.2 Focused searches identify drugs with shared mode of action: drugs that target the same DNA damage pathways as Cisplatin

4.5.3 Bipartite networks reveal that some gene sets are druggable hubs

4.5.4 Clustering the drug-pathway matrix identifies drug modules that share modes of action

4.5.5 User-defined gene sets: identifying new drugs that modulate yeast chronological aging

4.6 Discussion

4.7 Acknowledgements

5 Computationally repurposing drugs for lung cancer with CMapBatch: candidate therapeutics from an integrative meta-analysis of cancer gene signatures and chemogenomic data

5.1 Abstract

5.2 Background

5.3 Results and discussion

5.3.1 CMapBatch meta-analysis strategy: From individual cancer gene signatures to candidate therapeutics

5.3.2 Candidate drugs identified via CMapBatch are more conserved across signature subsets than candidate drugs identified from single gene signatures

5.3.3 Characterizing and prioritizing candidate lung cancer therapeutics

5.3.4 Candidate therapeutics inhibit growth in nine NSCLC cell lines

5.3.5 Prioritizing drugs by structural similarity: eleven significant drugs are highly structurally similar to TOP drugs

5.3.6 Prioritizing drugs by shared target: twenty-eight significant drugs share a protein target with one or more TOP drugs

5.3.7 Common protein targets of significant drugs

5.3.8 Significant drugs are broad-acting: they affect more genes than other drugs

5.3.9 Many drugs are indicated for lung cancer independently of subtype

5.4 Conclusions

5.5 Methods

5.5.1 Code and software

5.5.2 Data sources

5.5.3 Connectivity map analysis of lung cancer signatures

5.5.4 Meta-analysis of drug-response data

5.5.5 NCI-60 analysis of significant drugs

5.6 Acknowledgements

5.7 Supplementary Material

6 General conclusions and significance

6.1 Conclusions

6.1.1 From genes to pathways, phenotypes, and networks

6.1.2 Integrating complementary HTP data sources

6.2 Open questions and future work

6.2.1 Limitations in HTP data

6.2.2 Future work

7 References

Table of Figures

Figure 2-1. High-scoring subnetworks fulfill two criteria: they are modular and related to aging.

Figure 2-2. Identifying modular subnetworks.

Figure 2-3. Modular subnetworks are highly conserved across studies.

Figure 2-4. Predicting worm age using machine learning.

Figure 2-5. Subnetworks and genes predict the age of fer-15 worms.

Figure 2-6. Modular subnetwork biomarkers of aging predict the age of individual wild-type worms.

Figure 2-7. Some examples of significant longevity subnetworks.

Figure 3-1. Fourteen drug treatments significantly mimic the effects of CR on hepatic gene expression.

Figure 4-1. Gene set analysis of chemogenomic data.

Figure 4-2. Drugs that perturb oxidative stress pathways.

Figure 4-3. Mode of action analysis of the chemotherapeutic cisplatin.

Figure 4-4. Bipartite network showing all connections between drugs and YEASTRACT targets of transcription factors.

Figure 4-5. Drug module identified by clustering the matrix of drug-drug similarity scores.

Figure 4-6. Drugs predicted by NetwoRx to modulate yeast chronological lifespan.

Figure 5-1. CMapBatch meta-analysis pipeline.

Figure 5-2. CMapBatch produces more stable lists of significant drugs than individual gene signatures.

Figure 5-3. Drug candidates inhibit growth in lung cancer cell lines more than other Connectivity Map drugs.

Figure 5-4. Prioritizing drug candidates with GI50 values and chemical structures.

Figure 5-5. Significant drugs share many protein targets.

Figure 5-6. Significant drugs affect more genes than other Connectivity Map drugs.

Abbreviations

CGH Comparative genomic hybridization

CR Calorie restriction

FDR False discovery rate

GSEA Gene set enrichment analysis

GSS Gene set score

HTP High-throughput

KS Kolmogorov-Smirnov

MoA Mode of action

MSE Mean-squared error

PCR Polymerase chain reaction

PPI Protein-protein interaction

SCC Squared correlation coefficient

SVR Support vector regression

RNAi RNA interference

Databases and resources used in research chapters

Biological pathway annotations and related data

Gene Ontology (GO)

http://www.geneontology.org/

Human Ageing Genomic Resources (HAGR)

http://genomics.senescence.info/

Kyoto Encyclopedia of Genes and Genomes (KEGG)

http://www.genome.jp/kegg/pathway.html

Saccharomyces Genome Database

http://www.yeastgenome.org/

Wormbase

http://www.wormbase.org/

YEASTRACT

http://www.yeastract.com/

Experimental data

Functional interactions

Wormnet

http://www.functionalnet.org/wormnet/

Gene expression

Gene Expression Omnibus (GEO)

http://www.ncbi.nlm.nih.gov/geo/

Cancer Data Integration Portal (CDIP)

http://ophid.utoronto.ca/cdip

Connectivity Map (CMap)

http://www.broadinstitute.org/cmap/

Oncomine

https://www.oncomine.org/

Network visualization software

NAViGaTOR

http://ophid.utoronto.ca/navigator/

1 Introduction

Parts of this chapter are based on:

Kristen Fortney and Igor Jurisica (2011). Integrative computational biology for cancer research.

Human Genetics 130(4):465-81.

1.1 High-throughput technologies for aging and disease

Since the commercialization of DNA microarray technology in the late 1990s, high-throughput

(HTP) data relevant to aging and disease have been accumulating at an increasing rate. For

example, there have been over 80 published large-scale gene expression studies tracking the

transcriptional changes that occur with aging in humans and model organisms [1]. These data

have led to crucial insights into aging biology relevant to human health, including tissue-specific

aging [2, 3] and mechanisms of lifespan extension through dietary restriction or pharmacological

intervention [4-6]. In cancer research, HTP technologies have successfully been applied to help

elucidate the mechanisms of tumorigenesis, metastasis, and drug resistance [7]. They have also

had enormous clinical impact, e.g., several cancers can now be split into therapeutic subsets with

unique prognostic outcomes based on their molecular phenotypes [8-14].

Despite these successes, many challenges remain. For example, in differential expression

analyses of aging microarray data, few if any genes or biological processes have been identified

as age-regulated across multiple species [3, 15]; in genome-wide association studies of human

longevity (e.g., comparing centenarians to normal individuals), most longevity SNPs identified

to date are not robust across studies or populations [16-18]. And in studies of disease, predictive

and prognostic biomarkers derived from HTP transcriptomic or proteomic data are notoriously

inconsistent from study to study (i.e., they show poor overlap), and often cannot be validated by

other methods or in new cohorts of patients [19-21].

Aging and disease are highly complex; more and higher-throughput data do not immediately

translate into a better understanding of their mechanisms. The challenges to HTP data

interpretation fall under two main categories. The first is noise: HTP platforms are inherently

noisy (experimental noise) – results vary substantially from run to run and from lab to lab, and

are prone to false positives and negatives – and there is substantial heterogeneity (biological

noise) in the systems we wish to model. The second challenge is analysis: simple methods – such

as univariate analyses of microarray data – often miss much of the signal in the data.

Integrative computational methods will continue to play a central role in addressing these

challenges. We need new analytical methods to identify complex signals in different data

sources, as well as to combine information from diverse experimental platforms and other

sources that offer different perspectives on the problem, e.g., gene and protein expression,

protein-protein interactions and pathways, chromosomal aberrations, mutation events, epigenetic

changes, and clinical information from drug trials and the bedside. These methods will depend

on advances in many areas such as statistics, knowledge representation and ontology, machine

learning, data mining, graph theory, and visualization.

Below I review some of the major challenges to the interpretation of HTP data relevant to aging

and disease and give examples of integrative computational methods and strategies that have

helped to meet them. Next I outline how my four research projects, Chapters 2 through 5,

address some of these challenges. My projects focus on identifying biomarkers and therapeutics

for aging and disease, two tasks of particular concern to translational medicine. For these

projects, I apply and integrate multiple HTP data sources, including RNA microarray data of

aging, disease state, and drug response; protein-protein and functional interaction networks;

RNAi assays; DNA barcode chemogenomic screens; and large-scale in vitro growth inhibition

drug screens performed on human cell lines.

1.2 The challenges facing high-throughput biology

A wealth of genomic and proteomic data is now available from HTP screens. While these data

have improved our understanding of aging and disease and some have even translated into

improved patient diagnosis and treatment, significant challenges remain. In this section I briefly

review some of the major obstacles to progress.

1.2.1 NOISE AND HETEROGENEITY

Aging and disease cause heterogeneous changes, and HTP platforms are noisy. Consequently,

there can be large variation in results from lab to lab [22, 23]; methods are needed to circumvent

this noise and to integrate different data sources to make them more reliable. The main noise

issues that plague HTP research are false negatives and false positives, biological heterogeneity,

and platform bias.

1.2.1.1 False negatives and false positives in HTP data

Many HTP screens suffer from noise – leading to both false positives and false negatives –

which must be resolved through complementary experiments and computational analysis [24,

25]. For example, while HTP protein-protein interaction screens can identify thousands of

protein interactions at once, they do so at the cost of either high false discovery rates or poor

sensitivity. The false discovery rate is the proportion of detected interactions that are false, and

sensitivity is the proportion of true interactions that are successfully detected. When interactions

detected by two HTP studies were tested in small-scale screens, they were found to have false

discovery rates of 22% [26] and 38% [27]. An evaluation of five HTP methods found that their

sensitivity rates ranged from only 21% to 36% at false discovery rates of 0-11% [28]. Similarly,

mass spectrometry analyses of human serum typically produce many false negatives [29]. One

problem is that human serum has a high dynamic range – protein concentrations are estimated to

vary over 10 orders of magnitude [29]. Mass spectrometers have a much smaller range of

detection, leading to false negatives: low abundance proteins are not detected. This challenge

may be somewhat diminished by extensive sample fractionation [30].
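
As a minimal illustration of the two definitions given above (false discovery rate and sensitivity), the following Python sketch computes both from confusion counts. The function names and the example counts are hypothetical and chosen only for arithmetic clarity; they are not taken from the cited studies.

```python
def false_discovery_rate(true_positives: int, false_positives: int) -> float:
    """Proportion of detected interactions that are false."""
    detected = true_positives + false_positives
    if detected == 0:
        raise ValueError("no interactions were detected")
    return false_positives / detected


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Proportion of true interactions that are successfully detected."""
    real = true_positives + false_negatives
    if real == 0:
        raise ValueError("no true interactions exist")
    return true_positives / real


# Hypothetical screen (made-up numbers): 1000 true interactions exist;
# the screen reports 400 interactions, of which 300 are real and 100 are not.
tp, fp, fn = 300, 100, 700

fdr = false_discovery_rate(tp, fp)  # 100 / (300 + 100)
sens = sensitivity(tp, fn)          # 300 / (300 + 700)

print(f"FDR = {fdr:.2f}, sensitivity = {sens:.2f}")
```

In this toy case a quarter of the reported interactions are false while 70% of the real interactions are missed, which shows how a screen can have a moderate false discovery rate and still have poor sensitivity.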

1.2.1.2 Biological heterogeneity

Tumor heterogeneity. Often only a single sample from each tumor is available for analysis. But

tumors are highly heterogeneous, so these samples may not be representative of the whole tumor

[31-33]. Tumors comprise cells belonging to several distinct subpopulations – for example,

tumor regions can be hypoxic to different extents, or be made up of different proportions of

tumor initiating cells (cancer stem cells) – and these differences have consequences for

predicting drug response and prognosis [32-34]. Intra-tumor heterogeneity can lead to different

cell populations expressing different levels of protein, or having different mutations and copy

number alterations, which complicates analyses. For example, variable tumor epithelial and

stromal cell content in breast tumor samples can significantly affect gene expression profiles and

signature accuracy [35, 36]. Some of these difficulties can be alleviated with techniques such as

laser-capture micro-dissection [37], which can isolate regions of a sample that contain a more

uniform population of cells, but these techniques remain costly and slow. In addition, they may

not result in a sufficient amount of material for follow-up experiments. Though single samples

from tumors can suffice for population-level studies [33], the fact that two samples from the

same patient can be quite different means that this heterogeneity poses a significant challenge to

personalized medicine. On top of that, in cancer studies there are issues with the control samples

available for analysis: often these “normal” samples come from tissue directly adjacent to the

tumor. The properties of these samples, such as gene expression profiles, may be quite different

from those of more distant tissue or of a healthy patient [38].

Transcriptional noise in aging. Variability in the expression of single genes – also called gene

expression noise – increases with age. Cell-level expression variability increases with age in

mouse heart [39] (assessed using single cell PCR), and population-level variability increases in

rat retina [40] and human kidney, skeletal muscle, and brain [41] (assessed using microarrays),

though not for every single gene. Aging-induced transcriptional noise is thought to result from

the accumulation of random nuclear DNA damage with age [39]. As a consequence of this noise,

the signal in HTP aging data can be difficult to detect by conventional analysis methods [42].
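
One simple way to quantify the expression variability described above is the coefficient of variation (standard deviation divided by the mean) of a gene across samples within an age group. The sketch below is an assumption-laden illustration: the expression values are invented, the helper name is hypothetical, and the cited studies may have used different variability statistics.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (mean must be nonzero)."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean expression is zero")
    return stdev(values) / m


# Invented expression values for one hypothetical gene, one value per sample.
# Both groups have the same mean, so only the spread differs.
young = [10.0, 11.0, 9.0, 10.0]
old = [10.0, 14.0, 6.0, 10.0]

cv_young = coefficient_of_variation(young)
cv_old = coefficient_of_variation(old)

print(f"CV young = {cv_young:.3f}, CV old = {cv_old:.3f}")
```

Because the two groups share the same mean, a comparison of means would report no difference for this gene even though its variability is larger in the old group, which is one reason such changes can be hard to detect with conventional differential expression analysis.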

1.2.1.3 Different experimental platforms can disagree

Experimental platforms produced by different companies can yield conflicting results [43-45].

For example, different microarray platforms use different probe design and labeling and have

different dynamic ranges. A gene may be overexpressed in disease on one platform, yet

underexpressed on another, simply because the two platforms use different DNA sequences to “probe”

the same gene. On a given platform, some genes are represented by many probes while others are

represented by one or none, and this representation differs between platforms. Genes that are expressed

at low levels are particularly problematic for concordance across array platforms [46]. Another

problem is that genome annotations continue to grow and change [47]. Updated probe set

definitions can substantially affect the number and the identity of differentially expressed genes

[48].

1.2.2 ANALYSIS

A primary goal of integrative computational biology analysis in aging and disease is to identify

small groups of genes/proteins/microRNAs etc. that can be used to improve diagnosis, predict

survival or predict treatment response, i.e., to identify prognostic or predictive biomarkers. Their

identification in HTP data is challenging since basic analysis methods fail to capture the entire

signal in the data, and good signatures do not consist solely of the most differentially expressed

molecules. For this section we will focus our attention on gene expression microarray studies

(they have been the most extensively studied), but the criticisms apply equally well to similar

experimental designs, such as those using HTP protein or microRNA assays.


1.2.2.1 Lists of differentially expressed genes show poor overlap across studies

The most popular way of analyzing microarray data is to identify individual genes that are

differentially expressed in one condition vs. in another, e.g., in non-responders vs. responders to

some drug treatment. Differentially expressed genes are widely used in aging and disease

research, e.g. as biomarkers for diagnosis, prognosis, and drug response. Unfortunately, there are

major challenges with their identification and interpretation. Previous work has shown that lists

of differentially expressed genes are poorly reproduced across studies [49, 50]; even random

subsets of samples from one experiment can yield widely divergent gene lists. This problem is

caused by high dimensionality, small numbers of samples, and noise (biological and technical

variability), but can be exacerbated by the analysis method. For example, analyses that quantify

the differential expression of gene groups rather than individual genes show higher conservation

across platforms and studies [51].
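
The instability of gene lists described above can be illustrated with a small simulation. The following Python sketch is not taken from any of the cited studies: it assumes Gaussian noise, a fixed mean shift in a handful of truly changed genes, and a Welch t-statistic for ranking, and all function names and parameter values are hypothetical. It draws two random subsets of samples from one simulated experiment and reports the fraction of top-ranked genes that the two subsets share.

```python
import random
import statistics


def t_statistic(x, y):
    """Welch t-statistic between two lists of expression values."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / ((vx / len(x) + vy / len(y)) ** 0.5)


def top_genes(cases, controls, k):
    """Indices of the k genes with the largest |t|; each sample is a list of gene values."""
    n_genes = len(cases[0])
    scores = [
        abs(t_statistic([s[g] for s in cases], [s[g] for s in controls]))
        for g in range(n_genes)
    ]
    return set(sorted(range(n_genes), key=lambda g: -scores[g])[:k])


def simulate(n_samples, n_genes, n_true, effect, rng):
    """Gaussian noise for every gene, plus a mean shift in the first n_true genes of the cases."""
    controls = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
    cases = [
        [rng.gauss(effect if g < n_true else 0, 1) for g in range(n_genes)]
        for _ in range(n_samples)
    ]
    return cases, controls


def subset_overlap(cases, controls, subset_size, k, rng):
    """Fraction of top-k genes shared by two random sample subsets of one experiment."""
    first = top_genes(rng.sample(cases, subset_size), rng.sample(controls, subset_size), k)
    second = top_genes(rng.sample(cases, subset_size), rng.sample(controls, subset_size), k)
    return len(first & second) / k


rng = random.Random(1)  # hypothetical seed and parameters, chosen only for illustration
cases, controls = simulate(n_samples=20, n_genes=200, n_true=10, effect=1.0, rng=rng)
print(subset_overlap(cases, controls, subset_size=10, k=20, rng=rng))
```

The shared fraction depends on the random seed and on the assumed effect size, so no particular value should be read into a single run; repeating the call over many seeds gives a distribution of overlaps for a given study design.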

1.2.2.2 The most differentially expressed genes do not yield the best signatures

Very often in microarray experiments, the most differentially expressed genes are used to

construct prognostic or predictive signatures: machine learning methods are trained to use the

expression levels of those genes to predict, for example, the disease state of a patient or the

probability of survival. The problem with this approach is that single-gene analyses overlook

multivariate effects. More sophisticated analyses are needed to identify sets of genes that

complement one another, i.e., ones whose combined expression levels yield the best-performing

signatures. Such analyses show that genes essential to a good prognostic signature are often not

highly differentially expressed on their own [52, 53].


1.2.2.3 Signatures validate poorly on other datasets

One of the most important contributions of HTP biology to disease and aging research has been

to develop prognostic and predictive signatures. Unfortunately, as with differentially expressed

genes, many signatures have failed to validate by other methods or in new cohorts of patients

[19, 54]. Existing prognostic and predictive biomarkers for the same condition overlap only

partially, and the set of biomarker genes identified depends strongly on the subset of patients

used to generate it [55]. One study estimated that to achieve 50% overlap in prognostic gene sets

for breast cancer patients would require several thousand samples [56]. Several factors contribute

to this problem, including: 1) diverse patients and heterogeneous samples, 2) different profiling

platforms, 3) diverse statistical and bioinformatics approaches to biomarker identification [57],

4) an insufficient number of samples [56], and 5) the existence of multiple equivalent signatures

[55, 58].

1.3 How integrative computational biology can address these

challenges

The field of integrative computational biology uses techniques from computer science,

mathematics, physics and engineering to comprehensively analyze and interpret biological data.

Through the creation of new analysis and visualization methods, software tools and databases, it

can help diminish the challenges to HTP biology. Here we present some successful applications

of integrative computational biology to understanding aging and treating disease, drawing

examples from the databases and algorithms central to the work in this thesis. These applications

fall under two main categories: data integration and network analysis.


1.3.1 DATA INTEGRATION

As we have seen, noise in HTP studies can arise from both biological and technical

variability. One of the most effective strategies for reducing both types of noise is data

integration. The idea is simple: we can be more confident about the result of an experiment if

similar experiments yielded similar results. We can integrate different experiments that measure

the same biological entity, such as microarray studies measuring tumor vs. normal gene

expression differences on different experimental platforms. We can also integrate different data

types, such as mutation, expression, and proteomic data. Clearly, data integration can increase

our confidence in results that are consistent across multiple studies and experimental modalities.

But data integration can also increase sensitivity, since different platforms and methods exhibit

different biases – e.g., protein interactions may be undetectable by some methods.

1.3.1.1 Integrating the same type of data across multiple platforms and studies

With microarray and similar data, small sample numbers and different experimental platforms

can lead to highly variable results. These problems can be addressed by combining data from

different studies and platforms, which increases the effective number of samples and helps

control for inter-platform heterogeneity.

Most approaches to integrating microarray data can be divided into two general classes: pooling

and meta-analysis. In pooling, multiple expression datasets are merged into a single dataset [59,

60]; typically, gene measurements from each separate study are transformed before pooling to

make the experiments more comparable [61]. Previous work found that pooling six breast cancer

datasets (over 900 samples in total) yielded better-performing signatures [59]. In contrast, for a


meta-analysis, statistics are computed for each dataset separately and then combined. Meta-

analyses identify gene changes that are seen consistently across many studies [7]. Meta-analyses

have recently identified core sets of genes commonly regulated with age [42] or with calorie

restriction [62] across multiple species. Several databases gather information from multiple

studies to facilitate meta-analysis, including, for aging research, Gene Aging Nexus [63] and the

Human Ageing Genomic Resources [64], and for cancer research, Oncomine [65], GeneSigDB

[66], and the Cancer Data Integration Portal (CDIP; http://ophid.utoronto.ca/cdip).

One example of a successful meta-analytic strategy is the Rank Product method [67-69], which

we apply in Chapters 3 and 5. To identify (for example) genes that are consistently up-regulated

across studies in disease vs. normal samples, this method proceeds in three steps: 1) within each

study, rank all genes from most to least up-regulated (using some criterion, such as the

t-statistic); 2) compute the product of ranks for each gene across all studies; 3) generate a

background distribution of randomized Rank Products by permuting expression values for genes

in each array and repeating (1)-(2) many times.

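Formally, say that there are G genes, K studies, and B randomizations. Let $r_{gk}$ be the rank of

gene g in the kth study. Then the Rank Product of gene g is:

$$RP_g = \left( \prod_{k=1}^{K} r_{gk} \right)^{1/K}$$

Permuting expression values and repeating calculations yields randomized Rank Products $RP^{*}_{gb}$;

we can then assign a P value to each gene g as:

$$P_g = \frac{1}{GB} \sum_{b=1}^{B} \sum_{g'=1}^{G} I\left( RP^{*}_{g'b} \le RP_g \right)$$

where $I(\cdot)$ is the indicator function.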

We can then correct for multiple testing using, for example, the False Discovery Rate (FDR)

procedure.
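
The Rank Product calculation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the Rank Product is taken to be the geometric mean of a gene's ranks across studies, the P value is the fraction of randomized Rank Products that are at most the observed one, and, as a simplification, the null distribution is generated by shuffling each study's gene-level scores rather than by permuting raw expression values within arrays. The function names are hypothetical.

```python
import bisect
import math
import random


def ranks_descending(scores):
    """1-based ranks; rank 1 goes to the largest score (most up-regulated gene)."""
    order = sorted(range(len(scores)), key=lambda g: -scores[g])
    ranks = [0] * len(scores)
    for position, gene in enumerate(order, start=1):
        ranks[gene] = position
    return ranks


def rank_products(score_table):
    """score_table[k][g] is the score of gene g in study k; returns one RP per gene."""
    n_studies = len(score_table)
    per_study_ranks = [ranks_descending(study) for study in score_table]
    # Geometric mean of the ranks of each gene across the K studies.
    return [
        math.prod(r[g] for r in per_study_ranks) ** (1.0 / n_studies)
        for g in range(len(score_table[0]))
    ]


def rank_product_pvalues(score_table, n_permutations=1000, seed=0):
    """Permutation P values: fraction of the G*B randomized RPs that are <= the observed RP."""
    rng = random.Random(seed)
    observed = rank_products(score_table)
    null = []
    for _ in range(n_permutations):
        # Assumption: shuffle gene-level scores within each study to break
        # the correspondence between genes across studies.
        shuffled = [rng.sample(study, len(study)) for study in score_table]
        null.extend(rank_products(shuffled))
    null.sort()
    return [bisect.bisect_right(null, rp) / len(null) for rp in observed]
```

A gene ranked first in every study has a Rank Product of exactly 1, the smallest possible value, and therefore receives the smallest P value. The resulting P values would still need to be corrected for multiple testing.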

1.3.1.2 Integrating different types of data

Integrating complementary data from different sources is helpful for reducing noise and

prioritizing targets [70, 71]. For example, Gortzak-Uzan et al. combined proteins identified in

ovarian cancer ascites with differentially expressed genes from CDIP and protein-protein

interactions from the Interologous Interaction Database (I2D) [72] to identify putative

biomarkers for early ovarian cancer detection in serum [71]. Methods of gene set analysis [73]

can be used to combine experimental data with annotation and pathway databases such as the

Gene Ontology [74], KEGG, and Reactome; these allow us to convert lists of differentially

expressed genes into lists of differentially expressed gene groups, which are more stable across

studies [51]. Gene set analysis is a crucial tool for most bioinformatics investigations, and some

form of it is used in each of Chapters 2-5. Gene set analysis methods differ in their choice of

gene-level statistics, the way they combine these into gene-set statistics, and their methods for

assigning significance [73].

One popular algorithm for analyzing gene sets is Gene Set Enrichment Analysis (GSEA) [51].

GSEA is the core algorithm used to combine drug response data with gene expression signatures

of disease in the Broad Institute’s Connectivity Map (CMap;

http://www.broadinstitute.org/cmap/) [75], a resource that we make extensive use of in Chapters

3 and 5; we also implement the same procedure as the first step of our CMapBatch algorithm

(Chapter 5). CMap (build 02) contains data on the gene expression responses of human cell lines


to 6100 drug treatments (corresponding to 1309 unique drugs; for some drugs, drug dosage or

cell line was varied). For each drug treatment, probe sets are ranked in order of most up-

regulated to most down-regulated in response to the treatment. As previously described [75], the

CMap online tool takes as input a gene signature of disease from a separate experiment – for

example, a set of genes up-regulated and a set of genes down-regulated with diabetes – and

applies GSEA to identify drugs that reverse these harmful gene expression changes.

Formally, for each drug treatment i, and for each of the two sets of signature genes, called tag

lists (the up-regulated genes and the down-regulated genes), the CMap resource computes a

Kolmogorov-Smirnov statistic, $ks^{up}_i$ and $ks^{down}_i$, and then combines these statistics into a

“connectivity score”. These statistics are calculated as follows.

Let n be the number of probe sets for which drug-treatment response was measured (in CMap, n

= 22,283), t the number of probe sets in the tag list, and V a vector $(v_1, \ldots, v_t)$ of the rank of each

probe set in the tag list for treatment instance i (so each $v_j \in \{1, \ldots, 22283\}$), sorted in ascending

order.

Define a and b as:

$$a = \max_{j=1,\ldots,t} \left[ \frac{j}{t} - \frac{v_j}{n} \right]$$

$$b = \max_{j=1,\ldots,t} \left[ \frac{v_j}{n} - \frac{j-1}{t} \right]$$

Then the Kolmogorov-Smirnov statistic for the tag list and instance is calculated as:

$$ks_i = \begin{cases} a & \text{if } a > b \\ -b & \text{if } b > a \end{cases}$$

Repeating these calculations using the set of up-regulated genes and the set of down-regulated

genes from the query signature yields $ks^{up}_i$ and $ks^{down}_i$, respectively. Then if

$\mathrm{sgn}(ks^{up}_i) = \mathrm{sgn}(ks^{down}_i)$, set the connectivity score $S_i = 0$. Otherwise, set

$s_i = ks^{up}_i - ks^{down}_i$, $p = \max_i(s_i)$, and $q = \min_i(s_i)$, and calculate the connectivity score as:

$$S_i = \begin{cases} s_i / p & \text{if } s_i > 0 \\ -(s_i / q) & \text{if } s_i < 0 \end{cases}$$

For a given input signature, the connectivity scores will range from -1 to 1; large negative scores

correspond to drugs that reverse the gene expression changes in the query signature. The

Connectivity Map has been applied to identify new therapeutics for a wide range of diseases

including various cancers (e.g., [76, 77]).
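
The connectivity score calculation can be sketched as follows. This Python fragment is an illustration under stated assumptions rather than the CMap implementation: it takes a to be the maximum over j of j/t - v_j/n and b to be the maximum over j of v_j/n - (j-1)/t, returns 0 when a and b are tied (a case the description above does not cover), and uses hypothetical function names. Each treatment instance is represented only by the ranks of the tag-list probe sets within that instance.

```python
def ks_statistic(tag_ranks, n):
    """Signed Kolmogorov-Smirnov statistic for one tag list in one treatment instance.

    tag_ranks: ranks (1..n) of the tag-list probe sets; n: total number of probe sets.
    """
    v = sorted(tag_ranks)
    t = len(v)
    a = max(j / t - v[j - 1] / n for j in range(1, t + 1))
    b = max(v[j - 1] / n - (j - 1) / t for j in range(1, t + 1))
    if a > b:
        return a
    if b > a:
        return -b
    return 0.0  # tie: an assumption, since this case is not specified


def sign(x):
    """Sign of x as -1, 0, or 1."""
    return (x > 0) - (x < 0)


def connectivity_scores(up_ranks, down_ranks, n):
    """Connectivity scores for all instances.

    up_ranks[i] and down_ranks[i] hold the ranks of the up- and down-regulated
    tag lists in treatment instance i.
    """
    raw = []
    for up, down in zip(up_ranks, down_ranks):
        ks_up, ks_down = ks_statistic(up, n), ks_statistic(down, n)
        # Same sign for both tag lists means no consistent relation: score 0.
        raw.append(0.0 if sign(ks_up) == sign(ks_down) else ks_up - ks_down)
    p, q = max(raw), min(raw)
    scores = []
    for s in raw:
        if s > 0:
            scores.append(s / p)      # scale positive scores into (0, 1]
        elif s < 0:
            scores.append(-(s / q))   # scale negative scores into [-1, 0)
        else:
            scores.append(0.0)
    return scores
```

Because p and q are taken over all instances, the scores are relative to the set of treatments being compared: the instance that most strongly reverses the query signature always receives a score of -1.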

1.3.1.3 Integrating data to predict gene function

Though many disease-related genes remain uncharacterized, HTP data and in silico methods can

be applied to assign them putative functions [78]. For example, in the successful MouseFunc

prediction challenge [79], teams of scientists competed to predict mouse gene function (Gene

Ontology categories) on the basis of several different data sources, including expression,

sequence, interactions, phenotype annotations, disease associations and phylogenetic profiles.

Several groups have integrated data from a variety of independent sources to create functional

interaction networks and predict new gene functions. For example, Wormnet [80] combines co-

expression, protein interaction, genetic interaction, and co-citation data to predict functional

relations between pairs of C. elegans genes. Individual data sources were weighted by a log-

likelihood score designed to reflect their ability to recover shared Gene Ontology functions, and


then combined to create weighted edges between genes. Wormnet edges were then applied to

identify new gene functions using a simple guilt-by-association approach. For example, the

authors demonstrated that Wormnet neighbors of genes that increase lifespan (identified from an

RNAi screen) were themselves highly enriched for lifespan-increasing genes (identified in

independent RNAi screens).
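
A simple guilt-by-association scheme of this kind can be sketched as below. The fragment is a hypothetical illustration, not the Wormnet procedure: it assumes that weighted edges are already available, and it scores each candidate gene by the summed weight of its edges to a set of seed genes with a known function. Gene names and weights are placeholders.

```python
def guilt_by_association(edges, seed_genes):
    """Rank non-seed genes by the total weight of their edges to seed genes.

    edges: dict mapping (gene_a, gene_b) tuples to a positive edge weight.
    seed_genes: set of genes already annotated with the function of interest.
    Returns a list of (gene, score) pairs, best candidates first.
    """
    scores = {}
    for (a, b), weight in edges.items():
        # Edges are undirected, so consider both orientations.
        for gene, partner in ((a, b), (b, a)):
            if partner in seed_genes and gene not in seed_genes:
                scores[gene] = scores.get(gene, 0.0) + weight
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Placeholder network: two seed genes and three uncharacterized candidates.
edges = {
    ("seed1", "geneX"): 2.0,
    ("seed2", "geneX"): 1.5,
    ("seed1", "geneY"): 0.5,
    ("geneY", "geneZ"): 3.0,
}
ranking = guilt_by_association(edges, {"seed1", "seed2"})
```

Genes with no direct edge to a seed (such as the placeholder geneZ) receive no score under this direct-neighbor scheme.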

1.3.2 NETWORK ANALYSIS

Network approaches have been successful in addressing the analysis-based challenges of HTP

biology. Genes do not act in isolation; they form highly complex and interlinked molecular

networks. Examining genes in the context of these networks can yield valuable clues about their

function and relations, and expand our knowledge of individual pathways and their interactions

[81, 82]. For example, despite the noise in current protein interaction data sets, network analysis

can uncover biologically relevant information, such as lethality [83, 84], functional organization

[85-87], hierarchical structure [88, 89], modularity [90] and network-building motifs [91-93].

Three important applications of networks to aging and disease research include signature

generation, signature interpretation, and disease gene prediction.

1.3.2.1 Networks for signature generation

By integrating network information with gene expression data, we can identify predictive

signatures that perform better and are more conserved across studies than signatures based on

gene expression data alone. There are many ways of using networks to create improved gene

signatures. One class of methods that has proven very successful is score-based subnetwork


biomarkers [52, 94-97]. In these approaches, genes are aggregated around an initial “seed” gene

in a network to generate subnetworks whose pooled activity levels can be used to predict the

value of some response variable, such as disease status or survival time. For example, Chuang et

al. [52] calculated subnetwork scores as the mutual information [98] between subnetwork

activity (the mean normalized expression of subnetwork genes) and the class label (cancer vs.

normal). Subnetworks were grown outward iteratively from a seed node using a greedy search

procedure to maximize subnetwork score: at every step, the network neighbor of the current

subnetwork yielding the largest score increase was added to the subnetwork. Subnetworks

identified using this approach were shown to be highly conserved across studies, and to perform

better than individual genes or pre-defined gene groups at predicting breast cancer metastasis

[52]. Importantly, many crucial genes belonging to subnetwork biomarkers were not

differentially expressed on their own, demonstrating the added value of a network approach.

Related approaches have been used to develop subnetwork biomarkers for colon cancer using a

combination of proteome and transcriptome data [99, 100].
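
The greedy search can be sketched as follows. This Python fragment illustrates the general strategy only and does not reproduce the published procedure: it assumes expression values are already normalized, discretizes subnetwork activity into three equal-width bins before computing mutual information, and stops when no neighbor improves the score. The bin count, the stopping threshold, and all names are assumptions.

```python
import math
from collections import Counter


def activity(genes, expr):
    """Mean expression of the subnetwork genes in each sample (expr[gene] is a list)."""
    n_samples = len(next(iter(expr.values())))
    return [sum(expr[g][s] for g in genes) / len(genes) for s in range(n_samples)]


def mutual_information(values, labels, n_bins=3):
    """Mutual information (in nats) between equal-width-binned values and class labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # constant values fall into a single bin
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    joint, p_bin, p_lab = Counter(zip(bins, labels)), Counter(bins), Counter(labels)
    return sum(
        (c / n) * math.log((c / n) / ((p_bin[b] / n) * (p_lab[y] / n)))
        for (b, y), c in joint.items()
    )


def grow_subnetwork(seed, neighbors, expr, labels, max_size=5, min_gain=1e-9):
    """Greedily add the network neighbor that most increases the subnetwork score."""
    members = {seed}
    score = mutual_information(activity(members, expr), labels)
    while len(members) < max_size:
        frontier = {n for m in members for n in neighbors[m]} - members
        scored = [
            (mutual_information(activity(members | {c}, expr), labels), c)
            for c in sorted(frontier)
        ]
        if not scored:
            break
        best_score, best_gene = max(scored)
        if best_score - score <= min_gain:
            break  # no neighbor improves the score
        members.add(best_gene)
        score = best_score
    return members, score


# Toy data: gene B is uninformative alone but cancels the noise in gene A.
labels = [0, 0, 0, 1, 1, 1]
expr = {"A": [0, 3, 0, 2, 5, 2], "B": [0, -3, 0, 0, -3, 0], "C": [1, 0, 1, 0, 1, 0]}
neighbors = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
members, score = grow_subnetwork("A", neighbors, expr, labels)
```

In the toy data, gene B carries no class information by itself, yet adding it to the seed gene A yields a subnetwork whose activity separates the two classes. This mirrors the observation that useful subnetwork members need not be differentially expressed on their own.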

1.3.2.2 Networks for signature interpretation

Many genes that play a role in predictive signatures for a given disease have not been previously

linked to that disease, and thus can be considered novel candidate disease genes. Networks can

be used to link these genes with known disease mechanisms and pathways [101-104]. Gene

signatures mapped to protein interactions can be further annotated with other profiles (including

proteomic, CGH, and miRNA studies), and with network structures, such as graphlets [93, 105].

Networks can also reveal new connections between different signatures. For example, though a

recently-identified 15-gene prognostic and predictive signature in lung cancer [106] did not


directly overlap with previously published ones, network analysis revealed that they were highly

related: there were direct interactions between the protein products of genes from the new

signature and others. Similar results have been shown in other studies [14, 107].

1.3.2.3 Networks for identifying new disease genes

Analyses of the network connectivity of disease genes have shown that they can be characterized

by several topological properties. For example, proteins encoded by cancer genes tend to be

central in interaction networks (they have high degree and betweenness centrality [108-110]),

have high clustering coefficients [111], and are overrepresented in network motifs [110].

Several methods use the topological characteristics of known disease genes, in combination with

other features (such as Gene Ontology (GO) categories, protein domains, biological pathways,

and sequence features), to predict new disease genes [110, 112] or functional SNPs [113]. Many

algorithms identify modules in interaction networks, or groups of densely interconnected genes

that can be highly functionally related [114-117]. Module-finding algorithms can also be applied

to predict new disease genes [52, 118, 119]. In these approaches, graph modules are first

determined from network topology; next, a statistical test such as the hypergeometric test is

applied to identify functional categories that individual modules are enriched for; and finally, all

genes in a module are annotated with its enriched functions.
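
The enrichment step can be sketched with a one-sided hypergeometric test. The Python fragment below is a minimal illustration with hypothetical names; it assumes that the module is a subset of the background gene set, uses an uncorrected significance threshold, and omits the multiple-testing correction that a real analysis would need.

```python
from math import comb


def enrichment_pvalue(n_background, n_annotated, module_size, n_overlap):
    """P(X >= n_overlap) for a hypergeometric draw of module_size genes.

    n_background: genes in the network; n_annotated: background genes in the category.
    """
    upper = min(n_annotated, module_size)
    tail = sum(
        comb(n_annotated, x) * comb(n_background - n_annotated, module_size - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, module_size)


def annotate_module(module, annotations, background, alpha=0.05):
    """Assign every enriched category to all genes in the module.

    annotations maps a category name to the set of genes carrying it.
    """
    enriched = []
    for category, genes in sorted(annotations.items()):
        annotated = genes & background
        overlap = len(annotated & module)
        p = enrichment_pvalue(len(background), len(annotated), len(module), overlap)
        if p < alpha:
            enriched.append(category)
    return {gene: list(enriched) for gene in module}
```

Under this scheme a poorly characterized gene inherits the enriched categories of its module, which is how module membership turns into a functional prediction.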

There are many distinct approaches to using network topology to identify modules. Some methods,

like Restricted Neighborhood Search Clustering (RNSC) [114], partition a network into disjoint

modules. In RNSC, network nodes are first randomly assigned to modules, and then at every

iteration of the algorithm one node is moved to a different module to reduce a cost function that


depends on the number of intramodule and intermodule edges. Other methods, such as that

proposed by Lancichinetti and Fortunato [120], use only local network topology and allow for

the possibility of overlapping modules. In that algorithm, modules are iteratively grown out from

individual seed nodes to greedily maximize a measure of modularity that is a function of the

number of edges internal to the module (i.e., edges connecting two internal nodes) and the total

number of edges connected to module nodes.
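
Local module growth of this kind can be sketched as follows. The fitness function used here, k_in / (k_in + k_out)^alpha, with k_in twice the number of internal edges and k_out the number of edges leaving the module, is an assumption chosen for illustration; the sketch also omits refinements such as removing nodes after each addition. All names are hypothetical.

```python
def fitness(module, neighbors, alpha=1.0):
    """Local fitness of a module: internal degree over total degree raised to alpha."""
    # Each internal edge is counted once from each endpoint, so k_in is
    # twice the number of internal edges.
    k_in = sum(1 for u in module for v in neighbors[u] if v in module)
    k_out = sum(1 for u in module for v in neighbors[u] if v not in module)
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


def grow_module(seed, neighbors, alpha=1.0):
    """Grow a module from a seed node, adding the neighbor that most increases fitness."""
    module = {seed}
    current = fitness(module, neighbors, alpha)
    while True:
        frontier = {v for u in module for v in neighbors[u]} - module
        scored = [(fitness(module | {v}, neighbors, alpha), v) for v in sorted(frontier)]
        if not scored:
            break
        best, node = max(scored)
        if best <= current:
            break  # no neighbor increases the fitness
        module.add(node)
        current = best
    return module


# Two triangles joined by a single bridge edge (3-4).
graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
left = grow_module(1, graph)
right = grow_module(6, graph)
```

Because each module is grown independently from its own seed, modules started from different seeds can share nodes, which is how overlapping modules arise in this family of methods.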

1.4 Research contributions of this thesis

The four projects constituting the research contributions of this thesis are concerned with

identifying biomarkers and therapeutics for aging and disease. Each project uses several of the

techniques described above to help deal with the challenges of HTP biology. In Chapter 2, I

proposed a new network analysis method for identifying biomarkers of aging that uses gene

modules rather than individual genes, and showed that it improved biomarker robustness and

classification performance. In Chapter 4, I adapted gene-level HTP chemogenomic data to study

drug response at the systems level; I connected drugs to pathways, phenotypes and networks, and

built the NetwoRx web portal (http://ophid.utoronto.ca/networx/) to make these data publicly

available. And in Chapters 3 and 5, I developed a novel meta-analysis pipeline to identify new

drugs that mimic the beneficial gene expression changes seen with calorie restriction (Chapter 3),

or that reverse the pathological gene changes associated with lung cancer (Chapter 5). In these

applications, noise and biological heterogeneity are mitigated by conducting meta-analyses using

large numbers of gene signatures (Chapters 3, 5), and modeling drug and disease responses at the

level of gene groups (Chapter 4) or subnetwork modules (Chapter 2) rather than individual

genes; and also by integrating multiple complementary HTP data sources, e.g., gene expression


data with genome-wide RNAi phenotypes (Chapter 2) or large scale drug-induced growth-

inhibition data (Chapter 5).

Chapter 2: Inferring the functions of longevity genes with modular subnetwork biomarkers

of Caenorhabditis elegans aging.

Kristen Fortney, Max Kotlyar, and Igor Jurisica (2010). Genome Biology 11(2):R13. [97]

In this study we developed a new method to identify gene expression biomarkers of aging. We

overlaid expression data from two worm aging studies onto a functional interaction network, and

identified modular subnetwork biomarkers using a new performance criterion that trades off

modularity – internal cohesiveness at the network level – with relatedness to the class label (here,

worm age). We found that our method outperformed previous ones on key measures, yielding

biomarkers that were more conserved across studies and performed better on a difficult machine

learning task: predicting age based on expression data. We analyzed modular subnetwork

biomarkers to determine their relation to known mechanisms of aging, and found that they play

central roles in metabolic and DNA repair pathways, and are significantly enriched for longevity

genes. Finally, we applied them to assign putative aging-related functions to poorly characterized

longevity genes.

Chapter 3: In silico drug screen in mouse liver identifies candidate calorie restriction

mimetics.


Kristen Fortney, Eric Morgen, Max Kotlyar, and Igor Jurisica (2012). Rejuvenation Research

15(2). [121]

In this study we conducted a meta-analysis using multiple gene signatures to identify drugs that

mimic calorie restriction. Calorie restriction (CR) extends lifespan in mammals and can delay the

onset of age-related diseases, including cancer and diabetes. Drugs that target the same genes and

pathways as CR may have enormous therapeutic potential. We collected nine previously

published gene expression signatures of CR and screened them against HTP drug-response data

for over 1000 drugs (from the Connectivity Map) to obtain sets of drugs that mimic CR at the

transcriptional level. We implemented a novel meta-analysis method and identified 14 drugs that

consistently mimic CR across signatures. We characterized these drugs by relating them to

known lifespan-extending drugs and analyzed them using a mode-of-action network.

Chapter 4: NetwoRx: Connecting drugs to networks and phenotypes in S. cerevisiae.

Kristen Fortney, Wing Xie, Max Kotlyar, Joshua Griesman, Yulia Kotseruba and Igor Jurisica

(2012). Submitted. [122]

In this study we adapted gene-level HTP chemogenomic data to study drug response at the

systems level. We integrated the three largest S. cerevisiae chemogenomic experiments, which

together comprise the responses of thousands of gene knockout strains to 466 drugs, and applied

data-mining approaches to investigate drug effects at the system level. We identified yeast

pathways, functions, and phenotypes that are targeted by particular drugs, computed measures of

drug-drug similarity, and constructed drug-phenotype networks. We created the NetwoRx web

portal (http://ophid.utoronto.ca/networx/) to make the results of our analyses fully available and


to facilitate new systems-level analyses of drug response. We demonstrated with use case

examples how NetwoRx can be applied to target specific phenotypes, repurpose drugs using

mode-of-action analysis, investigate bipartite networks, and predict new drugs that affect yeast

aging.

Chapter 5: Computationally repurposing drugs for lung cancer with CMapBatch:

candidate therapeutics from an integrative meta-analysis of cancer gene signatures and

chemogenomic data.

Kristen Fortney, Joshua Griesman, Max Kotlyar, and Igor Jurisica (2012). In Preparation. [123]

Though existing methods that use signatures to repurpose drugs are based on the analysis of

individual signatures, for many diseases dozens of gene signatures are in the public domain. We

developed a new meta-analysis method, CMapBatch, to exploit these data, and made it publicly

available at http://ophid.utoronto.ca/cmapbatch. CMapBatch is a computational meta-analysis

pipeline that takes as input a collection of gene signatures of disease and outputs a list of drugs

predicted to consistently reverse pathological gene changes. We applied CMapBatch to a

collection of 21 gene expression signatures of lung cancer. We demonstrated that, while drug

candidates identified by Connectivity Map analysis of individual gene signatures are highly

variable, CMapBatch returns very stable sets of top drug candidates. Our meta-analysis of all 21

signatures revealed that 247 drugs consistently reversed lung cancer gene changes. In silico

validation on the NCI-60 collection showed that drug candidates significantly inhibit growth in

nine lung cancer cell lines.


2 Inferring the functions of longevity genes with

modular subnetwork biomarkers of Caenorhabditis

elegans aging

This chapter is based on:

Kristen Fortney, Max Kotlyar, and Igor Jurisica (2010). Inferring the functions of longevity

genes with modular subnetwork biomarkers of Caenorhabditis elegans aging. Genome Biology

11(2):R13.

2.1 Abstract

A central goal of biogerontology is to identify robust gene-expression biomarkers of aging. Here

we develop a method where the biomarkers are networks of genes selected based on age-

dependent activity and a graph-theoretic property called modularity. Tested on C. elegans, our

algorithm yields better biomarkers than previous methods — they are more conserved across

studies and better predictors of age. We apply these modular biomarkers to assign novel aging-

related functions to poorly characterized longevity genes.


2.2 Introduction

Aging is a highly complex biological process involving an elaborate series of transcriptional

changes. These changes can vary substantially in different species, in different individuals of the

same species, and even in different cells of the same individual [39, 124, 125]. Because of this

complexity, transcriptional signatures of aging are often subtle, making microarray data difficult

to interpret – more so than for many diseases [42, 63]. Interaction networks represent prior

biological knowledge about gene connectivity that can be exploited to help interpret complex

phenotypes like aging [126, 127]. Here, for the first time, we integrate networks with gene

expression data to identify modular subnetwork biomarkers of chronological age.

With few exceptions, previous analyses of aging microarray data have been limited to studying

the differential expression of individual genes. However, single-gene analyses have been

criticized for several reasons. Briefly, they are insensitive to multivariate effects and often lead to

poor reproducibility across studies [49, 50, 58] – even random subsets of data from the same

experiment can produce widely divergent lists of significant genes. Recent studies have shown

that examining gene expression data at a systems level – in terms of appropriately chosen groups

of genes, rather than single genes – offers several advantages. Compared to significant genes,

significant gene groups are more replicable across different studies, lead to higher performance

in classification tasks, and are more biologically interpretable [49, 52].

Many complementary approaches to the systems-level analysis of microarray data have been

proposed. These range from methods like Gene Set Enrichment Analysis [51], which determines

whether members of pre-defined groups of biologically related genes (such as those supplied by

the Gene Ontology [74]) share significantly coordinated patterns of expression, to machine

learning methods that consider all possible combinations of genes and identify groups whose


combined expression pattern can distinguish between different phenotypes – with no constraint

that the genes in a group must be biologically related.

Network methods for interpreting gene expression data [52, 95, 128-132] fall in between these

two extremes: they incorporate prior biological knowledge in the form of an interaction network

– so that genes in a significant group are likely to participate in shared functions – but they

consider many different combinations of genes, and so are more flexible than methods using pre-

defined gene groups. Gene groups identified by these methods constitute novel biological

hypotheses about which genes participate together in common functions related to the class

variable.

Here, we propose a novel strategy for identifying subnetwork biomarkers: we incorporate a

measure of topological modularity into the expression for subnetwork score. This yields

subnetwork biomarkers that are biologically cohesive and that have different activity levels at

different ages. Using two aging microarray datasets, we show that our method improves on

previous approaches, yielding subnetworks that are more conserved across studies, and that

perform better in a machine learning task. We identify the subnetworks that play a role in worm

aging, and then explore their connection with known longevity genes. Finally, we apply them to

assign putative aging-related functions to longevity genes (genes that affect lifespan when

deleted or perturbed). Worm is the ideal model organism for studying these questions, since it

has the largest number of characterized longevity genes [64], and microarray datasets using

worms of 4 or more ages are publicly available [125, 133]. Our work builds on a family of

successful algorithms that incorporate supervised information to find subnetworks with

phenotype-dependent activity, which we discuss below.


2.2.1 Methods for extracting active subnetworks by integrating gene expression

data, network connectivity, and supervised class labels

To date, some of the most successful network-based methods of gene group identification for

class prediction have been the score-based subnetwork markers originally proposed in Ideker et

al. [94] and developed and expanded in later works, e.g. [52, 95, 128, 129, 134, 135].

Subnetworks identified using these approaches were recently shown to be highly conserved

across studies and to perform better than individual genes or pre-defined gene groups at

predicting breast cancer metastasis [52].

Most of these methods share the same basic architecture. Each algorithm aggregates genes

around a seed node in a way that maximizes some measure of performance. In previous

implementations, the score is a function of the subnetwork activity (often calculated as the mean

expression value of the genes in the subnetwork) and the class label – i.e. subnetworks get high

scores if their activity is different for different classes. Subnetworks are grown outward

iteratively from a seed node, typically using a greedy search procedure to maximize subnetwork

score: at every step, the network neighbour of the current subnetwork yielding the largest score

increase is added to the subnetwork.

Subnetwork scores are calculated differently in individual implementations (e.g. [95] uses the t-

statistic and [52] uses mutual information) but are always solely a function of what we refer to as

class relevance, i.e. of expression data and class labels. In particular, in all previous

implementations the subnetwork score is insensitive to network topology – the only topological

constraint is that subnetwork members must form a connected component.

However, a large body of work in network theory has demonstrated the value of more

sophisticated topological measures of network cohesiveness, or modularity [116, 117]. In fact,

many algorithms successfully identify groups of functionally related genes on the basis of

network topology alone. The simple intuition behind these algorithms is that genes that are

members of a highly interconnected group (that is only sparsely connected to the rest of the

network) are more likely to participate in the same biological function or process. In biological

networks, genes belonging to the same topological module are more likely to share functional

annotations or belong to the same protein complex [114, 115, 136].

No score-based subnetwork method proposed to date takes advantage of the rich modular

structure of biological interaction networks. Here, we propose incorporating topological

modularity into the expression for subnetwork score, and show that this approach offers

important advantages – increased conservation across studies, and improved performance on a

learning task. For the remainder of the paper, we refer to subnetworks grown using scores that

are a function of class relevance alone as regular subnetworks, and to those grown using our new

scoring criterion as modular subnetworks.

2.3 Results and Discussion

2.3.1 Identifying active subnetworks in aging by trading off network modularity

and class relevance

Here, we give a basic outline of our method for identifying subnetworks that are both highly

modular and relevant to the class variable (Fig. 2-1), and then we discuss the novel aspect – the

subnetwork scoring method – in detail; other algorithm parameters are listed in Materials and

Methods. We compared the performance characteristics of modular and regular subnetworks

using two microarray studies of worm aging [125, 133].

Figure 2-1. High-scoring subnetworks fulfill two criteria: they are modular and related to

aging.

A. High-scoring subnetworks have high modularity, i.e., they are highly interconnected, and sparsely

connected to the rest of the network. B. High-scoring subnetworks have high class relevance, i.e. they

have activity levels that increase or decrease as a function of worm age.

2.3.2 Identifying modular subnetworks

Our method is summarized in Fig. 2-2. First, we assign a weight to every edge in the interaction

network that reflects the strength of the relation between the two genes that flank it (quantified

using Spearman correlation). For genes i and j with normalized expression vectors $z_i$ and $z_j$,

the weight $w_{ij}$ is defined as:

$$w_{ij} = \begin{cases} \left|\mathrm{corr}(z_i, z_j)\right| & \text{if there is a network edge between nodes } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$$
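
The edge-weighting step can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the gene names, expression values, and helper functions (`average_ranks`, `pearson`, `spearman`, `edge_weights`) are all hypothetical, and the Spearman correlation is computed as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def edge_weights(edges, expression):
    """Map each network edge (i, j) to |Spearman correlation| of its two genes.

    Gene pairs without a network edge are simply absent (weight 0).
    """
    return {(i, j): abs(spearman(expression[i], expression[j])) for i, j in edges}


# Hypothetical toy data: three genes measured in four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [8.0, 6.0, 5.0, 1.0],  # falls as geneA rises
    "geneC": [2.0, 1.0, 4.0, 3.0],
}
edges = [("geneA", "geneB"), ("geneA", "geneC")]
weights = edge_weights(edges, expression)
```

Because the absolute value is taken, a perfectly anti-correlated pair such as geneA and geneB receives the same weight as a perfectly correlated one.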

Next, we grow subnetworks starting at particular seed genes in the network (see Materials and

Methods). At each stage of the network growth procedure, the algorithm considers all network

neighbours of the current subnetwork N. For each neighbour, the algorithm calculates the change

in subnetwork score that would result if that neighbour were added to N. Here, we define the

subnetwork score S as a weighted sum of class relevance R and modularity M, where R captures

how related subnetwork activity is to age and M measures subnetwork cohesiveness:

$$S = R + \beta M \quad \text{for some } \beta \geq 0$$

At every stage, the neighbour that leads to the highest score increase (without reducing either

class relevance or modularity) is added to the subnetwork.
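
The growth procedure can be sketched as follows. This is an illustrative outline only: `relevance` and `modularity` are hypothetical callables standing in for R and M, the toy network is invented, and stopping when no neighbour yields a positive score increase is our simplifying assumption (the actual halting parameters are listed in Materials and Methods).

```python
def grow_subnetwork(seed, neighbours, relevance, modularity, beta):
    """Greedily grow a subnetwork from `seed`, maximizing S = R + beta * M."""
    members = {seed}
    r, m = relevance(members), modularity(members)
    while True:
        # Candidates are the network neighbours of the current subnetwork.
        frontier = set().union(*(neighbours[g] for g in members)) - members
        best = None
        for cand in sorted(frontier):  # sorted for reproducible tie-breaking
            trial = members | {cand}
            r_new, m_new = relevance(trial), modularity(trial)
            if r_new < r or m_new < m:  # neither term may decrease
                continue
            gain = (r_new + beta * m_new) - (r + beta * m)
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, cand, r_new, m_new)
        if best is None:  # assumed stopping rule: no improving neighbour
            return members
        _, cand, r, m = best
        members.add(cand)


# Hypothetical toy network and scoring functions.
neighbours = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
informative = {"a", "b", "d"}


def toy_relevance(members):
    return 0.3 * len(members & informative)


def toy_modularity(members):
    return 0.01 * len(members & informative)


grown = grow_subnetwork("a", neighbours, toy_relevance, toy_modularity, beta=100)
```

In this toy case the uninformative node "c" is never added, because it raises neither term of the score.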

The intuition behind the modularity coefficient β is that it allows us to trade off the information

in gene expression data with the prior knowledge about gene connectivity encoded in the

functional interaction network: for noisy microarray studies, or ones with few samples, we

should place a greater emphasis on prior knowledge by choosing higher values for β. Previous

subnetwork scoring algorithms effectively assume that β = 0, or S = R.

Figure 2-2. Identifying modular subnetworks.

A. Start with the largest connected component of the functional interaction network representing all

genes whose expression has been measured. B. Weight every edge of the network with the absolute

value of the Spearman correlation between the two genes flanking it. C. Identify age-related

subnetworks by growing subnetworks iteratively out from seed nodes.

2.3.3 Class relevance R

We measure class relevance as the Spearman correlation between subnetwork activity and age,

so that a subnetwork is considered age-related to the extent that its activity level either increases

or decreases monotonically with increasing age (Fig. 2-1B). Subnetwork activity is calculated as

the mean expression level of subnetwork genes. Thus, if the genes in subnetwork N have

normalized expression vectors $\{z_1, \ldots, z_n\}$, and c is the vector of ages for each sample, then the

activity is $a = \frac{1}{n}\sum_{i=1}^{n} z_i$, and the class relevance is $R = |\mathrm{corr}(a, c)|$.
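
As a minimal sketch of this calculation, assuming the absolute value of the Spearman correlation is used (so that increasing and decreasing trends score alike), class relevance can be computed as below. The helper names and toy data are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            result[k] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def class_relevance(expression_vectors, ages):
    """R = |corr(a, c)|, where a is the per-sample mean over subnetwork genes."""
    n_genes = len(expression_vectors)
    activity = [sum(sample) / n_genes for sample in zip(*expression_vectors)]
    return abs(spearman(activity, ages))


# Hypothetical subnetwork of two genes measured in four samples.
z = [[4.0, 3.0, 2.0, 1.0],
     [8.0, 6.0, 4.0, 2.0]]
ages = [1, 2, 3, 4]
relevance = class_relevance(z, ages)  # activity falls monotonically with age
```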

2.3.4 Network modularity M

To define the modularity of a connected set of genes in a network, we use a weighted

generalization of the local measure proposed in Lancichinetti and Fortunato [120]. We calculate

the modularity for a subnetwork as the edge weight internal to the subnetwork divided by the

total edge weight of all subnetwork nodes, squared. For subnetwork N, we define the internal,

external, and total weight:

$$w_{int} = \frac{1}{2}\sum_{i,j \in N} w_{ij}$$

$$w_{ext} = \sum_{i \in N,\, j \notin N} w_{ij}$$

$$w_{tot} = w_{int} + w_{ext}$$

Then the modularity of N can be written as $M = \frac{w_{int}}{(1 + w_{tot})^2}$. For all subnetworks, M lies between 0

and 1.
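
The following sketch illustrates the calculation on a toy network. It assumes that the denominator has the form (1 + w_tot)^2, which keeps M between 0 and 1; that form, the `modularity` function, and the edge weights are assumptions made for illustration.

```python
def modularity(members, weights):
    """Local weighted modularity of the node set `members`.

    `weights` maps frozenset({i, j}) -> w_ij for every network edge.
    Each undirected edge is stored once, so summing internal edges directly
    matches the halved double sum over ordered pairs.
    """
    w_int = 0.0
    w_ext = 0.0
    for edge, w in weights.items():
        inside = len(edge & members)
        if inside == 2:      # both endpoints in the subnetwork
            w_int += w
        elif inside == 1:    # edge leaves the subnetwork
            w_ext += w
    w_tot = w_int + w_ext
    return w_int / (1.0 + w_tot) ** 2  # assumed form of the denominator


# Hypothetical path network a - b - c - d with invented weights.
weights = {
    frozenset({"a", "b"}): 0.5,
    frozenset({"b", "c"}): 0.5,
    frozenset({"c", "d"}): 1.0,
}
m_ab = modularity({"a", "b"}, weights)  # w_int = 0.5, w_ext = 0.5
```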

2.3.5 Comparing regular and modular subnetworks

To compare the performance of regular and modular subnetworks, we generated several

subnetworks of each type by adjusting algorithm parameters. For modular subnetworks, we set

the modularity coefficient β = 50, 100, 250, 500, or 1000 (significant subnetworks generated

using these parameters are called m1, m2, m3, m4 and m5). For regular networks we set β = 0,

and halted subnetwork growth at different score cut-off thresholds r = 0.01, 0.02, 0.05, 0.1 or 0.2

(groups of significant subnetworks are called r1, r2, r3, r4, and r5).

We generated modular subnetworks m1-m5 and regular subnetworks r1-r5 separately for

two different C. elegans aging microarray datasets: 104 microarrays of individual wild-type (N2)

worms over 7 ages (9-17 microarrays per age) [125], and 16 microarrays of pooled sterile

(fer-15) worms over 4 ages (4 microarrays per age) [133]. For each study, we grew subnetworks

seeded at every node in the functional interaction network, so that corresponding subnetworks

grown using different expression datasets could be directly compared. We used randomization

tests to determine which subnetworks were significantly associated with age in each study. For

further details, see Materials and Methods. Below, we compare these regular and modular

subnetworks in terms of their robustness across studies and performance on a machine learning

task.

2.3.6 Modular subnetworks are more robust across studies than regular

subnetworks

Comparing the modular subnetworks m1-m5 and the regular subnetworks r1-r5 derived from

both studies, we found that modular subnetworks identified as significant in one study were

highly likely to be significant in the other study (i.e., seed genes of significant modular

subnetworks were highly conserved across studies). Fig. 2-3 shows that 15-18% of significant

modular subnetworks were identified in both studies; in contrast, only 3-5% of significant

regular ones were.

For each modular and regular network type, we also calculated the significance of the overlap

between sets of significant seed genes using the hypergeometric test, and these values showed

the same trend (Fig. 2-3). While all subnetwork types were more conserved across studies than

would be expected by chance ($p < 10^{-3}$), modular subnetworks were much more conserved than

regular ones – they had enrichment p-values ranging from $10^{-84}$ to $10^{-137}$, while regular

subnetworks had p-values from $10^{-3}$ to $10^{-38}$.
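
Both conservation measures are simple to compute. The sketch below shows the percentage overlap (intersection over union of the two sets of significant seed genes) and a one-sided hypergeometric p-value for the overlap. The function names, gene identifiers, and population size are hypothetical and chosen only to keep the example small.

```python
from math import comb


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


def hypergeom_upper_tail(overlap, size_a, size_b, population):
    """P(X >= overlap) when size_b genes are drawn without replacement
    from `population` genes, of which size_a are marked."""
    total = comb(population, size_b)
    upper = min(size_a, size_b)
    hits = sum(
        comb(size_a, k) * comb(population - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return hits / total


# Hypothetical sets of significant seed genes from two studies.
study_a = {"g1", "g2", "g3"}
study_b = {"g2", "g3", "g4"}
overlap_fraction = jaccard(study_a, study_b)
p_value = hypergeom_upper_tail(
    len(study_a & study_b), len(study_a), len(study_b), population=10
)
```

Exact integer arithmetic via `math.comb` avoids underflow in the individual terms, which matters for p-values as small as those reported here.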

While substantially more modular than regular subnetworks were conserved across studies, many

subnetworks were identified in only one study; this can be partially accounted for by noise in the

individual microarray studies, the fact that the two studies used different microarray platforms

and different strains of worm, and the fact that the current functional interaction network is not

complete and contains some errors.

Figure 2-3. Modular subnetworks are highly conserved across studies.

Modular subnetworks m1-m5 are shown in green and regular subnetworks r1-r5 in blue. Bar height

shows the percentage overlap across studies for seed genes of significant modular and regular

subnetworks derived from the data in Golden et al. and Budovskaya et al.; this is calculated as the size of

the intersection of sets of significant seed genes from both studies, divided by the union. P-values above

each bar show the significance of the overlap calculated using the hypergeometric test.

2.3.7 Modular subnetworks trained on aging gene expression data from wild-type

worms successfully predict age in fer-15 worms

We compared the performance of single genes, regular subnetworks, and modular subnetworks

on a machine learning task: predicting worm age on the basis of gene expression levels

(Fig. 2-4). We acquired sets of significant genes from [125]; g1 is made up of all the genes considered

significant in that study, and g2 is the aging gene signature used for machine learning in [125]

(i.e., g2 is the 100 most significant genes from g1). Using machine learning features drawn from

gene sets g1-g2, regular subnetworks r1-r5, or modular subnetworks m1-m5 derived from the

larger microarray study [125], we trained support vector regression (SVR) algorithms to predict

the age of wild-type worms on the basis of gene expression (for details, see Materials and

Methods). We then tested the performance of the learned feature weights on an independent data

set in a different strain of worm (fer-15) [133]. Performance on the test set was quantified as the

squared correlation coefficient (SCC) between worm ages predicted by the SVR and true worm

ages (measuring performance in terms of mean-squared error would be inappropriate here,

because the worms in the training and test sets had different lifespans). All p values reported in

this section were calculated using the Wilcoxon rank-sum test.
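
To illustrate why the squared correlation coefficient is preferred over mean-squared error when training and test populations age on different time scales, consider the toy example below. The ages are invented; the predictor orders the samples perfectly but on a time scale twice as long.

```python
def squared_correlation(predicted, true):
    """Squared Pearson correlation between predicted and true values."""
    n = len(true)
    mp, mt = sum(predicted) / n, sum(true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(predicted, true))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_t = sum((t - mt) ** 2 for t in true)
    return cov * cov / (var_p * var_t)


def mean_squared_error(predicted, true):
    """Mean of the squared differences between predicted and true values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true)


# Hypothetical ages (days): predictions are correct up to a scale factor.
true_age = [3.0, 6.0, 9.0, 12.0]
predicted_age = [2.0 * a for a in true_age]

scc = squared_correlation(predicted_age, true_age)
mse = mean_squared_error(predicted_age, true_age)
```

The SCC is unaffected by the change of scale, whereas the MSE is large even though the predicted ordering and spacing of ages are exactly right.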

To capture the typical performance of machine learners that used either genes or subnetworks as

features, we considered four different sizes of feature set (5, 10, 25, or 50 features). Then, for

each size of feature set, and for each set of genes (g1-g2) or subnetworks (r1-r5, m1-m5), we

performed 1000 tests. For example, for the 25-feature SVRs, and for the m1 significant

subnetworks, we randomly drew 25 subnetworks from m1, trained them on the wild-type worm

data, and then tested them on the fer-15 data – and repeated that process of drawing, training, and

testing 1000 times. Fig. 2-5 summarizes test results at each feature level, showing the typical

performance of the best sets of genes, regular subnetworks, and modular subnetworks. Full

results for every parameter setting are available in Figure 2-S1, and p-value comparisons in

Table 2-S1.
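
The repeated draw-train-test procedure can be outlined as follows. The `evaluate` callable is a hypothetical placeholder for training an SVR on the wild-type data with the drawn features and returning its SCC on the fer-15 data; the toy scores are invented.

```python
import random
from statistics import median


def typical_performance(pool, k, evaluate, n_repeats=1000, seed=0):
    """Median score over repeated random draws of k features from `pool`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        features = rng.sample(pool, k)  # draw k features without replacement
        scores.append(evaluate(features))
    return median(scores)


# Hypothetical stand-in: each "feature" is a number and the score is their mean.
pool = [0.5] * 30


def toy_evaluate(features):
    return sum(features) / len(features)


typical = typical_performance(pool, k=25, evaluate=toy_evaluate)
```

Reporting the median over many random feature sets, rather than the single best set, guards against a favourable draw being mistaken for typical behaviour.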

Over all tests, the SVRs using 25 or 50 modular subnetwork features (of the m1 and m3 types)

achieved the highest typical performance, with a median SCC of 0.91 between predicted and true

worm age; this is a statistically significant 7% and 26% improvement over the best performances

of regular subnetworks ($p < 10^{-83}$) and genes ($p < 10^{-202}$), respectively (Fig. 2-5).

Figure 2-4. Predicting worm age using machine learning.

The activities of genes or subnetworks (subnetwork activity is calculated as the mean activity of its

member genes) are used by Support Vector Regression algorithms to predict age on the basis of gene

expression. Performance is typically measured using both the mean-squared error of the difference

between true and predicted ages, and the squared correlation coefficient between true and predicted

ages.

2.3.8 Subnetworks vs. genes

Modular and regular subnetworks dramatically outperform significant genes across a range of

parameters. For example, using 25 features (Fig. 2-5), the best modular subnetworks have a

median SCC of 0.91 and the best regular subnetworks of 0.85, versus 0.70 for the 100-gene

signature. This result was consistent across feature levels and parameter settings, and is highly

significant for all tests: i.e., for every comparison between modular subnetwork features and

gene features, we have $p < 10^{-15}$. For all sizes of feature set, the best-performing subnetworks

(m3) always showed a median SCC at least 0.16 higher than the best-performing genes (g2), i.e.

at least a 24% improvement.

2.3.9 Modular vs. regular subnetworks

For all sizes of feature set, the median SCC of the best modular subnetwork type always

exceeded that of the best regular subnetwork type by 0.05-0.08, corresponding to a 6-10%

performance improvement (Fig. 2-5). The performance difference between the best modular

subnetworks and the best regular subnetworks is highly significant at all feature levels

($p < 10^{-32}$).

It was not only the best modular subnetworks that outperformed the best regular subnetworks; in

fact, modular subnetworks significantly outperformed the best regular subnetworks for most

parameter settings. With the exception of m5 (β = 1000), each modular subnetwork type

significantly outperforms the best regular subnetwork type at all feature levels. For three types of

modular subnetwork (m1-m3), the performance difference between them and the best regular

subnetworks is highly significant (rank-sum $p < 10^{-26}$ for every comparison); m4 outperforms the

best regular subnetworks at $p < 10^{-5}$ for three feature levels, and at $p < 10^{-2}$ for 5 features; for m5,

there is no consistent trend (Fig. 2-S1). All pairwise comparisons (p-values) between regular and

modular subnetworks are available in Table 2-S1.

Figure 2-5. Subnetworks and genes predict the age of fer-15 worms.

Modular subnetworks are shown in green, regular subnetworks in blue, and gene sets in gray. This figure

shows the best-performing type of modular subnetworks, regular subnetworks, and genes at each

feature level. For modular subnetworks, this is type m3 at every feature level; for regular subnetworks,

type r3 at 5 and 10 features, r2 at 25 features, and r4 at 50 features; for genes, g2 at all feature levels.

Support Vector Regression algorithms using 5, 10, 25, or 50 features were trained to predict age on the

data from Golden et al. [125] and tested on Budovskaya et al. [133]. For each size of feature set, 1000

different Support Vector Regression learners were computed; curves show their median performance

(quantified using the squared correlation coefficient between true and predicted age in the bottom

panel), and error bars indicate the 95% confidence intervals for the medians (calculated using a

bootstrap estimate).
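
One common way to obtain such an interval is the percentile bootstrap, sketched below. The text specifies only that a bootstrap estimate was used, so the percentile method, the number of resamples, and the function name are assumptions.

```python
import random
from statistics import median


def bootstrap_median_ci(scores, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the median of each resample.
    medians = sorted(median(rng.choices(scores, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return medians[lo_idx], medians[hi_idx]


# Hypothetical SCC values from a handful of learners.
scores = [0.82, 0.88, 0.90, 0.91, 0.91, 0.93, 0.95]
low, high = bootstrap_median_ci(scores)
```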

2.3.10 The role of the modularity coefficient in machine learning

Different values of β correspond to giving different proportional weights to the information in

gene expression data and to the prior knowledge about gene connectivity encoded in the

functional interaction network: for noisy microarray studies, or ones with few samples, we might

want to depend more on prior knowledge by choosing a high value for β.

For the Golden et al. dataset [125] that we used for training, we found that a value of β = 100

corresponds roughly to treating class relevance and modularity as equally important in the

expression for subnetwork score: in simulations where we generated subnetworks using either

modularity or class relevance alone as the scoring criterion (i.e. S = M or S = R), the median

modularity of the S = M subnetworks was two orders of magnitude smaller than the median class

relevance of the S = R ones, i.e., ‘good’ values for modularity are roughly 100 times smaller than

‘good’ values for class relevance.
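
This calibration can be written as a back-of-envelope balance condition. The symbols $\tilde{R}_{S=R}$ and $\tilde{M}_{S=M}$ are shorthand introduced here for the two simulated medians; they are not notation used elsewhere in this chapter.

```latex
\beta \, \tilde{M}_{S=M} \approx \tilde{R}_{S=R}
\quad \Longrightarrow \quad
\beta \approx \frac{\tilde{R}_{S=R}}{\tilde{M}_{S=M}} \approx 10^{2}
```

Under this reading, the two terms of S = R + βM contribute on a comparable scale when β is near 100; values well below 100 favour class relevance and values well above it favour modularity.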

As β becomes larger, the proportional contribution of class relevance to the expression for

subnetwork score becomes smaller – and so for large enough values of β, the algorithm will

behave essentially like other purely unsupervised network clustering algorithms that greedily

aggregate nodes around a seed to maximize modularity [120, 136, 137]. In our tests, subnetworks

generated using β = 50, 100, or 250 behaved virtually identically on the learning task; the

performance of β = 500 subnetworks was typically a bit lower; and that of β = 1000 ones lower

still. For large enough values of β, we would expect the typical performance of modular

subnetworks to fall below that of regular subnetworks, because supervised feature selection is

superior to unsupervised feature selection [138].

In the previous two sections, we established that modular subnetworks are more robust across

studies than regular subnetworks and perform better in a worm age prediction task. Modular

subnetworks grown using the coefficient β = 250 showed both the highest robustness across

studies and the best performance on the test set, so we chose to analyze them in greater detail.

For the remainder of the paper, we will explore the relation between these subnetwork

biomarkers (generated from the larger microarray study [125]) and worm aging. The full set of

these subnetworks is available in Table S2.

2.3.11 Modular subnetworks predict wild-type worm age with low mean-squared

error

Here, we show using 5-fold cross-validation that modular subnetworks grown using β = 250 can

predict the age of individual wild-type worms in the original dataset (104 worm microarrays over

7 ages) with low mean-squared error and a high squared correlation coefficient. Again, we used

support vector regression algorithms (SVRs) for all learning tasks.

Because it would be circular to predict age on the same dataset that was used to determine the

features [139], we first divided the wild-type worm aging dataset into 5 stratified folds for

cross-validation. We repeated the search for significant subnetworks 5 times, each time using 4/5 of

the data to select significant subnetworks and train SVRs, and then the remaining 1/5 as a test set

to evaluate the learned feature weights. We compared the performance of modular subnetworks

with that of the top 100 differentially expressed genes reported in [125]. To construct SVRs

using genes as features, we used the same 5 stratified folds – i.e., we used 4/5 of the data to

select the top 100 most significant genes and learn feature weights, and the remaining 1/5 as test

data, and repeated this process for each of the 5 folds. As in the original study [125], for each

fold we selected the top 100 significant genes by performing an F-test and applying a False

Discovery Rate [140] (FDR) correction.
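
The key point of this design, that feature selection is repeated inside every fold using training samples only, can be sketched as follows. The callables `select_features`, `fit`, and `score` are hypothetical placeholders for subnetwork (or gene) selection, SVR training, and evaluation; the ages are invented, and the round-robin stratification is one possible way to balance ages across folds.

```python
import random
from collections import defaultdict


def stratified_folds(ages, n_folds=5, seed=0):
    """Assign sample indices to folds so each age is spread evenly across folds."""
    rng = random.Random(seed)
    by_age = defaultdict(list)
    for idx, age in enumerate(ages):
        by_age[age].append(idx)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for age in sorted(by_age):
        members = by_age[age]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[(offset + pos) % n_folds].append(idx)  # round-robin dealing
        offset += len(members)
    return folds


def cross_validate(ages, select_features, fit, score, n_folds=5):
    """Run feature selection and training on 4/5 of the data, test on 1/5."""
    results = []
    for held_out in stratified_folds(ages, n_folds):
        test_idx = set(held_out)
        train_idx = [i for i in range(len(ages)) if i not in test_idx]
        features = select_features(train_idx)  # sees training samples only
        model = fit(features, train_idx)
        results.append(score(model, features, held_out))
    return results


# Hypothetical data set: four ages with ten samples each.
ages = [age for age in (2, 5, 8, 11) for _ in range(10)]
folds = stratified_folds(ages)

# Toy callables that count how many held-out samples leaked into selection.
leaks = cross_validate(
    ages,
    select_features=lambda train: set(train),
    fit=lambda features, train: None,
    score=lambda model, features, held_out: len(features & set(held_out)),
)
```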

For four different sizes of feature set (5, 10, 25 or 50), we generated 1000 different SVRs using

either modular subnetworks or genes as features to capture their typical performance. All

p-values reported here were computed using the Wilcoxon rank-sum test.

At every size of feature set (5, 10, 25 or 50), modular subnetworks significantly outperform

differentially expressed genes ($p < 10^{-28}$) according to the metrics of mean-squared error (MSE)

and squared correlation coefficient (SCC) between predicted age and true age. For example,

using feature sets of size 50, we obtained a median MSE of 7.9 for subnetworks vs. 11.2 for

genes ($p < 10^{-98}$), and a median SCC of 0.77 for subnetworks vs. 0.69 for genes ($p < 10^{-65}$).

Fig. 2-6A shows the median performance of modular subnetworks and genes across all tests, and Fig.

2-6B shows the predictions of a typical SVR learner built using 50 modular subnetworks as

features. At every size of feature set, the MSE for genes was at least 1.76 higher than the

corresponding MSE for subnetworks (i.e., at least 22% higher than the corresponding MSE for

subnetworks) ($p < 10^{-28}$), and the SCC for subnetworks was at least 0.05 higher ($p < 10^{-28}$).

Over all tests, the modular SVRs with 50 features achieved the best performance: a median SCC

of 0.77 and a median MSE of 7.9. This SCC is substantially lower than the highest one achieved

on the test set of pooled fer-15 worms in the last section (0.91) because predicting the age of an

individual worm is more difficult than predicting the age of a large pooled group of age-matched

worms (pooling removes individual variability).

Figure 2-6. Modular subnetwork biomarkers of aging predict the age of individual wild-type

worms.

A. Machine learners built from modular subnetworks or genes, predicting worm age in a

cross-validation task on the data from Golden et al. using 5, 10, 25, or 50 features. For each size of feature set,

1000 different Support Vector Regression learners were computed; curves show their median

performance (quantified using mean-squared error in the top panel, and the squared correlation

coefficient between true and predicted age in the bottom panel), and error bars indicate the 95%

confidence intervals for the medians (calculated using a bootstrap estimate). B. The performance of a

typical Support Vector Regression learner built using 50 modular subnetworks as features; true worm

age is shown on the x-axis, and predicted age on the y-axis.

2.3.12 Longevity genes play crucial roles in significant subnetworks

For these analyses, we compiled two sets of known longevity genes (see Materials and Methods,

Table S3): L1, a set of 233 genes that extend lifespan when perturbed, and L2, a larger set of 494

genes that either shorten or extend lifespan when perturbed.

2.3.13 Significant subnetworks are enriched for known longevity genes

We found that significant subnetworks derived using both C. elegans aging microarray studies

[125, 133] were significantly enriched for both sets of longevity genes, relative to the

background set of 12808 genes represented in the functional interaction network. All p-values

reported here were calculated using the hypergeometric test. For the Golden et al. [125] data, of

the 1957 genes that play a role in significant subnetworks, 65 are in L1 ($p < 10^{-6}$) and 124 are in

L2 ($p < 10^{-8}$), and of the 535 seed genes that produce significant subnetworks, 27 are in L1

($p < 10^{-5}$) and 45 are in L2 ($p < 10^{-6}$). For the Budovskaya et al. [133] study, subnetwork seeds were

highly enriched for known longevity genes, and the set of all subnetwork genes was slightly

enriched for them. Of the 1559 seed genes of significant subnetworks, 43 are in L1 (p = 0.003)

and 90 are in L2 ($p < 10^{-4}$), and of the 4158 genes represented in some subnetwork, 88 are in L1

(p = 0.048) and 181 are in L2 (p = 0.025).
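
As a rough sense of scale for the Golden et al. enrichment, the expected number of longevity genes among the 1957 subnetwork genes can be estimated. The calculation assumes, for illustration only, that every gene in L1 and L2 is represented in the 12808-gene background.

```latex
\begin{align*}
E[\text{L1 genes}] &\approx 1957 \times \frac{233}{12808} \approx 35.6
  && \text{(observed: 65, about 1.8-fold)} \\
E[\text{L2 genes}] &\approx 1957 \times \frac{494}{12808} \approx 75.5
  && \text{(observed: 124, about 1.6-fold)}
\end{align*}
```

If some longevity genes are absent from the network, the expected counts would be lower and the fold enrichment correspondingly higher.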

2.3.14 Examples of significant subnetworks containing known longevity genes

While HTP experimental methods have helped to identify hundreds of worm longevity genes

[64], their aging-related functions remain poorly understood. We found that subnetwork

biomarkers are highly enriched for longevity genes. Thus, subnetworks can provide a molecular

context for these genes in aging: they can be applied to uncover new connections between

different longevity genes, or to assign putative aging-related functions to them.

In Figure 2-7, we show several representative examples of significant subnetworks

derived from the Golden et al. [125] data that involve multiple known longevity genes. The

complete list is given in Table S2; individual NAViGaTOR XML [141] and PSI-MI XML [142]

files for each subnetwork are available from the website

http://www.cs.utoronto.ca/~juris/data/GB10/. Subnetwork A involves longevity genes vit-2

and vit-5. B has known longevity genes age-1, daf-18, and vit-2; previous work has uncovered

that a mutation in daf-18 will suppress the lifespan-extending effect of an age-1 mutation [35]. C

contains longevity genes rps-3 and skr-1, which are involved in protein anabolic and catabolic

processes, respectively. Subnetwork D contains longevity genes unc-60 and tag-300, which are

both involved in locomotion. E contains longevity genes fat-7 and elo-5, which are involved in

fatty acid desaturation and elongation. Subnetwork F has longevity genes rps-22 and rha-2, and

G has longevity genes blmp-1, his-71, and Y42G9A.4. Blmp-1 and his-71 are both involved in

DNA binding.

Figure 2-7. Some examples of significant longevity subnetworks.

Examples of significant modular subnetworks from Golden et al. [125] containing multiple known

longevity genes (from L2, see Materials and Methods). Edge width is proportional to gene-gene

co-expression, node size is proportional to the Spearman correlation between gene expression and age,

and known longevity genes are indicated by green circles.

2.3.15 Modular subnetworks participate in many different age-related biological

processes

Aging is highly stochastic and affects many distinct biochemical pathways. We analyzed the

union of all genes in significant modular subnetworks using biological process categories from

the Gene Ontology [74] (GO) and pathways from the Kyoto Encyclopaedia of Genes and

Genomes [143] (KEGG) databases to determine their relation to known mechanisms of aging.

Full results are given in Tables 2-1 and 2-2; all functions and pathways shown in the table and

discussed below are significant at p < 0.05 after an FDR correction.
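
For readers unfamiliar with FDR correction, the sketch below shows the widely used Benjamini-Hochberg step-up adjustment. The text states only that an FDR correction was applied, so treating it as Benjamini-Hochberg is an assumption, and the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min  # enforce monotonicity of adjusted values
    return adjusted


# Hypothetical enrichment p-values for four categories.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adjusted]
```

In this toy example only the first category remains significant at the 0.05 level after adjustment, although three of the four raw p-values fall below 0.05.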

In total, we identified 27 KEGG pathways and 37 non-redundant GO biological processes

(see Materials and Methods) that were significantly enriched for subnetwork genes. To test

whether these pathways and processes were also related to aging, we calculated the significance

of their overlap with the set of experimentally determined longevity genes (Table S3). We found

that roughly one third of the GO biological processes (12 of 37) and KEGG pathways (10 of 27)

associated with subnetworks were significantly enriched for longevity genes (p < 0.05).

Aging-associated GO categories enriched for subnetwork genes include ‘locomotory behaviour,’ which

has recently been proposed as a biomarker of physiological aging [125], and ‘determination of

adult life span’; KEGG pathways include ‘cell cycle’ and several metabolic pathways (including

‘citrate cycle,’ ‘glycolysis’).

Table 2-1. Gene Ontology biological process categories enriched in the set of genes

represented in modular subnetworks.

All categories shown are significant at p < 0.05 after an FDR correction for multiple testing. GO

categories written in italics are also enriched for known longevity genes (Table S3).

Gene Ontology biological process P-Value

Translation 6.45E-17

Hermaphrodite genitalia development 1.20E-16

Embryonic cleavage 1.37E-15

Germline cell cycle switching, mitotic to meiotic cell cycle 8.32E-14

Locomotory behaviour 1.84E-13

Meiosis 1.10E-11

Positive regulation of multicellular organism growth 4.25E-11

Morphogenesis of an epithelium 3.85E-06

Protein catabolic process 1.13E-05

Phosphate transport 4.99E-04

Negative regulation of multicellular organism growth 8.07E-04

Ubiquitin-dependent protein catabolic process 1.94E-03

Nucleosome assembly 1.97E-03

Establishment of nucleus localization 2.37E-03

Tricarboxylic acid cycle 3.26E-03

DNA replication 4.64E-03

Protein transport 5.01E-03

Energy coupled proton transport, against electrochemical gradient 5.02E-03

Leucyl-tRNA aminoacylation 5.02E-03

Collagen and cuticulin-based cuticle development 5.12E-03

Organelle organization and biogenesis 5.19E-03

Chromosome segregation 7.48E-03

mRNA metabolic process 8.44E-03

Protein import into nucleus 1.15E-02

Purine base biosynthetic process 1.15E-02

Sulfur compound biosynthetic process 1.40E-02

DNA repair 1.45E-02

Determination of adult life span 1.74E-02

Threonine metabolic process 1.75E-02

Water-soluble vitamin biosynthetic process 1.78E-02

ATP synthesis coupled proton transport 3.14E-02

rRNA processing 3.85E-02

Isoleucyl-tRNA aminoacylation 4.02E-02

Methionyl-tRNA aminoacylation 4.02E-02

Valyl-tRNA aminoacylation 4.02E-02

Embryonic pattern specification 4.04E-02

Regulation of cell cycle 4.04E-02

Table 2-2. KEGG pathways enriched in the set of genes represented in modular

subnetworks.

All categories shown are significant at p < 0.05 after an FDR correction for multiple testing. KEGG

pathways written in italics are also enriched for known longevity genes (Table S3).

KEGG Pathway P-Value

Ribosome 2.17E-27

Metabolic pathways 2.70E-15

Proteasome 2.33E-10

Pyrimidine metabolism 1.34E-09

Purine metabolism 7.08E-07

DNA replication 1.54E-06

Nucleotide excision repair 1.81E-05

Aminoacyl-tRNA biosynthesis 2.80E-05

Cell cycle 4.37E-05

Glutamate metabolism 1.54E-04

Glycolysis / Gluconeogenesis 2.97E-04

Citrate cycle (TCA cycle) 5.41E-04

Methionine metabolism 1.25E-03

Ubiquitin mediated proteolysis 7.19E-03

Pyruvate metabolism 7.27E-03

Base excision repair 7.38E-03

Glyoxylate and dicarboxylate metabolism 7.39E-03

Arginine and proline metabolism 8.35E-03

Glycine, serine and threonine metabolism 8.38E-03

Pentose phosphate pathway 1.23E-02

Valine, leucine and isoleucine biosynthesis 1.30E-02

One carbon pool by folate 1.30E-02

RNA polymerase 1.76E-02

Alanine and aspartate metabolism 1.76E-02

Non-homologous end-joining 2.15E-02

Selenoamino acid metabolism 2.17E-02

Mismatch repair 2.20E-02

2.3.16 Modular subnetworks can be used to annotate longevity genes with novel

functions

An important advantage of subnetwork over single-gene biomarkers is that they can be applied to

infer novel functions for subnetwork members [119]. Most worm longevity genes were identified

in HTP RNA interference screens, and thus many remain poorly characterized. And though

several longevity genes do have some previously known functions, their aging-related function is

still unknown.

We used modular subnetworks (derived from the expression data in [125]) to assign putative

functions in aging to known longevity genes by annotating them with the Gene Ontology (GO)

Biological Process categories that their associated subnetworks were significantly enriched for.

In total, we provided 49 longevity genes with novel annotations; nine of these genes had no

previous Gene Ontology biological process annotations (apart from those electronically inferred)

or well-characterized orthologs (named NCBI KOGs [144]). The most significant novel

annotation for each longevity gene is given in Table 2-3, as an example of our approach (poorly

characterized genes are indicated with an asterisk). The full list of all longevity gene GO

categories inferred by subnetwork annotations is available in Table S4, and on the website

http://www.cs.utoronto.ca/~juris/data/GB10/. All GO categories in the tables are significant

with p < 0.05 (after an FDR correction), and annotated to at least 25% of subnetwork genes.

Table 2-3. Assigning putative functions to longevity genes.

The first column lists longevity genes, column 2 shows the most highly enriched Gene Ontology

biological process in subnetworks containing that gene, and the p-value of the enrichment

(hypergeometric test with FDR correction) is shown in column 3. Genes with no previously known

manual GO BP annotation are indicated with an asterisk.

Gene        GO biological process                                    P-value
rpl-4       cellular macromolecular complex assembly                 2.16E-02
vit-5       phosphate transport                                      3.70E-05
rha-2       cellular macromolecular complex assembly                 2.16E-02
C06E7.1     protein complex assembly                                 2.26E-02
C25H3.6*    transcription from RNA polymerase II promoter            4.87E-02
pat-4       chromatin assembly or disassembly                        4.92E-03
C33H5.18    chromatin assembly or disassembly                        3.02E-03
unc-60      protein complex assembly                                 2.26E-02
vit-2       phosphate transport                                      3.70E-05
ril-1*      cell adhesion                                            3.57E-02
CD4.4*      ribosome biogenesis                                      1.85E-02
eif-3.F     organelle organization and biogenesis                    3.75E-03
F09F7.5*    pigment metabolic process                                5.01E-03
pab-2       chromatin assembly or disassembly                        8.99E-05
hpk-1       growth                                                   2.78E-02
mdh-1       lipid metabolic process                                  3.36E-02
blmp-1      chromatin assembly                                       7.22E-04
daf-3       protein complex assembly                                 2.26E-02
F28B3.5*    amine metabolic process                                  3.04E-03
rps-23      tRNA aminoacylation for protein translation              1.04E-03
F30A10.10   chromatin assembly or disassembly                        4.95E-02
dlk-1       transcription from RNA polymerase II promoter            4.87E-02
F40F8.5*    nucleobase metabolic process                             5.08E-05
elo-5       lipid metabolic process                                  4.34E-02
F43G9.3     water-soluble vitamin metabolic process                  2.04E-03
ife-1       organelle organization and biogenesis                    3.75E-03
spt-4       chromatin assembly or disassembly                        8.40E-05
aakb-1      nucleobase, nucleoside and nucleotide metabolic process  1.45E-03
dod-22*     gene expression                                          1.85E-02
F57B9.3     amine metabolic process                                  2.83E-02
cdc-25.1    amine metabolic process                                  1.90E-02
nac-3       cellular macromolecular complex assembly                 2.16E-02
lin-23      cytoskeleton organization and biogenesis                 2.59E-02
K10D2.2     anion transport                                          5.54E-04
ifg-1       organelle organization and biogenesis                    3.75E-03
sir-2.1     lipid transport                                          2.44E-04
wip-1*      chromatin assembly or disassembly                        1.99E-02
skn-1       chromatin assembly or disassembly                        3.56E-04
vha-6       regulation of metabolic process                          3.84E-02
W01B11.3    establishment of protein localization                    1.93E-04
W06B11.3*   fatty acid metabolic process                             6.78E-03
rpl-30      chromatin assembly or disassembly                        3.02E-03
tag-300     cytoskeleton organization and biogenesis                 2.59E-02
Y42G9A.4    chromatin assembly or disassembly                        3.32E-02
gdi-1       secondary metabolic process                              1.98E-02
spl-1       sulfur metabolic process                                 2.33E-02
pod-1       intracellular protein transport                          2.04E-02
lrs-2       intracellular protein transport                          2.04E-02
let-60      nucleotide-excision repair                               1.11E-02

2.4 Conclusions

Aging results not from individual genes acting in isolation from one another, but from the combined

activity of sets of associated genes representing a multiplicity of different biological pathways.

For the most part, the organization and function of these aging-related pathways remain poorly

understood. In particular, the role of most longevity genes in aging is still unknown.

In this work, we showed that high-throughput information about which genes are

likely associated with which other genes – in the form of a functional interaction network – can

yield new insights into the transcriptional programs of aging. We identified modular

subnetworks associated with worm aging – highly interconnected groups of genes that change

activity with age – and showed that they are effective biomarkers for predicting worm age on the

basis of gene expression. In particular, they outperform biomarkers of aging based on the activity

of single genes or regular subnetworks. Furthermore, we found that modular subnetwork

biomarkers were significantly enriched for known longevity genes. Thus, modular subnetwork

biomarkers can provide a molecular context for each longevity gene in aging – in effect, each

longevity subnetwork constitutes a biological hypothesis as to which genes interact with known

longevity genes in some common age-related function.

This work is the first to use a new subnetwork performance criterion that incorporates modularity

into the expression for subnetwork score, and the first to integrate network information with gene

expression data to identify biomarkers of aging. The subnetwork biomarkers identified by our


method are highly conserved across studies, and this opens the door to studying longevity genes

– or indeed, any age-related gene set of interest – over a range of different health and disease

conditions. In particular, we are interested in investigating the different subnetworks associated

with longevity genes in diseases like cancer, and in aging across species.

2.5 Materials and Methods

2.5.1 Code

Code for most simulations was written in Matlab R2008b and is available on the website,

http://www.cs.utoronto.ca/~juris/data/GB10/. For support vector regression experiments, we

used the Matlab wrapper to LIBSVM [145]. We analyzed gene sets for enriched gene ontology

using the topGO package (ver. 1.10.1, [146]) in R 2.8.0. Subnetworks were visualized using

NAViGaTOR ver. 2.1.7 (http://ophid.utoronto.ca/navigator; [141]).

2.5.2 Data sets

Microarray experiments. Aging expression datasets from two recent studies were downloaded

from GEO [147]. From Golden et al. [125], we obtained data for 104 microarrays of individual

wild-type (N2) worms over 7 ages (9-17 microarrays per age). From Budovskaya et al. [133],

we obtained 16 microarrays of pooled sterile (fer-15) worms over 4 ages (4 microarrays per age).

For both studies, we discarded probesets containing more than 30% missing values for some age

group.
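The missing-value filter can be illustrated with a short sketch. This is a Python illustration rather than the Matlab code used for the thesis; the function name, the use of `None` for a missing value, and the toy data are assumptions made for the example.

```python
def keep_probeset(values, ages, max_missing=0.30):
    """Return True if no age group has more than max_missing of its values missing.

    values: expression values for one probeset, one per array (None = missing).
    ages:   age label of each array, in the same order as values.
    """
    by_age = {}
    for value, age in zip(values, ages):
        by_age.setdefault(age, []).append(value)
    for group in by_age.values():
        missing = sum(1 for v in group if v is None)
        if missing / len(group) > max_missing:
            return False
    return True


# Toy data (hypothetical): 2 ages, 4 arrays per age.
ages = [1, 1, 1, 1, 2, 2, 2, 2]
probe_ok = [0.5, 0.7, None, 0.6, 0.9, 1.1, 1.0, 1.2]    # 25% missing at age 1
probe_bad = [0.5, None, None, 0.6, 0.9, 1.1, 1.0, 1.2]  # 50% missing at age 1
```

A probeset is discarded as soon as a single age group exceeds the threshold, which matches the "for some age group" wording above.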


Interaction network. Functional interactions for C. elegans ORFs were downloaded from

WormNet [148]. The network used in our analyses consists of the largest connected component

of the network formed from all WormNet ORFs represented by some probeset in two separate

worm aging microarray studies [125, 133], and represents 12808 distinct C. elegans ORFs and

275525 interactions.
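As an illustration of how such a network can be derived, the sketch below restricts an edge list to genes covered by both array platforms and extracts the largest connected component with a breadth-first search. It is a minimal Python sketch with hypothetical gene names, not the code used to build the network described above.

```python
from collections import deque


def largest_connected_component(edges, allowed_nodes):
    """Largest connected component of the graph induced on allowed_nodes."""
    adjacency = {}
    for a, b in edges:
        if a in allowed_nodes and b in allowed_nodes:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy data (hypothetical): gene "x" has no probeset and is excluded.
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("c", "x")]
on_both_arrays = {"a", "b", "c", "d", "e"}
lcc = largest_connected_component(edges, on_both_arrays)
```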

Longevity genes. We obtained L1, our high confidence set of genes that extend lifespan when

perturbed or knocked out, from the recent list compiled in [149]. In total, 233 genetic

perturbations that extend lifespan belonged to the largest connected component of WormNet

made up of genes covered by both expression studies. We constructed L2, our larger set of

longevity genes, by taking the union of L1 and the set of mutations that affect worm lifespan

downloaded from the GenAge database [64]. This yielded 494 genes that either shorten or extend

lifespan when perturbed (and are annotated to the network we use). Both gene lists are available

in Supplementary Table S3.

2.5.3 Subnetwork analyses

Subnetwork search parameters

Seed genes: Previous methods [52, 95] seed the subnetwork search process at a random subset of

genes on the network; a problem with this approach is that different choices of seed genes might

yield substantially different significant subnetworks. To avoid this bias, we grew subnetworks

seeded from every node of the interaction network. For all machine learning tests, the total set of

significant subnetworks was reduced to a non-redundant set, i.e. if two significant subnetworks


shared more than 25% overlap (as measured with the Jaccard index), the lower-scoring

subnetwork was deleted from the set of candidate features.
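The redundancy filter can be sketched as follows. The text specifies only that the lower-scoring member of an overlapping pair is removed; processing subnetworks greedily in decreasing score order is an assumption of this Python sketch, as are the function names and toy gene sets.

```python
def jaccard(a, b):
    """Jaccard index of two non-empty gene sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def non_redundant(subnetworks, max_overlap=0.25):
    """Keep subnetworks in decreasing score order, dropping any that overlap a kept one.

    subnetworks: list of (score, set_of_genes).
    """
    kept = []
    for score, genes in sorted(subnetworks, key=lambda s: s[0], reverse=True):
        if all(jaccard(genes, kept_genes) <= max_overlap for _, kept_genes in kept):
            kept.append((score, genes))
    return kept


# Toy data (hypothetical): the first two subnetworks share 2 of 4 genes (Jaccard 0.5).
candidates = [
    (0.9, {"g1", "g2", "g3"}),
    (0.8, {"g2", "g3", "g4"}),
    (0.7, {"g5", "g6"}),
]
features = non_redundant(candidates)
```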

Stopping criteria: For modular subnetworks grown iteratively out from a seed node, the search

was halted when there were no nodes that would increase both subnetwork modularity and class

relevance. For regular subnetworks, the search was halted when there were no nodes that would

increase the subnetwork score (class relevance) past some threshold r (r = 0.01, 0.02, 0.05, 0.1

and 0.2 for regular subnetworks r1-r5), or when there were no remaining local nodes (i.e., nodes

at most two edges away from the seed).

Identifying significant subnetworks

We calculate subnetwork significance using both self-contained and competitive gene set tests

[73]. Our competitive test is identical to that used in [52], and our self-contained test is more

stringent – we use the method suggested in [95].

For the self-contained test, we randomized the assignment of ages to worms (samples), and then

repeated the search for subnetworks starting from each network node. The subnetwork score of

the original subnetwork determined from the true data was then ranked against the corresponding

subnetworks determined from the artificial data that seeded from the same gene. This process

was repeated 1000 times.

For the competitive test, we generated 100 artificial interactomes by randomizing the assignment

of gene names to nodes on the functional interaction network and recalculating the weight for

each network edge based on the new genes that flanked it (only for modular networks – regular

networks do not use edge information). We repeated the search for significant subnetworks on


each artificial interactome. Scores for subnetworks determined from the true interactome were

ranked against the scores of all subnetworks generated from the artificial interactomes.

Subnetworks were considered significant if they achieved p < 0.001 on the local self-contained

test and p < 0.05 on the global competitive test.
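The two-stage significance decision can be sketched as below. This Python sketch assumes that "ranked against" means the empirical p-value is the fraction of null scores at least as large as the observed score (no pseudo-count is added); the function names and toy null distributions are hypothetical.

```python
def empirical_p(observed, null_scores):
    """Fraction of null scores that are at least as large as the observed score."""
    hits = sum(1 for s in null_scores if s >= observed)
    return hits / len(null_scores)


def is_significant(score, same_seed_null, interactome_null,
                   local_alpha=0.001, global_alpha=0.05):
    """Apply the self-contained (local) and competitive (global) tests in turn.

    same_seed_null:   scores of subnetworks grown from the same seed on label-permuted data.
    interactome_null: scores of all subnetworks found on randomized interactomes.
    """
    return (empirical_p(score, same_seed_null) < local_alpha
            and empirical_p(score, interactome_null) < global_alpha)


# Toy null distributions (hypothetical): scores between 0 and 9.
local_null = [x % 10 for x in range(1000)]
global_null = [x % 10 for x in range(100)]
```

With 1000 label permutations, the local threshold p < 0.001 is met only when no permuted subnetwork scores at least as high as the real one.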

2.5.4 Machine learning comparisons

We used ε-insensitive support vector regression (SVR) algorithms [150] to learn worm age as a

function of the activity of regular subnetworks, modular subnetworks or differentially expressed

genes. All SVRs were trained using a linear kernel and the default parameters provided by

LIBSVM [145]. For SVR features made up of subnetworks, subnetwork activity for a sample

was calculated as the mean activity of all the genes in the subnetwork.
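The feature construction is simple enough to state directly in code. The sketch below computes per-sample subnetwork activity as the mean over member genes; the data layout (a dictionary of per-gene expression vectors) and the names are assumptions, and the resulting vector would then be passed to the SVR as one feature.

```python
def subnetwork_activity(expression, subnetwork_genes):
    """Mean expression of the subnetwork's genes in each sample.

    expression: dict mapping gene -> list of values, one per sample.
    Genes absent from the expression data are ignored.
    """
    genes = [g for g in subnetwork_genes if g in expression]
    n_samples = len(expression[genes[0]])
    return [sum(expression[g][j] for g in genes) / len(genes)
            for j in range(n_samples)]


# Toy data (hypothetical): three genes measured in two samples.
expression = {"g1": [1.0, 2.0], "g2": [3.0, 4.0], "g3": [5.0, 9.0]}
activity = subnetwork_activity(expression, {"g1", "g2"})
```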

2.5.5 GO and KEGG enrichment analyses

The union of all genes present in some significant modular subnetwork (β = 250; derived using

data from [125]) was compared with the background network, i.e. the set of 12808 genes present

in the largest connected component of the network formed from all WormNet ORFs represented

by some probeset in both microarray studies [125, 133].

Because the Gene Ontology tree is highly redundant, we used the ‘elim’ method [146]

to determine the most specific significant biological process categories (i.e., those at the deepest

level of the tree), and then controlled for multiple testing using an FDR [140] cut-off of 0.05. For


KEGG, we calculated an enrichment p-value for each term using the hypergeometric test, and

again controlled for multiple testing using an FDR cut-off of 0.05.
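For readers unfamiliar with the test, the KEGG enrichment calculation can be sketched as a hypergeometric upper-tail probability followed by a false discovery rate adjustment. The analyses above were run in R; this Python sketch is illustrative only, and it assumes the Benjamini-Hochberg procedure for the FDR step.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of at least k annotated genes among n drawn from N,
    when K of the N background genes carry the annotation."""
    upper = min(n, K)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example (hypothetical numbers): both of 2 selected genes are annotated,
# out of a background of 4 genes of which 2 are annotated.
p = hypergeom_pvalue(k=2, n=2, K=2, N=4)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

Terms whose adjusted value is at most 0.05 would be reported as enriched.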

2.6 Abbreviations

SVR: Support Vector Regression; SCC: Squared correlation coefficient; MSE: Mean-squared

error.

2.7 Supplementary Materials

Supplementary figures and tables can be accessed online at

http://genomebiology.com/2010/11/2/R13.

2.8 Acknowledgments

This work was supported by Genome Canada via the Ontario Genomics Institute, the Canada

Foundation for Innovation (Grant Nos. 12301 and 203383), and IBM to IJ. We thank K. Brown

and D. Tweed for their helpful comments.


3 In silico drug screen in mouse liver identifies candidate calorie restriction mimetics

This chapter is based on:

Kristen Fortney, Eric Morgen, Max Kotlyar, and Igor Jurisica (2012). In silico drug screen in

mouse liver identifies candidate calorie restriction mimetics. Rejuvenation Research 15(2).

3.1 Abstract

Calorie restriction (CR) extends lifespan in mammals and delays the onset of age-related

diseases, including cancer and diabetes. Drugs that target the same genes and pathways as CR

may have enormous therapeutic potential. Recently, genome-scale data on the responses of

human cell lines to over 1000 drug treatments have become available. Here we integrate these

data with gene expression signatures of CR in mouse liver to generate a prioritized list of

candidate CR mimetics. We identify 14 drugs that reproduce the effects of CR at the

transcriptional level.


3.2 Introduction

Direct screens for testing the effect of drugs on mouse lifespan – in which mice are treated with a

drug, tracked for several years, and their lifespan curve compared with that of control animals –

are time-consuming, expensive, and methodologically challenging [1]. As a consequence, few

drugs have been screened in this way. For example, only 17 compounds have been (or are being)

evaluated as part of the National Institute on Aging’s Intervention Testing Program [2].

Because of these limitations, there is a pressing need for faster and higher-throughput surrogate

assays to help identify and prioritize new drug candidates for healthspan extension. A promising

alternative to direct screening of lifespan is expression-based screening for calorie restriction

(CR) mimetics [3-6]. CR is one of the most reproducible and effective lifespan interventions; it

extends lifespan in model organisms from yeast to mammals, and delays the age of onset for

many diseases of aging, including cancer and diabetes [7]. Thus, drugs that mimic CR at the

transcriptional level may be of great therapeutic value [8]. In an expression-based screen, mice

are treated with drug, tissue samples are taken, and gene expression changes measured using

microarrays and compared to those induced by CR. Recent screens found that the drugs

metformin and resveratrol reproduce many gene changes seen with CR [4,9,10].

In this work, we develop an in silico version of expression-based drug screening for CR

mimetics that allows us to test hundreds of drugs at once. We collect nine previously published

transcriptional signatures of CR in mouse liver and screen each one against the Connectivity

Map [11], a public resource containing genome-scale data on the responses of human cell lines to

over 1000 drug treatments. We then conduct a meta-analysis and identify 14 drugs that

consistently rank among the top CR mimics across multiple studies.


3.3 Materials and Methods

3.3.1 Code

Code for all analyses was written in R 2.13.0. Several Bioconductor 2.8 packages were used; we

normalized raw Affymetrix CEL files with affy [151], used limma [152] to identify differentially

expressed probesets, and converted mouse IDs (array-specific, GenBank, or MGI) to human

HG-U133A probeset IDs for Connectivity Map analysis using annotationTools [153]. The drug-drug

network was visualized using NAViGaTOR 2.2.1 [141].

3.3.2 Drug-drug interaction network

We downloaded the DN drug-drug interaction network – where two drugs share an edge if they

share a common mode of action – from MANTRA [154].

3.3.3 Acquiring transcriptional signatures of calorie restriction

We collected gene signatures of CR from published studies [5, 155-161]. Our analysis pipeline

differed depending on whether the source publications made their raw data available, as

described below.

Identifying differentially expressed genes from raw microarray data. For publications where

Affymetrix CEL files were available [155-157], we re-analyzed the raw data to derive lists of

genes significantly up- or down-regulated following CR. We normalized CEL files using

the RMA method [162] implemented in affy [151], and identified differentially expressed probe

sets using the empirical Bayes method in limma [152].

Curating differentially expressed gene lists from published papers. For publications where no

raw Affymetrix data were available [5, 158-161], we downloaded lists of genes reported by the


authors to be differentially expressed (in paper text, tables, or supplementary materials). Where

fold change and FDR-corrected p-values were available, we used these data to filter gene lists.

For both types of signature, we removed genes with FDR values greater than 0.05 and (positive

or negative) fold change less than 1.25, sorted the remaining genes by FDR, and retained only

the top 250 up-regulated and the top 250 down-regulated genes for further analysis.
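The filtering step can be sketched as follows. This Python sketch makes two assumptions that the text does not state explicitly: fold changes are signed with magnitude of at least 1 (for example, -1.5 for a 1.5-fold decrease), and a gene is retained only if it passes both the FDR and the fold-change threshold. The function name and toy records are hypothetical.

```python
def build_signature(genes, max_fdr=0.05, min_fold=1.25, top_n=250):
    """Return (up, down) gene lists for one CR signature.

    genes: list of (gene_id, signed_fold_change, fdr).
    """
    passing = [g for g in genes if g[2] <= max_fdr and abs(g[1]) >= min_fold]
    passing.sort(key=lambda g: g[2])  # most significant first
    up = [g[0] for g in passing if g[1] > 0][:top_n]
    down = [g[0] for g in passing if g[1] < 0][:top_n]
    return up, down


# Toy data (hypothetical gene identifiers and values).
records = [
    ("A", 2.0, 0.001),
    ("B", -1.5, 0.01),
    ("C", 1.1, 0.001),   # fails the fold-change threshold
    ("D", -3.0, 0.2),    # fails the FDR threshold
    ("E", 1.3, 0.0005),
]
up, down = build_signature(records)
```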

3.3.4 Connectivity map analysis of CR signatures

Mapping mouse CR signatures to human probeset IDs. We mapped mouse gene IDs to human

Affymetrix HG-U133A IDs for connectivity map analysis following previously established

protocols [75]. First, mouse CR signature genes were mapped to Entrez Gene IDs using either

the org.Mm.eg.db Bioconductor 2.8 library (for GenBank or MGI identifiers) or Affymetrix

annotation files Mouse430_2.na31.annot.csv or MG_U74Av2.na32.annot.csv (for probeset IDs).

Mouse Entrez Gene IDs were then mapped to human Entrez Gene IDs using homologene.data

(release 65; www.ncbi.nlm.nih.gov/homologene), and finally to HGU133A IDs using the

Affymetrix annotation file HG-U133A.na31.annot.csv.
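The three mapping steps compose into a single lookup chain, sketched below in Python. The analyses above used R and Bioconductor annotation resources; here each resource is represented by a plain dictionary, and all identifiers are hypothetical placeholders rather than real probeset or Entrez Gene IDs.

```python
def map_signature(mouse_ids, to_mouse_entrez, mouse_to_human, human_to_probes):
    """Map mouse identifiers to human probeset IDs through Entrez Gene and orthology.

    Each mapping is a dict from one identifier to a list of identifiers,
    because every step can be one-to-many. Unmapped identifiers are dropped.
    """
    probes = set()
    for mouse_id in mouse_ids:
        for mouse_gene in to_mouse_entrez.get(mouse_id, []):
            for human_gene in mouse_to_human.get(mouse_gene, []):
                probes.update(human_to_probes.get(human_gene, []))
    return sorted(probes)


# Toy mappings (all identifiers are hypothetical).
to_mouse_entrez = {"mouse_probe_1": ["m100"], "mouse_probe_2": ["m101"]}
mouse_to_human = {"m100": ["h200"]}                       # m101 has no ortholog
human_to_probes = {"h200": ["hs_probe_a", "hs_probe_b"]}
human_probes = map_signature(
    ["mouse_probe_1", "mouse_probe_2", "mouse_probe_3"],
    to_mouse_entrez, mouse_to_human, human_to_probes,
)
```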

Acquiring drug connectivity scores for CR signatures. For each CR signature, mean connectivity

scores for 1309 drugs were calculated as previously described [75] using data on 6100 drug

treatments downloaded from Connectivity Map build 02 at

http://www.broadinstitute.org/cmap/. Drug mean connectivity scores were then converted to

ranks. The connectivity score quantifies the extent to which a drug treatment mimics the query

signature and is based on the Kolmogorov-Smirnov statistic.
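The conversion from per-treatment connectivity scores to a ranked list of drugs can be sketched as below. The sketch takes the connectivity scores as given (it does not implement the Kolmogorov-Smirnov-based score itself); the function name, the tie handling (input order), and the toy scores are assumptions.

```python
def rank_drugs(instance_scores):
    """Rank drugs by mean connectivity score (rank 1 = highest mean).

    instance_scores: list of (drug, connectivity_score), one per treatment instance.
    """
    totals = {}
    for drug, score in instance_scores:
        total, count = totals.get(drug, (0.0, 0))
        totals[drug] = (total + score, count + 1)
    means = {drug: total / count for drug, (total, count) in totals.items()}
    ordered = sorted(means, key=lambda d: means[d], reverse=True)
    return {drug: i + 1 for i, drug in enumerate(ordered)}


# Toy data (hypothetical): three drugs, five treatment instances.
instances = [("drugA", 0.8), ("drugA", 0.6), ("drugB", -0.2),
             ("drugC", 0.9), ("drugC", 0.1)]
ranks = rank_drugs(instances)
```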


3.3.5 Meta-analysis of drug-response data

Combining ranked lists of drugs to identify CR mimetics. Working from the ranked connectivity scores, we

adapted the Rank Product method [67] to identify drugs that consistently mimic CR at the

transcriptional level. For each drug, we calculated the product of its ranks in all CR signatures.

Computing p-values. We randomly permuted the assignment of connectivity scores to drugs for

the 6100 instances (drug treatments), recalculated mean connectivity scores and drug ranks for

1309 drugs in each signature, and re-calculated randomized rank products 10000 times to

estimate p-values and false discovery rates.
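A compact sketch of the rank product and its permutation null is given below. It is a simplified Python illustration, not the R code used for the analysis: it assumes that a drug's p-value is the fraction of permutations in which its own randomized rank product is at most the observed one, and it omits the false discovery rate step. All names and toy scores are hypothetical.

```python
import random
from math import prod


def mean_score_ranks(drugs, scores):
    """Rank drugs (1 = best) by mean score over their treatment instances."""
    sums, counts = {}, {}
    for drug, score in zip(drugs, scores):
        sums[drug] = sums.get(drug, 0.0) + score
        counts[drug] = counts.get(drug, 0) + 1
    ordered = sorted(sums, key=lambda d: sums[d] / counts[d], reverse=True)
    return {drug: i + 1 for i, drug in enumerate(ordered)}


def rank_products(drugs, signatures):
    """Product of each drug's ranks across signatures.

    drugs:      drug label of each treatment instance.
    signatures: list of score lists, each aligned with drugs.
    """
    ranks = [mean_score_ranks(drugs, scores) for scores in signatures]
    return {d: prod(r[d] for r in ranks) for d in ranks[0]}


def permutation_pvalues(drugs, signatures, n_perm=1000, seed=0):
    """Shuffle scores across instances within each signature and recompute rank products."""
    rng = random.Random(seed)
    observed = rank_products(drugs, signatures)
    hits = {d: 0 for d in observed}
    for _ in range(n_perm):
        shuffled = [rng.sample(scores, len(scores)) for scores in signatures]
        null = rank_products(drugs, shuffled)
        for d in observed:
            if null[d] <= observed[d]:
                hits[d] += 1
    return {d: hits[d] / n_perm for d in observed}


# Toy data (hypothetical): 3 drugs, 2 instances each, 2 signatures.
drugs = ["A", "A", "B", "B", "C", "C"]
signatures = [[0.9, 0.8, 0.1, 0.0, -0.5, -0.6],
              [0.7, 0.9, 0.2, 0.1, -0.3, -0.4]]
observed = rank_products(drugs, signatures)
```

A small rank product indicates a drug that ranks near the top in every signature.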

3.4 Results

3.4.1 Transcriptional signatures of CR

We obtained nine transcriptional signatures of CR in mouse liver by collecting and analyzing

data from eight previous publications (we collected two signatures from Tsuchiya et al. [156],

one for wild-type and one for dwarf mice). The mice used to generate the CR signatures came

from both sexes and a variety of ages and genetic backgrounds, and CR mice consumed between

56% and 70% of the calories of the matched control group, depending on the study. For each mouse

CR signature, we constructed an orthologous human signature made up of Affymetrix

HG-U133A probe IDs for Connectivity Map analysis (see Materials and Methods). For each human

CR signature, we calculated mean connectivity scores for the 1309 drugs in the Connectivity

Map collection [75]. Connectivity scores range between -1 and 1; a high, positive mean

connectivity score indicates that drug treatment reproduces many of the gene changes with CR.

For each human signature, we then constructed a ranked list of drugs based on the connectivity

scores.


3.4.2 Meta-analysis identified fourteen candidate CR mimetics

We combined the nine ranked lists of drugs into a single matrix, and identified drugs that were

consistently highly ranked across signatures using the Rank Product method [67] (see Materials

and Methods). At a false-discovery rate cut-off of 25% (corresponding to unadjusted p-values ≤

0.0026), we found that 14 drugs significantly mimic the CR response in mouse liver (Figure 3-1A).

While most signatures were consistent, i.e. gave high ranks to most of the significant drugs

(Figure 3-1A), two signatures stood out (the two leftmost columns); these correspond to the

earliest signature included in this study [160], and the signature derived from dwarf mice on a

CR diet [156]. These two relative outliers highlight the usefulness of meta-analyses that identify

consistent trends.

Many of the significant drugs show overlapping modes of action (MoA); 10 of 14 form a

connected component in the MoA drug-drug interaction network (downloaded from MANTRA;

[154]) where two drugs were joined by an edge if both drug treatments induced significantly

similar gene changes (Figure 3-1B). We also queried the network with the three best-known

longevity therapeutics (metformin, rapamycin, and resveratrol), and found that one of the drugs

identified in our screen has MoA similar to resveratrol, and two to rapamycin (Figure 3-1B).

Significant drugs (Figure 3-1A) are indicated for a wide variety of diseases. For example,

pioglitazone is a prescription drug used to treat type 2 diabetes [163]; colchicine is used to treat

gout [164]; MG-262 is a proteasome inhibitor with anti-inflammatory effects in the heart [165];

and Gly-His-Lys can activate wound repair [166]. Three of the drugs have been previously

linked to aging: the PI3K inhibitors wortmannin and LY-294002 and the anti-diabetes drug


pioglitazone increase lifespan in Drosophila [167, 168]. Other drugs, such as the Chembridge

compounds 5155877 and 5224221, are not yet well characterized in terms of their biological

effects.

To our knowledge, none of the significant drugs identified in this study has yet been evaluated as

a CR mimetic; they should be prioritized for further analyses and biological validation.


Figure 3-1. Fourteen drug treatments significantly mimic the effects of CR on hepatic gene

expression.

A. Significant drugs, and a heatmap showing the rank of each drug in each CR signature queried (top

ranks are shown in red). Source publications for signatures are shown on top and ordered from oldest to

most recent; drugs are shown on the right and ordered from most to least significant. B. Drug-drug

interaction network showing links between significant drugs from our screen (grey) and known longevity


drugs resveratrol and rapamycin (green). Two drugs are linked if they show similar modes of action

[154].

3.5 Discussion

While few drugs have been directly tested for their effect on mouse lifespan, over a thousand

drugs have been characterized in terms of their effects on gene expression, and these data are in

the public domain. We have applied this resource to identify fourteen drugs that have similar

transcriptional signatures to CR in mouse liver.

Several dozen other transcriptional signatures of CR are publicly available – mostly in mouse

and rat, but some in primates, including a few in humans. We plan to follow up our pilot study in

mouse liver by conducting in silico expression-based screening on the full set. Expression-based

screening can also be applied to other life-extending treatments for which microarray response

data are available, to see, for example, which compounds induce the gene changes seen in

long-lived Ames or Snell dwarf mice (versus wild-type).

Longevity drugs have great potential to help treat the diseases of aging, yet few such drugs are

known. Our approach and similar ones that leverage the large quantity of public data on drugs and

mammalian aging can accelerate the identification and development of new longevity

therapeutics.

3.6 Acknowledgments

This work was supported in part by Ontario Research Fund (GL2-01-030), Canadian Institutes of

Health Research (BIO-99745), the Canada Foundation for Innovation (CFI #12301 and

#203383), the Canada Research Chair Program, and IBM to IJ, and the Ontario Ministry of


Health and Long Term Care. The views expressed do not necessarily reflect those of the

OMOHLTC.


4 NetwoRx: Connecting drugs to networks and phenotypes in S. cerevisiae

This chapter is based on:

Kristen Fortney, Wing Xie, Max Kotlyar, Joshua Griesman, Yulia Kotseruba and Igor Jurisica

(2012). NetwoRx: Connecting drugs to networks and phenotypes in S. cerevisiae. Submitted.

4.1 Abstract

Motivation: Drug modes of action are complex and still poorly understood. The set of known

drug targets is widely acknowledged to be biased and incomplete, and so gives only limited

insight into the system-wide effects of drugs. But a high-throughput assay unique to yeast –

barcode-based chemogenomic screens – can measure the individual drug response of every yeast

deletion mutant in parallel.

Results: We integrated the three largest S. cerevisiae chemogenomic experiments, which

together comprise the responses of thousands of gene knockout strains to 466 drugs, and applied

data mining approaches to investigate drug effects at the system level. We identified yeast

pathways, functions, and phenotypes that are targeted by particular drugs, computed measures of

drug-drug similarity, and constructed drug-phenotype networks. We built the NetwoRx web

portal to make the results of these analyses freely available. NetwoRx also implements

automated analysis routines; users can query new gene groups against the entire collection of

drug profiles and NetwoRx will calculate which drugs target them. We demonstrate with


example use cases how NetwoRx can be applied to target specific phenotypes, repurpose drugs

using mode-of-action analysis, investigate bipartite networks, and predict new drugs that affect

yeast aging.

Availability: NetwoRx is freely available on the web at http://ophid.utoronto.ca/networx.


4.2 Introduction

The modes of action of many FDA-approved drugs remain poorly characterized: drugs have

off-target effects, and these cause unanticipated side effects. In recent years, HTP experiments have

begun to provide crucial clues to the global cellular response to drugs. Computational

interrogation of these data has many important applications relevant to human health, including

target and side-effect prediction, drug repurposing, and mode-of-action analysis.

Chemogenomic barcode screens are particularly valuable HTP drug assays that are unique to

yeast – comparable data are not yet available for any mammalian model organism. These screens

report the change in colony growth in response to drug treatment for every one of the ~6000

deletion strains in the yeast deletion collection [169, 170]. Deletion strains in the yeast deletion

collection are each tagged with unique bar codes, permitting the growth response of every

deletion strain to be measured in parallel (bar codes are hybridized to microarrays). Previous

studies have demonstrated the relevance of these yeast data to human disease. For example, [171]

tested 81 psychoactive drugs in yeast and identified secondary drug targets that help explain side

effects in human patients, and [172] applied the screen to identify the molecular targets of

elesclomol, a promising chemotherapy adjuvant. These unique chemogenomic data can

complement other HTP measures of drug effects such as gene expression [75].

Bioinformatics analyses of individual chemogenomics datasets have provided valuable insights

into drug mode of action. These analyses have included the unsupervised clustering of drug

fitness profiles (growth responses) to identify groups of drugs that affect genes in the same way

[173, 174], and calculating gene co-fitness and using it to predict gene function [175].


Here we integrate the three largest chemogenomic experiments, covering several thousand yeast

genes and 466 drugs, and investigate drug effects at the systems level. We apply gene set

analysis to identify pathways and phenotypes targeted by drugs, compute drug-drug similarity

metrics for mode-of-action analysis, and build drug-phenotype networks. We applied our

methods to four gene set collections of high biological relevance: Gene Ontology categories [74],

KEGG pathways [143], SGD mutant phenotypes [176], and YEASTRACT targets of

transcription factors [177]. We make the full results of our analyses available through NetwoRx,

a web database linking drugs to networks and phenotypes. We also set up automated analysis

routines in NetwoRx; users can query new gene lists against the entire collection of drug profiles

and NetwoRx will retrieve the drugs that target them.

We demonstrate with example use cases how NetwoRx can be applied to (1) identify drugs that

modulate the oxidative stress response; (2) repurpose drugs for cancer by examining pathways

involved in DNA damage; (3) investigate the druggability of transcription factor targets with a

bipartite network; (4) cluster the drug-pathway network to identify drugs with shared modes of

action; and (5) predict new drugs that modulate yeast aging.

4.3 Materials and Methods

4.3.1 NetwoRx methods

4.3.1.1 Data sets

Chemogenomic data. Log-ratio data of control to drug treatment strain abundance, and P values

for individual drug-gene associations, were obtained from the three largest previously published

yeast chemogenomic studies [171, 173, 174]. The union of these datasets comprised 5924 genes

and 466 drugs. [173] and [171] used the diploid yeast deletion collection, including both

homozygous and heterozygous deletion strains [169, 170]. [174] used the haploid yeast deletion


collection [169]. NetwoRx treats the experimental data in the same manner as in the original

publications. The data of [173] are treated as two separate experiments, an experiment with

homozygous deletion strains (4742 distinct ORFs, 132 drugs) and an experiment with

heterozygous deletion strains (5272 ORFs, 318 drugs). The data of [171] are treated as a single

experiment with a mix of heterozygous and homozygous deletion strains (5200 ORFs, 81 drugs).

The data of [174] are treated as a single experiment with the haploid deletion collection (4111

ORFs, 82 drugs).

Gene sets. KEGG Pathways [143] and Gene Ontology categories [74] were obtained from the

Bioconductor 2.8 package org.Sc.sgd.db; mutant phenotypes were downloaded from SGD [176];

transcription factor targets were obtained from the YEASTRACT database [177].

4.3.1.2 Gene set scores

Gene-level scores. Scores for individual gene-drug relations were calculated as log strain

abundance ratios (of control to drug treatment); these data were downloaded from the individual

publications used in NetwoRx. If a gene was represented more than once in a dataset, for each

drug treatment we selected the gene’s largest score.

Gene set score (GSS). The statistic for a set of genes was calculated as the mean of the

gene-level scores for set genes, adjusted for set size (Figure 4-1). For a drug treatment whose

gene-level scores have mean µ and standard deviation σ, and a gene set P of size n, if S(P) is the

average of the gene-level scores s_i (for genes i in P), then the GSS is S’(P) = (S(P) - µ)/(σ/sqrt(n)). For each drug

treatment, we calculated scores for those gene sets where gene-level scores were available for at

least 5 and no more than 500 genes; other gene sets were assigned a value of NA.
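The GSS formula can be sketched in a few lines of Python. The sketch assumes that µ and σ are the mean and (population) standard deviation of all gene-level scores for the drug treatment, and it represents NA as `None`; the function name and toy scores are hypothetical, and NetwoRx itself was implemented in R and Java.

```python
from math import sqrt
from statistics import mean, pstdev


def gene_set_score(gene_scores, gene_set, min_size=5, max_size=500):
    """S'(P) = (S(P) - mu) / (sigma / sqrt(n)) for one drug treatment.

    gene_scores: dict mapping gene -> gene-level score for this treatment.
    Returns None (NA) if fewer than min_size or more than max_size set genes have scores.
    """
    members = [gene_scores[g] for g in gene_set if g in gene_scores]
    n = len(members)
    if n < min_size or n > max_size:
        return None
    all_scores = list(gene_scores.values())
    mu = mean(all_scores)
    sigma = pstdev(all_scores)  # assumption: population standard deviation
    return (mean(members) - mu) / (sigma / sqrt(n))


# Toy data (hypothetical): ten genes with scores 0..9.
scores = {f"g{i}": float(i) for i in range(10)}
gss = gene_set_score(scores, {"g5", "g6", "g7", "g8", "g9"})
```

Dividing by σ/sqrt(n) puts sets of different sizes on a common Z-score scale, so a large set with a modest mean shift can score as highly as a small set with a large shift.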


P-values. For a GSS corresponding to a given gene set and drug treatment, we calculated two P

values, P1 and P2. For P1, we computed the one-sided P-value corresponding to the Z-score

defined by the GSS. For P2, we considered the matrix of GSS values for all drug treatments in a

dataset and all gene sets of a given type (e.g. all GO categories); we calculated P2 as the fraction

of these values equal to or exceeding the GSS. The P value for the gene set under the drug

treatment was then reported as P = max(P1,P2). For user queries of new gene sets we report only

P1 (as there is no appropriate background gene set collection to use for P2).
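The combination of the two P values can be sketched as follows. This Python sketch assumes that the one-sided test is an upper-tail test on a standard normal Z-score (consistent with P2 counting values "equal to or exceeding" the GSS) and that NA entries are skipped when computing P2; the function names and toy values are hypothetical.

```python
from math import erfc, sqrt


def p1_from_z(gss):
    """Upper-tail probability of a standard normal variable at the GSS."""
    return 0.5 * erfc(gss / sqrt(2))


def p2_from_matrix(gss, background_gss):
    """Fraction of background GSS values (NA skipped) equal to or exceeding the GSS."""
    values = [v for v in background_gss if v is not None]
    return sum(1 for v in values if v >= gss) / len(values)


def gene_set_pvalue(gss, background_gss):
    """Report the more conservative of the parametric and empirical P values."""
    return max(p1_from_z(gss), p2_from_matrix(gss, background_gss))


# Toy background (hypothetical): flattened GSS matrix with one NA entry.
background = [2.5, 2.0, 1.0, None, 0.0]
p = gene_set_pvalue(2.0, background)
```

Taking the maximum means an association is reported only if it is extreme both under the normal approximation and relative to the other drug and gene set combinations in the same dataset.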

4.3.1.3 Drug-drug similarity

For each chemogenomic dataset, we calculated two measures of drug-drug similarity for all pairs

of drugs, S1 and S2.

For S1, we took the matrix of gene-level scores (genes vs. drugs), eliminated columns or rows

where more than half of values were NA, and then calculated the Pearson correlation between all

pairs of columns (drugs). For drugs represented more than once in a data set, we merged

replicates by calculating average correlations. For each drug-drug similarity score, we calculated

its associated P value as the fraction of other drug-drug similarity scores equal to or exceeding it.

For S2, we repeated the same filtering, calculations, and merging on the matrix of GSS scores

(gene sets vs. drugs); all gene set types (GO, KEGG, YEASTRACT, SGD phenotype) were

included in the GSS matrix.
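The S1 calculation can be sketched as below. This simplified Python sketch filters only drug columns (not gene rows) with more than half of their values missing, handles any remaining NA values by pairwise deletion, and omits replicate merging; these simplifications, the function names, and the toy profiles are assumptions of the example. The same functions apply to S2 when the rows are gene sets rather than genes.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    vx = sum((a - mx) ** 2 for a, _ in pairs)
    vy = sum((b - my) ** 2 for _, b in pairs)
    return cov / sqrt(vx * vy)


def drug_similarity(profiles):
    """Pairwise correlations between drug profiles with at most half of values NA."""
    kept = {d: p for d, p in profiles.items()
            if sum(v is None for v in p) <= len(p) / 2}
    drugs = sorted(kept)
    return {(a, b): pearson(kept[a], kept[b])
            for i, a in enumerate(drugs) for b in drugs[i + 1:]}


def similarity_pvalue(pair, similarities):
    """Fraction of the other drug-drug scores equal to or exceeding this pair's score."""
    others = [s for p, s in similarities.items() if p != pair]
    return sum(1 for s in others if s >= similarities[pair]) / len(others)


# Toy profiles (hypothetical): d4 is dropped because 3 of its 4 values are NA.
profiles = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.0],
    "d3": [4.0, 3.0, 2.0, 1.0],
    "d4": [None, None, None, 1.0],
}
sims = drug_similarity(profiles)
```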


4.3.1.4 Bipartite interaction networks

For a given gene set collection and chemogenomic dataset, a drug and a gene-set were

considered to interact if the GSS had an associated P value ≤ 0.05 for at least one treatment with

that drug. For each drug/gene-set association we report the lowest P value observed over all

treatments of the same drug.
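The edge rule above can be sketched in a few lines of Python. This is an illustration only; the function name and the input layout (a mapping from treatment and gene set to P value, plus a mapping from treatment to drug) are assumptions made for the example.

```python
def bipartite_edges(p_values, treatment_to_drug, alpha=0.05):
    """p_values: dict (treatment, gene_set) -> P value, or None for NA.

    Returns dict (drug, gene_set) -> lowest P value over all treatments
    of that drug, keeping only pairs whose lowest P value is <= alpha.
    """
    best = {}
    for (treatment, gene_set), p in p_values.items():
        if p is None:
            continue
        key = (treatment_to_drug[treatment], gene_set)
        if key not in best or p < best[key]:
            best[key] = p
    return {key: p for key, p in best.items() if p <= alpha}
```

Keeping the minimum P value per drug is equivalent to requiring that at least one treatment with that drug reaches the threshold.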

4.3.1.5 Code

Code for all analyses was written in R 2.13.0; we also used the Bioconductor 2.8 packages

GSEABase and org.Sc.sgd.db.

4.3.1.6 Database implementation

The NetwoRx portal was written in Java and runs on the WebSphere 6.1 application server on an

IBM P595 server with a secondary P595 backup server. The database runs on DB2 9.5 on an

IBM P570 server with a mirror running on P595 for redundancy and workload balancing.

Figure 4-1. Gene set analysis of chemogenomic data.

The score S of a pathway P is calculated as a function f of the gene-level scores s for genes in P.

4.3.2 Use-case methods

4.3.2.1 Data sets

Chronological aging. Three sets of genes that extend yeast chronological lifespan were obtained

from previously published genome-wide experiments [178-180].

Drugs that modulate aging. Drugs known to modulate aging in S. cerevisiae were downloaded

from the Lifespan Observation Database at http://lifespandb.sageweb.org/.

4.3.2.2 Code and software

Code for all analyses was written in R 2.13.0. We used the WGCNA R package for drug-drug

similarity network analysis [181]. Networks were visualized with NAViGaTOR 2.2.1

(http://ophid.utoronto.ca/navigator/).

4.4 NetwoRx content and functionality

Here we briefly describe basic database content and functionality.

4.4.1 Database contents

The NetwoRx web portal contains drug-response data calculated for 466 drugs and thousands of

S. cerevisiae genes. Drugs are linked to their PubChem Compound IDs [182], yeast genes to

their SGD entries [183], and gene sets to their relevant databases (GO, KEGG, YEASTRACT, or

SGD phenotype).

Drug-gene associations. P values for associations between drugs and individual genes. Link:

http://ophid.utoronto.ca/networx/singleid

Drug-pathway associations. P values and gene-set scores for associations between drugs and

KEGG pathways [143], GO categories, YEASTRACT targets of transcription factors [177], and

SGD mutant phenotypes [176]. Link: http://ophid.utoronto.ca/networx/drug2pathway

Drug-drug similarity metrics. Similarity values S1 and S2 (between -1 and 1) and associated P

values for all pairs of drugs, quantifying the extent to which drugs affect genes (S1) or pathways

(S2) in the same way. Link: http://ophid.utoronto.ca/networx/drug2drug

Drug-pathway networks. Bipartite networks of significant drug-pathway associations are

available as tab-delimited text files or NAViGaTOR 2.2.1 files for network visualization [184]. Link:

http://ophid.utoronto.ca/networx/drugnetworks

New pathway search. Users can specify a new set of genes, and NetwoRx will calculate which

drugs interact with it. Link: http://ophid.utoronto.ca/networx/newmodule

4.4.2 Accessing data

Search by drug. Users can search for drugs by their name or by their PubChem Compound ID

(e.g., rapamycin or 5284616).

Search by gene or list of genes. Users can search for yeast genes by their systematic names (e.g.,

YKL203C).

Search by gene set identifier. Users can search for gene sets by their set-specific identifiers (e.g.,

GO:0006979).

4.5 NetwoRx use case examples

Here we provide several NetwoRx use cases, using NetwoRx data alone (cases 1-4) or in

combination with data from other high-throughput experiments (case 5).

4.5.1 Retrieving drugs that perturb phenotypes: oxidative stress

Querying NetwoRx with gene sets related to oxidative stress, from either the Gene Ontology

(‘response to oxidative stress’, GO:0006979) or SGD mutant phenotypes (‘oxidative stress

resistance’), returns drugs that perturb these pathways. The results include both compounds

known to cause oxidative stress (e.g., hydrogen peroxide, paraquat) and compounds known to

protect from it (e.g., allyl disulfide, rapamycin). Other significant drugs have not yet been tested for their impact on

oxidative stress (Figure 4-2).

Figure 4-2. Drugs that perturb oxidative stress pathways.

Drugs are shown in order of increasing P value; some drugs (green) are known to ameliorate the effects

of oxidative stress while other drugs (red) induce it. Data set: homozygous collection of [173].

4.5.2 Focused searches identify drugs with shared mode of action: drugs that

target the same DNA damage pathways as cisplatin

Querying NetwoRx with the chemotherapeutic agent cisplatin (CID 441203) to identify its mode

of action returns four significant KEGG pathways related to DNA damage: base excision repair

(sce03410), nucleotide excision repair (sce03420), DNA mismatch repair (sce03430), and

homologous recombination (sce03440). Querying NetwoRx with these four DNA damage

pathways and extracting the drug-pathway network reveals that many significant drugs are

known cancer drugs that are connected to multiple pathways (Figure 4-3). Other significant

drugs have not yet been tested for cancer and should be prioritized for further study.

Figure 4-3. Mode of action analysis of the chemotherapeutic cisplatin.

Node size is proportional to degree. Known cancer drugs are indicated in green. Data set: homozygous

collection of [173].

4.5.3 Bipartite networks reveal that some gene sets are druggable hubs

NetwoRx users can choose to download the entire collection of significant drug–pathway

connections for a given gene set type, either as a tab-delimited text file or as a graph that can be

visualized in NAViGaTOR 2.2.1 [184]. Downloading the entire set of associations for

YEASTRACT transcription factors reveals that while most targets of TFs are affected by only

a few drugs, some (e.g., GCR1, IFH1) are perturbed by very many (Figure 4-4).

Figure 4-4. Bipartite network showing all connections between drugs and YEASTRACT

targets of transcription factors.

Node size is proportional to degree. Data set: [174]. We highlight the high-degree nodes and their

connectivity.

4.5.4 Clustering the drug-pathway matrix identifies drug modules that share

modes of action

NetwoRx provides measures of drug-drug similarity that quantify the extent to which pairs of

drugs impact pathways in the same way. NetwoRx users can search these data by drug name or

download them in bulk. We downloaded the entire matrix of drug-drug similarities from

NetwoRx for the heterozygous experiments of [173]. We then used the R package WGCNA

[181] to cluster drugs into modules sharing mode-of-action. These modules can be applied for

drug repurposing. For example, one module was highly enriched for psychoactive drugs (Figure

4-5). Five of the six drugs in the module are used as sedatives and antipsychotics. The last drug,

hexestrol, is a synthetic estrogen that NetwoRx predicts to be psychoactive.

Figure 4-5. Drug module identified by clustering the matrix of drug-drug similarity scores.

5 of 6 drugs in this module are known to be psychoactive (indicated in bold). Data set: heterozygous

collection of [173].

4.5.5 User-defined gene sets: identifying new drugs that modulate yeast

chronological aging

NetwoRx can perform gene set analysis of new gene sets specified by the user. Here we apply

this functionality to identify new drugs that may modulate yeast aging. Three previous studies

have conducted genome-wide assays in yeast to identify gene deletions that lead to increased

survival in prolonged stationary phase [178-180]. We obtained sets of longevity genes from each

publication (42, 57, and 90 genes, respectively). Notably, the overlap among these sets was very

poor (Figure 4-6, bottom right). There were 3 genes common to [179] and [180], and one gene

common to [178] and [180]. No gene was common to all three studies; furthermore, no gene was

common to the two most recent studies, despite the fact that they shared a very similar

experimental methodology.

Querying these gene sets against all datasets in the NetwoRx collection revealed that these gene

sets share many targeting drugs. In total, 125 drugs target at least one gene set, 29 target at least

two sets, and 8 target all three. We downloaded the set of drugs previously shown to extend yeast

chronological lifespan from the Lifespan Observation Database at http://lifespandb.sageweb.org/.

Three of these drugs (rapamycin, caffeine, and sodium chloride) are included in the NetwoRx

collection, and our analysis identified all three as significantly associated with one or more aging

gene sets (Figure 4-6, green nodes). Rapamycin, a well-known anti-aging drug that has been

shown to extend lifespan in multiple species [185], targets all three gene sets; NaCl targets two

gene sets; caffeine targets one. Other drugs in the network have been reported to extend life in

other species, e.g., curcumin and wortmannin extend life in Drosophila [167, 186].
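The tallies reported above (drugs targeting at least one, two, or three gene sets) follow from a simple count over the per-gene-set lists of significant drugs. A minimal Python sketch, with a hypothetical function name and input layout:

```python
from collections import Counter

def count_by_sets_hit(hits):
    """hits: dict gene_set_name -> set of drugs significant for that gene set.

    Returns dict k -> number of drugs significant for at least k gene sets.
    """
    per_drug = Counter(d for drugs in hits.values() for d in drugs)
    return {k: sum(1 for c in per_drug.values() if c >= k)
            for k in range(1, len(hits) + 1)}
```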

Other NetwoRx functionalities can be applied to narrow down a list of interesting candidates

from the set of 125 significant drugs. For example, we retrieved from NetwoRx a list of the top

10 drugs most similar in terms of their pathway-based mode of action to the anti-aging drug

rapamycin, from the heterozygous experiment of [173]. Six of these ten drugs also target at least

one aging gene set, i.e. are represented in the aging-drug network (Figure 4-6): allyl disulfide,

allyl sulfide, CDL 14A, CDL 3F2, CID 688028, and CID 697443.

Figure 4-6. Drugs predicted by NetwoRx to modulate yeast chronological lifespan.

Drugs known to increase yeast lifespan are indicated in green. Node size is proportional to degree, and

edge width is proportional to the statistical significance of the drug/gene-set connection (for all

connections P ≤ 0.05). Diagram at bottom right indicates the overlap between the genes identified as

significant in each aging study. Data set: union of all 3.

4.6 Discussion

The NetwoRx web portal brings together data from the major S. cerevisiae barcode

chemogenomics experiments and facilitates their systems-level analysis. These unique

chemogenomic data can help shed light on the genome-wide effects of drug treatment,

accelerating the identification and development of new therapeutics.

As with any assay, yeast barcode chemogenomic screens have several limitations; these have

been discussed elsewhere, e.g., [187, 188]. Importantly, these screens can be used only with

those compounds bioactive in yeast; can capture only those drug-gene interactions which impact

growth; and are relevant to disease only for those human proteins that have yeast homologs.

Integrative computational methods that mine chemogenomic data are fast, cheap, and can

complement traditional methods of drug screening. We illustrated with examples how NetwoRx

can be applied to analyze mode-of-action of cancer drugs, repurpose psychoactive drugs, and

predict new drugs that modulate yeast aging.

In the future, developments in RNAi technology will allow experiments of comparable

throughput to be conducted in mammalian cell lines, and we will expand NetwoRx to include

these data. Pooled shRNA screens have already helped elucidate the mode-of-action of

individual cancer drugs and show enormous promise for speeding drug development [189, 190].

4.7 Acknowledgements

We thank Marc Angeli and Abraham Heifets for their helpful comments on the manuscript.

Funding: This work was supported in part by the Ontario Research Fund (GL2-01-030 and

RE-03-020), the Canadian Institutes of Health Research (BIO-99745), the Canada Foundation for Innovation

(CFI #12301 and #203383), the Canada Research Chair Program, and IBM to IJ, and the Ontario

Ministry of Health and Long Term Care. The views expressed do not necessarily reflect those of

the OMOHLTC.

5 Computationally repurposing drugs for lung

cancer with CMapBatch: candidate therapeutics

from an integrative meta-analysis of cancer gene

signatures and chemogenomic data

This chapter is based on:

Kristen Fortney, Joshua Griesman, Max Kotlyar, and Igor Jurisica (2012). Computationally

repurposing drugs for lung cancer with CMapBatch: candidate therapeutics from an integrative

meta-analysis of cancer gene signatures and chemogenomic data. In Preparation.

5.1 Abstract

Background

Using gene signatures to computationally repurpose FDA-approved drugs can accelerate the

development of new therapeutics. Though existing methods for signature-based repurposing are

based on the analysis of individual signatures, for many diseases dozens of gene signatures are in

the public domain. We develop CMapBatch to exploit these data. CMapBatch is a

computational meta-analysis pipeline that takes as input a collection of gene signatures of

disease and outputs a list of drugs predicted to consistently reverse pathological gene changes.

We apply CMapBatch to identify new therapeutics for lung cancer.

Results

We applied CMapBatch to a collection of 21 gene expression signatures of lung cancer. We

demonstrate that, while drug candidates identified by CMap analysis of individual gene

signatures (http://www.broadinstitute.org/cmap/) are highly variable, CMapBatch returns very

stable sets of top drug candidates. Our meta-analysis of all 21 signatures revealed that 247 drugs

consistently reversed lung cancer gene changes. In silico validation on the NCI-60 collection

showed that drug candidates significantly inhibit growth in nine lung cancer cell lines (of nine

tested). Common protein targets of drug candidates included CALM1 and PLA2G4A. We

characterized these drugs’ chemical properties and drug-target network, and applied multiple

criteria to rank them in terms of therapeutic promise.

Conclusions

CMapBatch can improve signature-based drug repurposing by leveraging the large number of

disease signatures; we have made this method publicly available at

http://ophid.utoronto.ca/cmapbatch. We applied CMapBatch to identify a prioritized list of new

candidate drugs for lung cancer.

5.2 Background

Lung cancer accounts for the largest number of cancer-related deaths, and the 5-year survival

rate (across all stages) is only 16% [191]; there is an urgent need for new therapeutics to help

treat it. Over the past two decades, the application of HTP technologies has led to the rapid

accumulation of comprehensive and diverse public datasets cataloguing genome-wide molecular

alterations seen with lung cancer or with drug administration. Integrative computational methods

that mine these data are fast, cheap, and can complement traditional methods of drug screening;

complementary information in these distinct resources can be leveraged to develop

comprehensive in silico screens for novel cancer therapeutics [192].

One such resource, the Connectivity Map (CMap), which is the focus of our analyses, catalogues

the transcriptional responses to drug treatment in human cell lines for over a thousand small

molecules [75]. CMap has been successfully applied to identify novel therapeutics for a diverse

set of indications including various cancers (e.g., [76, 77]), and most recently osteoarthritic pain

[193] and muscle atrophy [194].

CMap was applied in two earlier studies to identify novel therapeutics for lung cancer. Wang et

al. [195] combined two microarray data sets to create a single transcriptional signature of lung

adenocarcinoma and screened it against CMap. They tested one of their drug hits (17-AAG) in

vitro and found that it inhibited growth in two lung adenocarcinoma cell lines. Ebi et al. [196]

constructed a transcriptional signature of survival in patients with lung adenocarcinoma; CMap

analysis identified several drugs that might improve outcome. The authors experimentally

confirmed the growth inhibitory activity of several drug hits, including rapamycin, LY-294002,

prochlorperazine, and resveratrol.

Nearly every previous analysis using Connectivity Map data to link drugs to diseases has done so

with the CMap online tool (http://broadinstitute.org/cmap/). The CMap tool takes as input a set

of up-regulated probe sets and a set of down-regulated probe sets, and returns a list of drugs that

revert or mimic those gene expression changes. However, for most diseases, not one but many

– often dozens – of distinct gene signatures are available. For example, the cancer-specific

database Oncomine (version 4.4) stores mRNA data from 566 different studies [65]. As the

CMap tool only deals with one gene signature at a time, the question of how best to take

advantage of the information in a large collection of disease signatures remains an important

open problem.

A few studies have used multiple disease signatures in CMap analysis, e.g., [194, 195], although

with one exception [197] they used only two or three signatures per disease. All of them relied on

essentially the same strategy: collapsing the disease signatures into a single meta-signature (e.g., by

intersecting lists of significant genes from different studies, as in [194]) and querying the CMap

data with this signature. Since each of the individual disease signatures was

constructed using dozens or even hundreds of microarrays, there is fairly strong evidence for

every gene in each signature. In comparison, the drug response data in CMap is noisy: the 1309

drugs have each been tested only a median of 4 times (4 treatment microarrays). This noise has

consequences: previous work has shown that even small changes in the input gene signature can

lead to large changes in the list of drugs identified as significant by CMap analysis (with the

sscMap program) [198, 199].

Here we propose an alternative strategy for connecting a set of disease gene signatures to drugs,

CMapBatch. Rather than collapsing all the gene signatures in the set into a single gene signature,

we propose to screen each disease signature separately against CMap to produce a set of ranked

lists of drug candidates. Next, we apply a rank-based meta-analysis method to identify which

drugs are consistently ranked as the best candidates across all disease signatures. That is, we perform

the meta-analysis at a later step: our method combines lists of drugs rather than lists of genes.

We show that this strategy returns more stable sets of top drug candidates.

Next, we applied CMapBatch to lung cancer. We used three steps to identify and prioritize new

lung cancer therapeutics. First, we conducted a meta-analysis using CMapBatch to identify drugs

that reverse the transcriptional changes seen with lung cancer across 21 gene signatures. We

found that 247 CMap drugs consistently counter the gene changes that occur with lung cancer.

Second, we performed in silico validation of drug candidates with the NCI-60 growth inhibition

data. We found that drug candidates identified by CMapBatch were significantly more likely to

slow growth in nine lung cancer cell lines than other CMap drugs (P < 0.01). Third, we

implemented data integration for drug prioritization. We identified common protein targets of

significant drugs, and used chemical structure similarity and drug-target relationships to

prioritize candidate therapeutics.

5.3 Results and discussion

5.3.1 CMapBatch meta-analysis strategy: From individual cancer gene signatures

to candidate therapeutics

Our CMapBatch meta-analysis pipeline comprises the following steps (Figure 5-1): For each

individual lung cancer signature (tumour vs. normal comparison), we calculate mean

connectivity scores for 1309 small molecules (as previously described [75]). Connectivity scores

range between -1 and 1; a large, negative mean connectivity score indicates that drug treatment

reverses many of the gene changes with lung cancer. We use the mean connectivity score to

construct a ranked list of drugs for each signature. We combine the ranked lists of drugs into a

single matrix, and identify drugs that were consistently highly ranked across signatures using the

Rank Product method [67] (see Materials and Methods).
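The ranking and aggregation steps can be illustrated with a short Python sketch. It is not the CMapBatch implementation: the function names are hypothetical, it assumes every signature supplies a score for the same set of drugs, it breaks ties by input order, and it computes only the rank product statistic (the geometric mean of ranks), leaving out the significance and FDR estimation described in Materials and Methods.

```python
import math

def rank_drugs(scores):
    """scores: dict drug -> mean connectivity score for one signature.

    Rank 1 goes to the most negative score (strongest reversal).
    """
    ordered = sorted(scores, key=lambda d: scores[d])
    return {drug: i + 1 for i, drug in enumerate(ordered)}

def rank_product(score_lists):
    """score_lists: one dict per signature, all over the same drugs.

    Returns dict drug -> geometric mean of its ranks across signatures.
    """
    rankings = [rank_drugs(s) for s in score_lists]
    k = len(rankings)
    return {d: math.exp(sum(math.log(r[d]) for r in rankings) / k)
            for d in rankings[0]}
```

A drug with a small rank product is near the top of most lists, which is the sense in which a candidate is "consistently highly ranked" across signatures.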

Our analyses are based on 21 previously published gene expression signatures of lung cancer

obtained from Oncomine [65] and CDIP, the Cancer Data Integration Portal

(http://ophid.utoronto.ca/cdip/). The samples used to derive each signature have diverse

histologies, and mRNA levels were measured on various commercial platforms (Table 5-S1).

Figure 5-1. CMapBatch meta-analysis pipeline.

Given a set of disease signatures, CMapBatch calculates mean connectivity scores for 1,309 drugs and

converts them to ranks. Next, CMapBatch applies the Rank Product method to identify drugs that are

consistently highly ranked across signatures. On a set of 21 transcriptional signatures of lung cancer, we

identified 247 drugs that significantly reverse these pathological gene expression changes (at FDR < 1%).

5.3.2 Candidate drugs identified via CMapBatch are more conserved across

signature subsets than candidate drugs identified from single gene signatures

Previous work has shown that CMap analysis of different gene signatures for the same disease

can return very different lists of drug candidates [199]. This is undesirable, if perhaps

unsurprising, as gene signatures themselves can be highly variable [192]. Consistent with

previous work, when we retrieved lists of the top 50 drugs for each of the 21 different gene

signatures of lung cancer (using the CMap online tool), overlap was poor. The median number of

drug candidates present in top 50 drug candidate lists from two different signatures was only 22

(Figure 5-2 in blue). Repeating the same test using lung cancer signatures of the same type – 10

adenocarcinoma signatures – did not lead to much improvement. For adenocarcinoma, the

median number of drugs identified by two signatures was 26 (Figure 5-2 in gray), but the

difference is not statistically significant.

Next, we sought to determine whether using a large set of signatures with CMapBatch would

lead to a more stable list of top drug candidates. For this test, we randomly assigned the 21 lung

cancer gene signatures to two groups, one with 10 and the other with 11 signatures. We ran

CMapBatch separately on the two disjoint sets of signatures, and compared lists of the top 50

drugs identified for each set. We repeated this test 100 times. We found that CMapBatch

consistently identifies the same drugs as combatting lung cancer, even when it is trained on

completely different sets of lung cancer signatures. A median of 39 drugs were found to be

common to both lists of top 50 drugs identified from two disjoint sets of signatures (Figure

5-2 in green), significantly more than are found with individual gene signatures (Wilcoxon test P

<< 0.01).

Figure 5-2. CMapBatch produces more stable lists of significant drugs than individual gene

signatures.

Shown are boxplots of the number of conserved drug candidates when any two lists of top 50 drug

candidates are intersected. Green: 21 gene signatures were split into two disjoint sets of 10 and 11

signatures, CMapBatch was run on both sets, and top drugs from each set were compared; this

experiment was repeated 100 times. Blue: 21 gene signatures were used to retrieve 21 lists of drugs

with the CMap online tool; top drugs from all pairs of signatures were compared. Grey: 10 gene

signatures of the same lung cancer type (adenocarcinoma) were used to retrieve 10 lists of drugs with

the CMap online tool; top drugs from all pairs of signatures were compared. CMapBatch results showed

a significantly higher median overlap (Wilcoxon test P << 0.01).
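The split-half experiment summarized in Figure 5-2 can be sketched in Python as follows. This is an illustrative sketch, not the code used for the analysis: the function names, the fixed random seed, and the tie-breaking by input order are assumptions, and the aggregation step is a compact rank product that assumes every signature scores the same set of drugs.

```python
import math
import random
import statistics

def top_k_by_rank_product(score_lists, k):
    """Top-k drugs by rank product; rank 1 is the most negative score."""
    log_sum = {}
    for scores in score_lists:
        for rank, drug in enumerate(sorted(scores, key=scores.get), start=1):
            log_sum[drug] = log_sum.get(drug, 0.0) + math.log(rank)
    # With the same drugs in every list, ordering by the sum of log ranks
    # is the same as ordering by the geometric mean of ranks.
    return set(sorted(log_sum, key=log_sum.get)[:k])

def split_half_overlap(score_lists, k=50, repeats=100, seed=0):
    """Median overlap of the top-k lists obtained from random disjoint halves."""
    rng = random.Random(seed)
    overlaps = []
    for _ in range(repeats):
        shuffled = score_lists[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2  # 21 signatures -> groups of 10 and 11
        a = top_k_by_rank_product(shuffled[:half], k)
        b = top_k_by_rank_product(shuffled[half:], k)
        overlaps.append(len(a & b))
    return statistics.median(overlaps)
```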

5.3.3 Characterizing and prioritizing candidate lung cancer therapeutics

For the remainder of this paper, we focus on characterizing and prioritizing the full set of

significant drugs identified by CMapBatch using all 21 gene signatures of lung cancer.

CMapBatch meta-analysis identified 247 candidate lung cancer therapeutics. At an FDR cut-off

of 0.01, we find that 247 drugs (out of 1,309 drugs in CMap Build 2) significantly reverse the

gene expression changes seen with lung cancer in the full set of 21 lung cancer signatures. This

is a large number of drugs, but in line with previous results obtained using similar data; e.g., a

recent paper examining disease-drug relationships using the 164 drugs tested in CMap Build 1

linked 72 of them to adenocarcinoma of the lung, and 67 to squamous cell carcinoma of the lung

[197].

5.3.4 Candidate therapeutics inhibit growth in nine lung cancer cell lines

As an independent validation of our results, we used growth inhibition data from the NCI-60

collection [200] to determine whether the drug candidates we identified are better at slowing

growth in lung cancer cell lines. For all our NCI-60 analyses we used the nine lung cancer cell

lines in which over 100 Connectivity Map drugs were tested (see Methods).

Significant drugs are more effective at inhibiting growth than other Connectivity Map

drugs. In all nine cell lines, drugs that CMapBatch identifies as reversing the transcriptional

changes seen with lung cancer are significantly better (Wilcoxon test P < 0.01) than other CMap

drugs at inhibiting growth (Figure 5-3).

Figure 5-3. Drug candidates inhibit growth in lung cancer cell lines more than other

Connectivity Map drugs.

We tested whether the drugs that we identified as significantly reversing the gene changes seen with

lung cancer were better at inhibiting growth using NCI-60 GI50 data in 9 lung cancer cell lines. In every

cell line, significant drugs are better than other Connectivity Map drugs at inhibiting growth (Wilcoxon test

P < 0.01).

23 significant drugs inhibit growth in a majority of lung cancer cell lines. For each of the nine

cell lines, and using data from every drug tested on that line, we define the threshold for

sensitivity to a drug to be the top 20% of the –logGI50 values; i.e., we say that the cell line is

sensitive to those drugs with –logGI50 values in the top 20%. By this definition, of all the NCI-60

drugs that have been tested in 5-9 lung cancer cell lines, 7,794 of 44,802, or 17%, inhibit growth

in 5 or more cell lines. Of the significant drugs tested, 23/41, or 56% inhibit growth in 5 or more

lung cancer cell lines (Figure 5-4, left).
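The sensitivity definition above can be sketched in Python. The function names and input layout are hypothetical, and the sketch assumes that ties at the 20% boundary are resolved by sort order and that the count of top drugs is rounded down.

```python
def sensitive_drugs(neg_log_gi50, top_fraction=0.2):
    """neg_log_gi50: dict drug -> -logGI50 measured in one cell line.

    Returns the drugs whose values fall in the top `top_fraction` for that line.
    """
    ordered = sorted(neg_log_gi50, key=neg_log_gi50.get, reverse=True)
    n_top = int(len(ordered) * top_fraction)
    return set(ordered[:n_top])

def broadly_active(cell_lines, min_lines=5):
    """cell_lines: dict cell_line -> {drug: -logGI50}.

    Returns drugs to which at least `min_lines` cell lines are sensitive.
    """
    counts = {}
    for values in cell_lines.values():
        for drug in sensitive_drugs(values):
            counts[drug] = counts.get(drug, 0) + 1
    return {d for d, c in counts.items() if c >= min_lines}
```

Because the threshold is computed separately within each cell line, a drug is counted as active only relative to the other drugs tested on that same line.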

Among these 23 are several that are already in use to treat cancer. For example, daunorubicin

and the chemically related doxorubicin are topoisomerase inhibitors and commonly used

chemotherapeutic agents; sirolimus (rapamycin) is currently in clinical trials for several cancers

and was recently shown to increase NSCLC tumour cell sensitivity to erlotinib [201]; vorinostat,

a histone deacetylase inhibitor, enhanced the response to carboplatin or paclitaxel in patients

with advanced NSCLC [202]; MS-275, also a histone deacetylase inhibitor, enhanced the

response to erlotinib in an erlotinib-resistant lung adenocarcinoma cell line [203].

Others of the 23 have not yet been investigated as cancer therapeutics (i.e., there are fewer than

20 PubMed abstracts linking the drug to any type of cancer) and should be prioritized for further

biological validation. For example, spiperone and pimozide are antipsychotics. Recently,

pimozide was shown to reduce the viability of several cancer cell lines while sparing normal

cells [22].

We call this set of 23 drugs, which transcriptionally reverse lung cancer gene changes and slow

growth in lung cancer cell lines, TOP drugs; in subsequent sections, we prioritize significant

drugs that have not been tested in NCI-60 by linking them to TOP drugs using a variety of

metrics.

Figure 5-4. Prioritizing drug candidates with GI50 values and chemical structures.

Twenty-three of the significant drugs inhibit growth in a majority of lung cancer cell lines (left). A further

11 significant drugs not tested in NCI-60 are highly structurally similar (Tanimoto similarity ≥ 0.8) to

one or more of these drugs (right).

5.3.5 Prioritizing drugs by structural similarity: eleven significant drugs are highly

structurally similar to TOP drugs

The Tanimoto coefficient quantifies the chemical structure similarity between two molecules

[204]; here, we call two molecules structurally similar if this number exceeds 0.8. We found that

eleven drugs that reverse the transcriptional changes observed in lung cancer were structurally

similar to one or more drugs in TOP (Figure 5-4, right). These drugs were not evaluated as part

of the NCI-60 project; furthermore, 9 of 11 appear in fewer than 20 PubMed abstracts concerned

with cancer. These are novel candidate anticancer therapeutics identified by our computational screen.
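For reference, the Tanimoto coefficient of two binary fingerprints is the number of shared "on" bits divided by the number of bits set in either fingerprint. The Python sketch below is illustrative only: the fingerprint type used in the analysis is not specified in this section, so fingerprints are represented here as plain sets of bit indices, and the function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similar_to_any(query, references, threshold=0.8):
    """Names of reference fingerprints whose similarity to `query` exceeds `threshold`."""
    return [name for name, fp in references.items()
            if tanimoto(query, fp) > threshold]
```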

5.3.6 Prioritizing drugs by shared target: thirty-eight significant drugs share a

protein target with one or more TOP drugs

We used drug-target data from DrugBank [205] and ChemBank [206] (as provided in MANTRA

[154]) to construct a drug-drug interaction network on the set of CMap drugs; two drugs are

linked by an edge if they share one or more protein targets (Figure 5-5). In total, 83 of the

significant drugs were present in this network (the protein targets of many drugs are still

unknown), including 9 TOP drugs. Thirty-eight significant drugs that were not tested in the NCI-

60 collection share one or more protein targets with a TOP drug (Figure 5-5A, purple and green

nodes), indicating they may have a similar mode of action and may inhibit growth in lung cancer

cell lines.
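The shared-target network described above can be sketched in Python. The function names and the input layout (a mapping from each drug to its set of protein targets) are assumptions made for illustration; the second helper computes the size of the largest connected component on a chosen subset of drugs, the quantity examined in Figure 5-5B.

```python
from itertools import combinations

def shared_target_edges(targets):
    """targets: dict drug -> set of protein targets.

    Returns the set of drug pairs that share at least one target.
    """
    return {frozenset((a, b)) for a, b in combinations(targets, 2)
            if targets[a] & targets[b]}

def largest_component(nodes, edges):
    """Size of the largest connected component among `nodes`,
    using only the edges whose two endpoints are both in `nodes`."""
    nodes = set(nodes)
    neighbours = {n: set() for n in nodes}
    for edge in edges:
        a, b = tuple(edge)
        if a in nodes and b in nodes:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        best = max(best, size)
    return best
```

Calling `largest_component` on repeated random draws of drugs from the network would give a null distribution against which the observed component size can be compared.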

Seven of these 38 drugs were also found to be structurally similar to TOP drugs (Figure 5-5A,

green nodes): prochlorperazine, promazine, trifluoperazine, fluspirilene, phenindione,

vidarabine, and chlorpromazine. As these drugs are linked to TOP drugs by two separate lines of

evidence, they are promising candidates for further biological validation.

Figure 5-5. Significant drugs share many protein targets.

A. In the drug-target network for drug candidates, two drugs are connected by an edge if they have the

same protein target. Shown in colour are the drugs that slow growth in 5 or more lung cancer cell lines

(blue), their immediate neighbours (purple), and the drugs that are structurally similar to them (green).

Green edges indicate drug pairs that, in addition to sharing a protein target, were also found to be highly

structurally similar (see Figure 5-4). B. 83 significant drugs are represented in the drug-target network,

and the largest connected component contains 72 drugs. 10,000 random draws of 83 drugs from the

drug-target network resulted in smaller connected components (median size 42 drugs; P << 0.01).

5.3.7 Common protein targets of significant drugs

The largest connected component in the drug-target interaction network comprised 72 drugs,

which is significantly larger (P << 0.01) than what would be expected by chance; random sets of

83 drugs in the drug-drug network yield largest connected components with a median size of

only 42 drugs (Figure 5-5B). This indicates that some gene targets are overrepresented among

significant drugs; these genes may be valuable drug targets for lung cancer. We applied the

hypergeometric test to each gene target of a significant drug and identified ten over-represented

targets (P < 0.05; Table 5-1).
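The over-representation test can be illustrated with a minimal Python sketch of the upper-tail hypergeometric probability. The function and parameter names are hypothetical, and the background population is an assumption of the example: this section does not state which set of drugs was used as the background, so the sketch is not expected to reproduce the P values in Table 5-1.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k), where X is the number of significant drugs among the n drugs
    hitting a target, given K significant drugs in a background of N drugs."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / total
```

For a target hit by `n` background drugs, `k` of which are significant, a small value indicates that the target is hit by significant drugs more often than expected by chance.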

The top over-represented gene is calmodulin 1 (CALM1), a gene involved in the cell cycle and in

signal transduction; it is a target of 9 CMap drugs, and we found that 8 of these reverse the

transcriptional changes seen with lung cancer. Recent research suggests that CBP501, a drug

currently in Phase II clinical trials for NSCLC, may sensitize tumors to the chemotherapeutic

agents bleomycin and cisplatin by inhibiting CALM1 [25]. Thus, other significant drugs that target

CALM1 may also enhance the effect of chemotherapy. The 8 drugs we identified are bepridil,

felodipine, flunarizine, fluphenazine, loperamide, phenoxybenzamine, pimozide, and

miconazole.

The second-most overrepresented gene is PLA2G4A, whose protein product is a member of the

cytosolic phospholipase A2 family. Cytosolic phospholipase A2 (cPLA2) has been previously

implicated in cancer progression and metastasis. Furthermore, in a mouse model of lung cancer,

the inhibition of cPLA2 activity led to delayed tumour growth [207, 208]. There are 4 drugs

targeting PLA2G4A included in the CMap collection, and all 4 significantly reverse lung cancer

gene changes in our analyses: flunisolide, fluocinonide, fluorometholone, and medrysone.

Table 5-1. Common protein targets of candidate drugs.

Gene       P value     Count (significant CMap drugs)   Count (all CMap drugs)
CALM1      5.42E-07    8                                9
PLA2G4A    0.000288    4                                4
DRD2       0.0006      12                               34
HTR2A      0.000614    9                                21
SERPINA6   0.001465    5                                8
ABCC8      0.002252    3                                3
CYP3A3     0.002252    3                                3
SLC6A4     0.002321    7                                16
KCNH2      0.002956    5                                9
SLC6A2     0.005594    6                                14
ADRA1A     0.008209    8                                24
ABCB1      0.008724    5                                11
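The per-target enrichment test behind Table 5-1 can be illustrated with a small upper-tail hypergeometric calculation. This is a sketch under stated assumptions, not the original R code: the function name hypergeom_upper_tail is hypothetical, and the population size (the number of CMap drugs with any annotated target) is not restated in this section, so it has to be supplied by the reader.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken without replacement from
    `population` items, of which `successes` carry the label of interest."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total
```

For a row of Table 5-1, `k` would be the count of significant drugs hitting the target (8 for CALM1), `successes` the count of all CMap drugs hitting it (9 for CALM1), `draws` the number of significant drugs, and `population` the number of annotated drugs.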

5.3.8 Significant drugs are broad-acting: they affect more genes than other drugs

We used the CMap gene expression profiles from before and after drug treatment to calculate the

number of genes differentially expressed in response to a drug, for each of the 1,309 drugs in the

collection (see Materials and Methods). We found that significant drugs affect a median of 8.5

genes, while other CMap drugs affect a median of only 3 (Figure 5-6; Wilcoxon rank-sum test P << 0.01).
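A two-group comparison of this kind can be sketched with a rank-sum test. The version below is a minimal illustration that uses the normal approximation with a tie correction and no continuity correction; it is an assumption of this sketch and may differ in detail from the R routine used for the reported P value. The function name rank_sum_test is hypothetical, and both groups are assumed to be non-empty.

```python
from math import erfc, sqrt


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney) test.

    Returns (U statistic for x, approximate two-sided P value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the block of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values tied: no evidence of a difference
    z = (u - mean_u) / sqrt(var_u)
    return u, erfc(abs(z) / sqrt(2.0))
```

Here `x` would hold the number of differentially expressed genes for each significant drug and `y` the same quantity for the remaining CMap drugs.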

Figure 5-6. Significant drugs affect more genes than other Connectivity Map drugs.

We used CMap data to calculate the number of genes that were significantly differentially regulated (P <

0.05) for each of 1,309 drugs. Drugs that we identified as reversing the gene changes seen with lung

cancer affected significantly more genes than other drugs (median of 8.5 vs. 3 genes; Wilcoxon rank-sum test

P << 0.01).

5.3.9 Many drugs are indicated for lung cancer independently of subtype

We investigated the top drugs that revert expression changes in different lung cancer subtypes by

running CMapBatch on the two largest signature subsets in our collection, adenocarcinoma (10

signatures) and squamous cell carcinoma (6 signatures). We found a very high concordance

among top drugs; 79 drugs are common to the top-100 drug lists for adenocarcinoma and

squamous cell carcinoma. Furthermore, all 79 drugs are significant in the full 21-signature

meta-analysis.

5.4 Conclusions

Dozens of distinct gene signatures are available for many diseases. We developed CMapBatch to

efficiently integrate these data with the Connectivity Map to automate drug repurposing and

identify stable lists of candidate therapeutics. Using the example of lung cancer, we showed that

CMapBatch improved on previous strategies for drug repurposing based on the analysis of

individual gene signatures. We have made our method publicly available as an online tool at

http://ophid.utoronto.ca/cmapbatch.

5.5 Methods

5.5.1 Code and software

Code for all analyses was written in R 2.14.0. We converted gene names to HG-U133A probeset

IDs for Connectivity Map analysis using the hgu133a.db package (Bioconductor 2.8). The drug-target and

mode of action networks were analyzed using igraph (Bioconductor 2.8) and visualized using

NAViGaTOR 2.2.1 [141], and drug structures were visualized with PyMOL [209]. We

calculated Tanimoto similarity for all pairs of the 1,148 CMap drugs for which PubChem IDs were

available using the PubChem Chemical Structure Clustering Tool [182].
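For readers unfamiliar with the measure, Tanimoto similarity between two binary structural fingerprints is the size of their intersection divided by the size of their union. The sketch below is illustrative only: the similarities in this chapter were computed by the PubChem tool, and the representation of a fingerprint as a set of "on" bit positions, as well as the function names, are assumptions of the example.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pairwise_tanimoto(fingerprints):
    """Similarity for every unordered pair of drugs in {drug: fingerprint}."""
    return {
        (d1, d2): tanimoto(fingerprints[d1], fingerprints[d2])
        for d1, d2 in combinations(sorted(fingerprints), 2)
    }
```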

5.5.2 Data sources

Transcriptional signatures of lung cancer. We downloaded 21 gene signatures of lung cancer

from CDIP version 1.0, the Cancer Data Integration Portal (http://ophid.utoronto.ca/cdip/). We

included signatures from all lung cancer vs. normal comparisons where 10 or more genes were

found to be differentially up- and down-regulated. For Oncomine signatures, we sorted up- and

down-regulated genes by adjusted P value, using a threshold of FDR ≤ 0.05; we retained only

the top 250 up-regulated and top 250 down-regulated genes.
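The filtering steps above can be sketched as follows. This is an illustration rather than the original R code: the record layout (gene, direction, adjusted P value) and the function names are hypothetical, and the inclusion rule is read here as requiring at least 10 genes in each direction, which is an interpretation of the text.

```python
def build_signature(genes, fdr_cutoff=0.05, top_n=250):
    """Keep the top_n most significant up- and down-regulated genes.

    genes: iterable of (name, direction, adjusted_p), direction 'up' or 'down'.
    """
    signature = {}
    for direction in ("up", "down"):
        kept = [g for g in genes if g[1] == direction and g[2] <= fdr_cutoff]
        kept.sort(key=lambda g: g[2])  # most significant first
        signature[direction] = [g[0] for g in kept[:top_n]]
    return signature


def has_enough_genes(signature, min_genes=10):
    """Inclusion rule: at least min_genes in both the up and the down list."""
    return all(len(signature[d]) >= min_genes for d in ("up", "down"))
```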

Drug-response data. We downloaded rankMatrix, containing the ranks of genes in response to

6,100 drug treatments (corresponding to 1,309 unique drugs), from Connectivity Map Build 02 at

http://www.broadinstitute.org/cmap/.

Interaction networks. We downloaded the drug-target interaction network, where two drugs

share an edge if they share a physical binding partner, from MANTRA [154]. We visualized the

drug-target interaction network with NAViGaTOR 2.2.1 [184].

Lists of genes differentially regulated by CMap drugs. We downloaded lists of genes

significantly up- or down-regulated by CMap drugs from [210].

5.5.3 Connectivity map analysis of lung cancer signatures

Mapping gene names to probeset IDs. We mapped human gene IDs to Affymetrix HG-U133A

IDs for connectivity map analysis following previously established protocols [3].

Calculating mean connectivity scores for each signature. For each lung cancer signature, mean

connectivity scores for 1,309 drugs were calculated as previously described [3] and converted to

ranks.
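The step from per-instance connectivity scores to per-drug ranks can be sketched as below. The sketch assumes that each treatment instance already has a connectivity score for the signature, that rank 1 is given to the most negative mean score (the strongest reversal of the signature), and that ties are broken by drug name; these choices and the function name drug_ranks are assumptions of the example.

```python
from collections import defaultdict


def drug_ranks(instance_scores):
    """Average instance-level scores per drug and convert the means to ranks.

    instance_scores: iterable of (drug, connectivity_score), one per instance.
    Returns {drug: rank}, where rank 1 is the most negative mean score.
    """
    by_drug = defaultdict(list)
    for drug, score in instance_scores:
        by_drug[drug].append(score)
    means = {drug: sum(vals) / len(vals) for drug, vals in by_drug.items()}
    ordered = sorted(means, key=lambda drug: (means[drug], drug))
    return {drug: position + 1 for position, drug in enumerate(ordered)}
```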

5.5.4 Meta-analysis of drug-response data

Combining ranked lists of drugs to construct a consensus ranked list. We adapted the Rank

Product method [17] to identify drugs that consistently reverse the transcriptional changes seen

with lung cancer across a large collection of signatures. For each drug, we calculated the product

of its ranks in all lung cancer signatures.
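A minimal sketch of this consensus step is given below. It takes one rank table per signature and multiplies each drug's ranks, as described above; the function names are hypothetical, and restricting the calculation to drugs ranked in every signature is an assumption of the sketch.

```python
from math import prod


def rank_products(rank_lists):
    """Multiply each drug's ranks across signatures.

    rank_lists: one {drug: rank} dict per signature (rank 1 = best).
    Only drugs ranked in every signature are scored.
    """
    shared = set(rank_lists[0])
    for ranks in rank_lists[1:]:
        shared &= set(ranks)
    return {drug: prod(ranks[drug] for ranks in rank_lists) for drug in shared}


def consensus_order(rank_lists):
    """Drugs sorted from smallest to largest rank product (ties by name)."""
    products = rank_products(rank_lists)
    return sorted(products, key=lambda drug: (products[drug], drug))
```

A drug that ranks near the top in most signatures receives a small product, so the head of the consensus list holds the most consistent candidates.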

Identifying drugs with significantly small rank products. We randomly permuted the assignment

of connectivity scores to drugs for the 6,100 instances (drug treatments), recalculated mean

scores and drug ranks for 1,309 drugs in each signature, and re-calculated randomized rank

products 10,000 times. We used this background distribution to calculate p-values and estimate

false discovery rates.
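The permutation procedure can be sketched as follows. This is a simplified illustration, not the original implementation: it assumes scores are shuffled independently within each signature, that each drug's P value is the fraction of permutations in which its own null rank product is at least as small as the observed one (with an add-one correction), and that rank 1 goes to the most negative mean score. All function names are hypothetical, and false discovery rate estimation is left out.

```python
import random
from collections import defaultdict
from math import prod


def rank_by_mean(drug_labels, scores):
    """Rank drugs by mean instance score (rank 1 = most negative mean)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for drug, score in zip(drug_labels, scores):
        sums[drug] += score
        counts[drug] += 1
    means = {drug: sums[drug] / counts[drug] for drug in sums}
    ordered = sorted(means, key=lambda drug: (means[drug], drug))
    return {drug: position + 1 for position, drug in enumerate(ordered)}


def rank_product(drug_labels, score_matrix):
    """score_matrix holds one list of per-instance scores per signature."""
    per_signature = [rank_by_mean(drug_labels, s) for s in score_matrix]
    return {d: prod(r[d] for r in per_signature) for d in per_signature[0]}


def permutation_p_values(drug_labels, score_matrix, n_perm=10000, seed=0):
    rng = random.Random(seed)
    observed = rank_product(drug_labels, score_matrix)
    as_small = {drug: 0 for drug in observed}
    for _ in range(n_perm):
        # Reassign scores to instances at random within each signature.
        shuffled = [rng.sample(s, len(s)) for s in score_matrix]
        null = rank_product(drug_labels, shuffled)
        for drug in observed:
            if null[drug] <= observed[drug]:
                as_small[drug] += 1
    return {drug: (as_small[drug] + 1) / (n_perm + 1) for drug in observed}
```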

5.5.5 NCI-60 analysis of significant drugs

We restricted our analyses to the NCI-60 GI50 (50% growth inhibition) data and to those lung

cancer cell lines where at least 100 Connectivity Map drugs were tested (there were nine of

these, all NSCLC: NCI-H23, NCI-H522, A549/ATCC, EKVX, NCI-H226, NCI-H322M, NCI-H460,

HOP-62, HOP-92). As different GI50 thresholds were used to denote minimal activity in

response to a drug for different concentration ranges, we filtered the data to make results

comparable across drugs. We retained only those entries with an LCONC (maximum log10

concentration) of -4 and where the drug concentration was measured in units of molarity.

5.6 Acknowledgements

This work was supported in part by the Ontario Research Fund (GL2-01-030 and RE-03-020), the

Canadian Institutes of Health Research (BIO-99745), the Canada Foundation for Innovation (CFI

#12301 and #203383), the Canada Research Chairs Program, and IBM to IJ, and the Ontario

Ministry of Health and Long-Term Care. The views expressed do not necessarily reflect those of

the OMOHLTC.

5.7 Supplementary Material

Table 5-S1. Twenty-one lung cancer gene signatures (tumour vs. normal comparisons).

Histology PMID Source

Adenocarcinoma 18992152 CDIP

Adenocarcinoma 16549822 CDIP

Adenocarcinoma 18927117 CDIP

Adenocarcinoma 12118244 Oncomine

Adenocarcinoma 11707567 Oncomine

Carcinoid 11707567 Oncomine

Small Cell 11707567 Oncomine

Squamous 11707567 Oncomine

Large Cell 11707590 Oncomine

Adenocarcinoma 11707590 Oncomine

Small Cell 11707590 Oncomine

Squamous 11707590 Oncomine

Large Cell 20421987 Oncomine

Adenocarcinoma 20421987 Oncomine

Squamous 20421987 Oncomine

Adenocarcinoma 18297132 Oncomine

Adenocarcinoma 16314486 Oncomine

Adenocarcinoma 17540040 Oncomine

Squamous 15833835 Oncomine

Squamous 16188928 Oncomine

Squamous 14581339 Oncomine

6 General conclusions and significance

6.1 Conclusions

Aging and disease are complex and heterogeneous biological processes. HTP technologies

invented in the past two decades are only just beginning to provide us with a comprehensive

genome-wide view of how aging and disease affect cells, tissues, and organisms, from the

perspectives of transcription, translation, methylation, and several other modalities. New analysis

methods that integrate these comprehensive and complementary data have enormous potential to

transform our understanding of the basic mechanisms of aging and disease and to suggest new

and better therapies to treat their pathological effects.

As discussed in Chapter 1, there are several major challenges to extracting the maximum

information from these new data. HTP data are noisy, and analysis techniques designed for

small-scale biological experiments often do not translate well to the setting of ‘big data’. In the

four research chapters of this thesis, I described the development and application of new analysis

methods and strategies to the problems of identifying biomarkers and therapeutics for aging and

disease.

Chapters 2, 3, and 5 made novel methodological contributions, and in Chapter 4 existing

bioinformatics methods were applied to gain insight into a particular biological problem. In

Chapter 2, I proposed a novel algorithm for identifying subnetwork biomarkers of aging. In this

algorithm, biomarkers are networks of genes selected based on a score that takes into account

age-dependent activity (Spearman correlation of subnetwork activity with age) and a locally-

defined graph-theoretic measure of modularity. Subnetworks are grown starting from seed genes

in an interaction network; at each stage of the growth procedure, the algorithm considers all

network neighbors of the current subnetwork, and greedily maximizes subnetwork score by

adding the neighbor leading to the largest score increase. Subnetworks identified with this

algorithm outperformed previous ones on key measures, yielding biomarkers that were more

conserved across studies and performed better on a machine learning task (predicting age based

on expression data using Support Vector Regression algorithms). This work was the first to use a

new subnetwork performance criterion that incorporates modularity into the expression for

subnetwork score, and the first to integrate network information with gene expression data to

identify biomarkers of aging. In Chapters 3 and 5, I developed the CMapBatch tool; CMapBatch

takes advantage of the large quantity of public gene expression data to help speed drug

discovery. In CMapBatch, Kolmogorov-Smirnov statistics are first calculated to determine

connectivity scores that link individual gene signatures of some disease (e.g. lung cancer) to

1,309 drugs from the CMap collection [75]. Connectivity scores reflect the extent to which drug

treatments reverse (or mimic) the gene expression changes in the query signature. The drugs are

next ranked by connectivity score for each signature, and finally an adapted Rank Product

method [67] is applied to combine the ranked lists and identify drugs that are consistently highly

ranked as the best therapeutics for the disease across a large set of independent signatures of

disease. CMapBatch produced more stable sets of drug candidates for lung cancer than previous

methods (Chapter 5), and in silico validation revealed that CMapBatch drug candidates

significantly inhibited growth in all nine lung cancer cell lines tested. In Chapter 4, I

applied existing methods of gene-set and drug-network analysis to study drug effects at the level

of systems, networks, and phenotypes in S. cerevisiae, and built the NetwoRx web portal to store

these data.
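The greedy growth procedure summarized above for Chapter 2 can be sketched in outline. This is a schematic illustration only: the scoring function is passed in as a placeholder, whereas the thesis score combines the Spearman correlation of subnetwork activity with age and a modularity term that is not reproduced here. The stopping rule (stop when no neighbor improves the score), the size cap, and the function name grow_subnetwork are assumptions of the sketch.

```python
def grow_subnetwork(seed, neighbors, score, max_size=20):
    """Grow a subnetwork from a seed gene by greedy score maximization.

    neighbors: {gene: iterable of adjacent genes} for the interaction network.
    score: callable taking a frozenset of genes and returning a number.
    """
    current = {seed}
    best = score(frozenset(current))
    while len(current) < max_size:
        # Collect all network neighbors of the current subnetwork.
        frontier = set()
        for gene in current:
            frontier |= set(neighbors.get(gene, ()))
        frontier -= current
        if not frontier:
            break
        # Score every candidate extension and keep the best (ties by gene name).
        gain, gene = max(
            (score(frozenset(current | {g})), g) for g in sorted(frontier)
        )
        if gain <= best:
            break  # no neighbor increases the score
        current.add(gene)
        best = gain
    return current, best
```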

The four projects described in this thesis were tested in a range of model systems – yeast, worm,

mouse, and human cell lines – and were concerned with a variety of distinct computational tasks

using diverse HTP data sources. The common threads running through all four were (1) the shift

of focus away from single genes to the systems level, and (2) the integration of complementary

HTP data sources.

6.1.1 From genes to pathways, phenotypes, and networks

Past work has borne out the hypothesis that shifting focus away from individual genes and

towards more holistic gene modules, networks, pathways, and phenotypes can bring several

advantages – as discussed in Chapter 1, systems-level differences tend to be more reproducible

across studies, and biomarkers based on modules or gene groups can perform better on

classification tasks. Systems-level analyses played central roles in each of the four projects that

constitute this thesis. For example, in Chapter 2 I showed that high-throughput information about

the higher-level associations between genes – in the form of a functional interaction network –

can yield new insights into the transcriptional programs of aging. I identified modular

subnetworks associated with worm aging – highly interconnected groups of genes that change

activity with age – and showed that they are effective biomarkers for predicting worm age on the

basis of gene expression. And in Chapter 4, I built the NetwoRx web portal to facilitate the

systems-level interrogation of yeast chemogenomic data. I illustrated with examples how

NetwoRx can be applied to analyze mode-of-action of cancer drugs, repurpose psychoactive

drugs, and predict new drugs that modulate yeast aging.

6.1.2 Integrating complementary HTP data sources

As reviewed in Chapter 1, noise and biological heterogeneity complicate the analysis of HTP

data. Chapters 2-5 relied on HTP data integration to reduce noise and identify more robust or

accurate biomarkers and therapeutics. For example, dozens of distinct gene signatures are

available for many diseases. I developed CMapBatch to efficiently integrate these data with the

Connectivity Map to automate drug repurposing and identify stable lists of candidate

therapeutics, and applied this method to identify candidate calorie restriction mimetics (Chapter

3) and lung cancer therapeutics (Chapter 5). Using the example of lung cancer, I showed that

CMapBatch improved on previous strategies for drug repurposing based on the analysis of

individual gene signatures (Chapter 5). In most projects, I also integrated HTP data of multiple

types, e.g., gene expression data with genome-wide RNAi phenotypes (Chapter 2) and large-scale

drug-induced growth-inhibition data (Chapter 5).

6.2 Open questions and future work

In the future, more and different HTP data will enable the development of improved biomarkers

and therapeutics for aging and disease.

6.2.1 Limitations in HTP data

The quality of any integrative computational analysis is necessarily limited by the data that are

available. For example, in the case of intra- and inter-tumor heterogeneity, we may sometimes

have too few samples or too much noise to be able to develop cancer signatures with the required

accuracy and reproducibility; similarly, existing microarray studies of aging that sample only

two time-points (old vs. young animals) may not contain sufficient information for the accurate

modeling of complex age-related biological processes. To take a couple of specific examples

relevant to the work in this thesis, Chapter 4 uses chemogenomic data taken from experiments in

S. cerevisiae. Importantly, these screens can be used only with compounds that are bioactive in

yeast, can capture only those drug-gene interactions that affect growth, and are relevant to

disease only for those human proteins that have yeast homologs. Likewise, Chapters 3 and 5 depend on

the Connectivity Map, which contains data on the transcriptional response to drugs in human cell

lines. Cell lines are an imperfect model for in vivo drug response.

6.2.2 Future work

As HTP technologies become cheaper and more widely adopted, the biological states of health,

aging and disease will be sampled to a density sufficient to support sophisticated new analyses

and models that will transform the field of translational medicine. Furthermore, newer

technologies such as whole genome sequencing will soon provide entirely different perspectives

on aging and disease. For example, roughly 100 cancer genomes have been sequenced so far –

most of these just within the past year – and several major projects are underway, which should

see that number quickly increase. The International Cancer Genome Consortium (ICGC) plans to

sequence 500 tumors from each of 50 different cancers [211], and the Cancer Genome Atlas

(TCGA) will sequence more than 20 different tumor types in the next 5 years [212]. Making

these data and related clinical information publicly available will significantly contribute to our

understanding of the molecular changes in disease, enabling new discoveries as well as more

comprehensive validation of novel prognostic and predictive signatures. The rapid pace of

technology development will need to be matched by the equally rapid development of new tools

and algorithms to handle the large volumes of HTP data and integrate them with existing

knowledge.

The four research projects described in this thesis represent preliminary steps towards a true

systems biology understanding of aging, disease, and drug response. The methods developed in

all four projects are general, in the sense that they could be applied to a number of distinct

biological problems and domains. These and similar approaches that leverage the large quantity

of public HTP data on drugs, aging, and disease can substantially accelerate the identification

and development of new biomarkers and therapeutics.

7 References

1. Wieser D, Papatheodorou I, Ziehm M, Thornton JM: Computational biology for

ageing. Philos Trans R Soc Lond B Biol Sci 2011, 366:51-63.

2. Zahn JM, Poosala S, Owen AB, Ingram DK, Lustig A, Carter A, Weeraratna AT, Taub

DD, Gorospe M, Mazan-Mamczarz K, et al: AGEMAP: a gene expression database for

aging in mice. PLoS Genetics 2007, 3:e201.

3. Zahn JM, Kim SK: Systems biology of aging in four species. Current Opinion in

Biotechnology 2007, 18:355-359.

4. Antosh M, Whitaker R, Kroll A, Hosier S, Chang C, Bauer J, Cooper L, Neretti N,

Helfand SL: Comparative transcriptional pathway bioinformatic analysis of dietary

restriction, Sir2, p53 and resveratrol life span extension in Drosophila. Cell Cycle

2011, 10:904-911.

5. Estep PW, 3rd, Warner JB, Bulyk ML: Short-term calorie restriction in male mice

feminizes gene expression and alters key regulators of conserved aging regulatory

pathways. PLoS One 2009, 4:e5242.

6. Spindler SR, Mote PL: Screening candidate longevity therapeutics using gene-

expression arrays. Gerontology 2007, 53:306-321.

7. Rhodes DR, Chinnaiyan AM: Integrative analysis of the cancer transcriptome. Nat

Genet 2005, 37 Suppl:S31-37.

8. Pegram MD, Lipton A, Hayes DF, Weber BL, Baselga JM, Tripathy D, Baly D,

Baughman SA, Twaddell T, Glaspy JA, Slamon DJ: Phase II study of receptor-

enhanced chemosensitivity using recombinant humanized anti-p185HER2/neu

monoclonal antibody plus cisplatin in patients with HER2/neu-overexpressing

metastatic breast cancer refractory to chemotherapy treatment. J Clin Oncol 1998,

16:2659-2671.

9. Slamon DJ, Press MF: Alterations in the TOP2A and HER2 genes: association with

adjuvant anthracycline sensitivity in human breast cancers. J Natl Cancer Inst 2009,

101:615-618.

10. Lowe JA, Jones P, Wilson DM: Network biology as a new approach to drug

discovery. Curr Opin Drug Discov Devel 2010, 13:524-526.

11. Buyse M, Loi S, van't Veer L, Viale G, Delorenzi M, Glas AM, d'Assignies MS, Bergh J,

Lidereau R, Ellis P, et al: Validation and clinical utility of a 70-gene prognostic

signature for women with node-negative breast cancer. J Natl Cancer Inst 2006,

98:1183-1192.

12. Spentzos D, Levine DA, Ramoni MF, Joseph M, Gu X, Boyd J, Libermann TA,

Cannistra SA: Gene expression signature with independent prognostic significance in

epithelial ovarian cancer. J Clin Oncol 2004, 22:4700-4710.

13. Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta KJ,

Rubin MA, Chinnaiyan AM: Delineation of prognostic biomarkers in prostate cancer.

Nature 2001, 412:822-826.

14. Zhu CQ, Strumpf D, Li CY, Li Q, Liu N, Der S, Shepherd FA, Tsao MS, Jurisica I:

Prognostic gene expression signature for squamous cell carcinoma of lung. Clin

Cancer Res 2010, 16:5038-5047.

15. Fraser HB, Khaitovich P, Plotkin JB, Pääbo S, Eisen MB: Aging and Gene Expression

in the Primate Brain. PLoS Biology 2005, 3:e274.

16. Flachsbart F, Caliebe A, Kleindorp R, Blanche H, von Eller-Eberstein H, Nikolaus S,

Schreiber S, Nebel A: Association of FOXO3A variation with human longevity

confirmed in German centenarians. Proc Natl Acad Sci U S A 2009, 106:2700-2705.

17. Christensen K, Johnson TE, Vaupel JW: The quest for genetic determinants of human

longevity: challenges and insights. Nat Rev Genet 2006, 7:436-448.

18. Kleindorp R, Flachsbart F, Puca AA, Malovini A, Schreiber S, Nebel A: Candidate gene

study of FOXO1, FOXO4, and FOXO6 reveals no association with human longevity

in Germans. Aging Cell 2011, 10:622-628.

19. Lau SK, Boutros PC, Pintilie M, Blackhall FH, Zhu CQ, Strumpf D, Johnston MR,

Darling G, Keshavjee S, Waddell TK, et al: Three-gene prognostic classifier for early-

stage non small-cell lung cancer. J Clin Oncol 2007, 25:5562-5569.

20. Dupuy A, Simon RM: Critical review of published microarray studies for cancer

outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst 2007,

99:147-157.

21. Diamandis EP: Cancer biomarkers: can we turn recent failures into success? J Natl

Cancer Inst 2010, 102:1462-1467.

22. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia

JG, Geoghegan J, Germino G, et al: Multiple-laboratory comparison of microarray

platforms. Nat Methods 2005, 2:345-350.

23. Bell AW, Deutsch EW, Au CE, Kearney RE, Beavis R, Sechi S, Nilsson T, Bergeron JJ:

A HUPO test sample study reveals common problems in mass spectrometry-based

proteomics. Nat Methods 2009, 6:423-430.

24. Auffray C, Chen Z, Hood L: Systems medicine: the future of medical genomics and

healthcare. Genome Med 2009, 1:2.

25. Augen J: Information technology to the rescue! Nat Biotechnol 2001, 19 Suppl:BE39-

40.

26. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF,

Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al: Towards a proteome-scale map

of the human protein-protein interaction network. Nature 2005, 437:1173-1178.

27. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M,

Zenkner M, Schoenherr A, Koeppen S, et al: A human protein-protein interaction

network: a resource for annotating the proteome. Cell 2005, 122:957-968.

28. Braun P, Tasan M, Dreze M, Barrios-Rodiles M, Lemmens I, Yu H, Sahalie JM, Murray

RR, Roncari L, de Smet AS, et al: An experimentally derived confidence score for

binary protein-protein interactions. Nat Methods 2009, 6:91-97.

29. Gstaiger M, Aebersold R: Applying mass spectrometry-based proteomics to genetics,

genomics and network biology. Nat Rev Genet 2009, 10:617-627.

30. Kislinger T, Cox B, Kannan A, Chung C, Hu P, Ignatchenko A, Scott MS, Gramolini

AO, Morris Q, Hallett MT, et al: Global survey of organ and organelle protein

expression in mouse: combined proteomic and transcriptomic profiling. Cell 2006,

125:173-186.

31. Bachtiary B, Boutros PC, Pintilie M, Shi W, Bastianutto C, Li JH, Schwock J, Zhang W,

Penn LZ, Jurisica I, et al: Gene Expression Profiling in Cervical Cancer: An

Exploration of Intratumor Heterogeneity. Clin Cancer Res 2006, 12:5632-5640.

32. Blackhall FH, Pintilie M, Wigle D, Jurisica I, Liu N, Radulovitch N, Keshavjee S,

Johnston M, Shepherd FA, Tsao M-S: Stability and heterogeneity of expression

profiles in lung cancer specimens harvested following surgical resection. Neoplasia

2004, 6:761-767.

33. Axelrod DE, Miller N, Chapman JA: Avoiding Pitfalls in the Statistical Analysis of

Heterogeneous Tumors. Biomed Inform Insights 2009, 2:11-18.

34. Jubb AM, Buffa FM, Harris AL: Assessment of tumour hypoxia for prediction of

response to therapy and cancer prognosis. J Cell Mol Med 2010, 14:18-29.

35. Cleator SJ, Powles TJ, Dexter T, Fulford L, Mackay A, Smith IE, Valgeirsson H,

Ashworth A, Dowsett M: The effect of the stromal component of breast tumours on

prediction of clinical outcome using gene expression microarray analysis. Breast

Cancer Res 2006, 8:R32.

36. Myhre S, Mohammed H, Tramm T, Alsner J, Finak G, Park M, Overgaard J, Borresen-

Dale AL, Frigessi A, Sorlie T: In silico ascription of gene expression differences to

tumor and stromal cells in a model to study impact on breast cancer outcome. PLoS

One 2010, 5:e14002.

37. Fend F, Raffeld M: Laser capture microdissection in pathology. J Clin Pathol 2000,

53:666-672.

38. Chandran UR, Dhir R, Ma C, Michalopoulos G, Becich M, Gilbertson J: Differences in

gene expression in prostate cancer, normal appearing prostate tissue adjacent to

cancer and prostate tissue from cancer free organ donors. BMC Cancer 2005, 5:45.

39. Bahar R, Hartmann CH, Rodriguez KA, Denny AD, Busuttil RA, Dolle ME, Calder RB,

Chisholm GB, Pollock BH, Klein CA, Vijg J: Increased cell-to-cell variation in gene

expression in ageing mouse heart. Nature 2006, 441:1011-1014.

40. Li Z, Wright FA, Royland J: Age-dependent variability in gene expression in male

Fischer 344 rat retina. Toxicol Sci 2009, 107:281-292.

41. Somel M, Khaitovich P, Bahn S, Paabo S, Lachmann M: Gene expression becomes

heterogeneous with age. Curr Biol 2006, 16:R359-360.

42. de Magalhaes JP, Curado J, Church GM: Meta-analysis of age-related gene expression

profiles identifies common signatures of aging. Bioinformatics 2009, 25:875-881.

43. Elias JE, Haas W, Faherty BK, Gygi SP: Comparative evaluation of mass

spectrometry platforms used in large-scale proteomics investigations. Nat Methods

2005, 2:667-675.

44. Tan PK, Downey TJ, Spitznagel EL, Jr., Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka

BM, Cam MC: Evaluation of gene expression measurements from commercial

microarray platforms. Nucleic Acids Res 2003, 31:5676-5684.

45. Curtis C, Lynch AG, Dunning MJ, Spiteri I, Marioni JC, Hadfield J, Chin SF, Brenton

JD, Tavare S, Caldas C: The pitfalls of platform comparison: DNA copy number

array technologies assessed. BMC Genomics 2009, 10:588.

46. Barnes M, Freudenberg J, Thompson S, Aronow B, Pavlidis P: Experimental

comparison and cross-validation of the Affymetrix and Illumina gene expression

analysis platforms. Nucleic Acids Res 2005, 33:5914-5923.

47. Eggle D, Debey-Pascher S, Beyer M, Schultze JL: The development of a comparison

approach for Illumina bead chips unravels unexpected challenges applying newest

generation microarrays. BMC Bioinformatics 2009, 10:186.

48. Sandberg R, Larsson O: Improved precision and accuracy for microarrays using

updated probe set definitions. BMC Bioinformatics 2007, 8:48.

49. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering

statistically significant pathways in expression profiling studies. Proc Natl Acad Sci

U S A 2005, 102:13544-13549.

50. Zhang M, Yao C, Guo Z, Zou J, Zhang L, Xiao H, Wang D, Yang D, Gong X, Zhu J, et

al: Apparently low reproducibility of true differential expression discoveries in

microarray studies. Bioinformatics 2008, 24:2057-2063.

51. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich

A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a

knowledge-based approach for interpreting genome-wide expression profiles. Proc

Natl Acad Sci U S A 2005, 102:15545-15550.

52. Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast

cancer metastasis. Mol Syst Biol 2007, 3:140.

53. Fujita A, Gomes LR, Sato JR, Yamaguchi R, Thomaz CE, Sogayar MC, Miyano S:

Multivariate gene expression analysis reveals functional connectivity changes

between normal/tumoral prostates. BMC Syst Biol 2008, 2:106.

54. Zhu CQ, da Cunha Santos G, Ding K, Sakurada A, Cutz JC, Liu N, Zhang T, Marrano P,

Whitehead M, Squire JA, et al: Role of KRAS and EGFR as biomarkers of response

to erlotinib in National Cancer Institute of Canada Clinical Trials Group Study

BR.21. J Clin Oncol 2008, 26:4268-4275.

55. Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast

cancer: is there a unique set? Bioinformatics 2005, 21:171-178.

56. Ein-Dor L, Zuk O, Domany E: Thousands of samples are needed to generate a robust

gene list for predicting outcome in cancer. Proc Natl Acad Sci U S A 2006, 103:5923-

5928.

57. Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S,

Jurisica I, Giordano TJ, Misek DE, et al: Gene expression-based survival prediction in

lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 2008, 14:822-

827.

58. Boutros PC, Lau SK, Pintilie M, Liu N, Shepherd FA, Der SD, Tsao MS, Penn LZ,

Jurisica I: Prognostic gene signatures for non-small-cell lung cancer. Proc Natl Acad

Sci U S A 2009, 106:2824-2828.

59. van Vliet MH, Reyal F, Horlings HM, van de Vijver MJ, Reinders MJ, Wessels LF:

Pooling breast cancer datasets has a synergetic effect on classification performance

and improves signature stability. BMC Genomics 2008, 9:375.

60. Warnat P, Eils R, Brors B: Cross-platform analysis of cancer microarray data

improves gene expression based classification of phenotypes. BMC Bioinformatics

2005, 6:265.

61. Fierro AC, Vandenbussche F, Engelen K, Van de Peer Y, Marchal K: Meta Analysis of

Gene Expression Data within and Across Species. Curr Genomics 2008, 9:525-534.

62. Plank M, Wuttke D, van Dam S, Clarke SA, de Magalhaes JP: A meta-analysis of

caloric restriction gene expression profiles to infer common signatures and

regulatory mechanisms. Mol Biosyst 2012, 8:1339-1349.

63. Pan F, Chiu CH, Pulapura S, Mehan MR, Nunez-Iglesias J, Zhang K, Kamath K,

Waterman MS, Finch CE, Zhou XJ: Gene Aging Nexus: a web database and data

mining platform for microarray data on aging. Nucleic Acids Res 2007, 35:D756-759.

64. de Magalhaes JP, Budovsky A, Lehmann G, Costa J, Li Y, Fraifeld V, Church GM: The

Human Ageing Genomic Resources: online databases and tools for

biogerontologists. Aging Cell 2009, 8:65-72.

65. Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, Briggs BB,

Barrette TR, Anstet MJ, Kincead-Beal C, Kulkarni P, et al: Oncomine 3.0: genes,

pathways, and networks in a collection of 18,000 cancer gene expression profiles.

Neoplasia 2007, 9:166-180.

66. Culhane AC, Schwarzl T, Sultana R, Picard KC, Picard SC, Lu TH, Franklin KR, French

SJ, Papenhausen G, Correll M, Quackenbush J: GeneSigDB--a curated database of

gene expression signatures. Nucleic Acids Res 2010, 38:D716-725.

67. Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet

powerful, new method to detect differentially regulated genes in replicated

microarray experiments. FEBS Lett 2004, 573:83-92.

68. Hong F, Breitling R: A comparison of meta-analysis methods for detecting

differentially expressed genes in microarray experiments. Bioinformatics 2008,

24:374-382.

69. Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J: RankProd: a

bioconductor package for detecting differentially expressed genes in meta-analysis.

Bioinformatics 2006, 22:2825-2827.

70. Varambally S, Yu J, Laxman B, Rhodes DR, Mehra R, Tomlins SA, Shah RB, Chandran

U, Monzon FA, Becich MJ, et al: Integrative genomic and proteomic analysis of

prostate cancer reveals signatures of metastatic progression. Cancer Cell 2005,

8:393-406.

71. Gortzak-Uzan L, Ignatchenko A, Evangelou AI, Agochiya M, Brown KA, St Onge P,

Kireeva I, Schmitt-Ulms G, Brown TJ, Murphy J, et al: A proteome resource of ovarian

cancer ascites: integrated proteomic and bioinformatic analyses to identify putative

biomarkers. J Proteome Res 2008, 7:339-351.

72. Brown KR, Jurisica I: Unequal evolutionary conservation of human protein

interactions in interologous networks. Genome Biol 2007, 8:R95.

73. Ackermann M, Strimmer K: A general modular framework for gene set enrichment

analysis. BMC Bioinformatics 2009, 10:47.

74. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski

K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The

Gene Ontology Consortium. Nat Genet 2000, 25:25-29.

75. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP,

Subramanian A, Ross KN, et al: The Connectivity Map: using gene-expression

signatures to connect small molecules, genes, and disease. Science 2006, 313:1929-

1935.

76. De Preter K, De Brouwer S, Van Maerken T, Pattyn F, Schramm A, Eggert A,

Vandesompele J, Speleman F: Meta-mining of neuroblastoma and neuroblast gene

expression profiles reveals candidate therapeutic compounds. Clin Cancer Res 2009,

15:3690-3696.

77. Vilar E, Mukherjee B, Kuick R, Raskin L, Misek DE, Taylor JM, Giordano TJ, Hanash

SM, Fearon ER, Rennert G, Gruber SB: Gene expression patterns in mismatch repair-

deficient colorectal cancers highlight the potential therapeutic role of inhibitors of

the phosphatidylinositol 3-kinase-AKT-mammalian target of rapamycin pathway.

Clin Cancer Res 2009, 15:2829-2839.

78. Hu P, Bader G, Wigle DA, Emili A: Computational prediction of cancer-gene

function. Nat Rev Cancer 2007, 7:23-34.

79. Pena-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M,

Pagnani A, Kim WK, et al: A critical assessment of Mus musculus gene function

prediction using integrated genomic evidence. Genome Biol 2008, 9 Suppl 1:S2.

80. Lee I, Lehner B, Vavouri T, Shin J, Fraser AG, Marcotte EM: Predicting genetic

modifier loci using functional gene networks. Genome Res 2010, 20:1143-1153.

81. Mills GB, Jurisica I, Yarden Y, Norman JC: Genomic amplicons target vesicle

recycling in breast cancer. J Clin Invest 2009, 119:2123-2127.

82. Agarwal R, Gonzalez-Angulo AM, Myhre S, Carey M, Lee JS, Overgaard J, Alsner J,

Stemke-Hale K, Lluch A, Neve RM, et al: Integrative analysis of cyclin protein levels

identifies cyclin b1 as a classifier and predictor of outcomes in breast cancer. Clin

Cancer Res 2009, 15:3654-3662.

83. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein

networks. Nature 2001, 411:41-42.

84. Hahn MW, Kern AD: Comparative genomics of centrality and essentiality in three

eukaryotic protein-interaction networks. Mol Biol Evol 2005, 22:803-806.

85. Maslov S, Sneppen K: Specificity and stability in topology of protein networks.

Science 2002, 296:910-913.

86. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM,

Michon AM, Cruciat CM, et al: Functional organization of the yeast proteome by

systematic analysis of protein complexes. Nature 2002, 415:141-147.

87. Wuchty S: Topology and weights in a protein domain interaction network--a novel

way to predict protein interactions. BMC Genomics 2006, 7:122.

88. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical

organization of modularity in metabolic networks. Science 2002, 297:1551-1555.

89. Yu H, Paccanaro A, Trifonov V, Gerstein M: Predicting interactions in protein

networks by completing defective cliques. Bioinformatics 2006.

90. Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJ,

Cusick ME, Roth FP, Vidal M: Evidence for dynamically organized modularity in the

yeast protein-protein interaction network. Nature 2004, 430:88-93.

91. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U: Network motifs:

simple building blocks of complex networks. Science 2002, 298:824-827.

92. Rice JJ, Kershenbaum A, Stolovitzky G: Lasting impressions: motifs in protein-

protein maps may provide footprints of evolutionary events. Proc Natl Acad Sci U S

A 2005, 102:3173-3174.

93. Przulj N, Corneil DG, Jurisica I: Efficient estimation of graphlet frequency

distributions in protein-protein interaction networks. Bioinformatics 2006, 22:974-

980.

94. Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and signalling

circuits in molecular interaction networks. Bioinformatics 2002, 18 Suppl 1:S233-

240.

95. Nacu S, Critchley-Thorne R, Lee P, Holmes S: Gene expression network analysis and

applications to immunology. Bioinformatics 2007, 23:850-858.

96. Hwang YC, Lin CC, Chang JY, Mori H, Juan HF, Huang HC: Predicting essential

genes based on network and sequence analysis. Mol Biosyst 2009, 5:1672-1678.

97. Fortney K, Kotlyar M, Jurisica I: Inferring the functions of longevity genes with

modular subnetwork biomarkers of Caenorhabditis elegans aging. Genome Biol

2010, 11:R13.

98. Cover TM, Thomas JA: Elements of Information Theory. Wiley-Interscience; 1991.

99. Nibbe RK, Markowitz S, Myeroff L, Ewing R, Chance MR: Discovery and scoring of

protein interaction subnetworks discriminative of late stage human colon cancer.

Mol Cell Proteomics 2009, 8:827-845.

100. Nibbe RK, Koyuturk M, Chance MR: An integrative -omics approach to identify

functional sub-networks in human colorectal cancer. PLoS Comput Biol 2010,

6:e1000639.

101. Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S,

Ghosh D, Pandey A, Chinnaiyan AM: Probabilistic model of the human protein-

protein interaction network. Nat Biotechnol 2005, 23:951-959.

102. Radulovich N, Pham NA, Strumpf D, Leung L, Xie W, Jurisica I, Tsao MS: Differential

roles of cyclin D1 and D3 in pancreatic ductal adenocarcinoma. Mol Cancer 2010,

9:24.

103. Tomasini R, Tsuchihara K, Wilhelm M, Fujitani M, Rufini A, Cheung CC, Khan F, Itie-

Youten A, Wakeham A, Tsao MS, et al: TAp73 knockout shows genomic instability

with infertility and tumor suppressor functions. Genes Dev 2008, 22:2677-2691.

104. Sodek KL, Evangelou AI, Ignatchenko A, Agochiya M, Brown TJ, Ringuette MJ,

Jurisica I, Kislinger T: Identification of pathways associated with invasive behavior

by ovarian cancer cells using multidimensional protein identification technology

(MudPIT). Mol Biosyst 2008, 4:762-773.

105. Przulj N: Biological network comparison using graphlet degree distribution.

Bioinformatics 2007, 23:e177-183.

106. Zhu CQ, Ding K, Strumpf D, Weir BA, Meyerson M, Pennell N, Thomas RK, Naoki K,

Ladd-Acosta C, Liu N, et al: Prognostic and predictive gene signature for adjuvant

chemotherapy in resected non-small-cell lung cancer. J Clin Oncol 2010, 28:4417-

4424.

107. Zhu CQ, Pintilie M, John T, Strumpf D, Shepherd FA, Der SD, Jurisica I, Tsao M-S:

Understanding prognostic gene expression signatures in lung cancer. Clin Lung

Cancer 2009, 10.

108. Jonsson PF, Bates PA: Global topological features of cancer proteins in the human

interactome. Bioinformatics 2006, 22:2291-2297.

109. Syed AS, D'Antonio M, Ciccarelli FD: Network of Cancer Genes: a web resource to

analyze duplicability, orthology and network properties of cancer genes. Nucleic

Acids Res 2010, 38:D670-675.

110. Rambaldi D, Giorgi FM, Capuani F, Ciliberto A, Ciccarelli FD: Low duplicability and

network fragility of cancer genes. Trends Genet 2008, 24:427-430.

111. Li L, Zhang K, Lee J, Cordes S, Davis DP, Tang Z: Discovering cancer genes by

integrating network and functional properties. BMC Med Genomics 2009, 2:61.

112. Aragues R, Sander C, Oliva B: Predicting cancer involvement of genes from

heterogeneous data. BMC Bioinformatics 2008, 9:172.

113. Savas S, Geraci J, Jurisica I, Liu G: A comprehensive catalogue of functional genetic

variations in the EGFR pathway: protein-protein interaction analysis reveals novel

genes and polymorphisms important for cancer research. Int J Cancer 2009,

125:1257-1265.

114. King AD, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering.

Bioinformatics 2004, 20:3013-3020.

115. Spirin V, Mirny LA: Protein complexes and functional modules in molecular

networks. Proc Natl Acad Sci U S A 2003, 100:12123-12128.

116. Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community

structure of complex networks in nature and society. Nature 2005, 435:814-818.

117. Newman ME: Modularity and community structure in networks. Proc Natl Acad Sci

U S A 2006, 103:8577-8582.

118. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human disease

network. Proc Natl Acad Sci U S A 2007, 104:8685-8690.

119. Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst

Biol 2007, 3:88.

120. Lancichinetti A, Fortunato S, Kertész J: Detecting the overlapping and hierarchical

community structure in complex networks. New Journal of Physics 2009, 11.

121. Fortney K, Morgen E, Kotlyar M, Jurisica I: In silico drug screen in mouse liver

identifies candidate calorie restriction mimetics. Rejuvenation Res 2012, 15.

122. Fortney K, Xie W, Kotlyar M, Griesman J, Kotseruba Y, Jurisica I: NetwoRx:

Connecting drugs to networks and phenotypes in S. cerevisiae. (Submitted) 2012.

123. Fortney K, Griesman J, Kotlyar M, Jurisica I: Computationally repurposing drugs for

lung cancer with CMapBatch: candidate therapeutics from an integrative meta-

analysis of cancer gene signatures and chemogenomic data. In Preparation 2012.

124. Kim SK: Common aging pathways in worms, flies, mice and humans. J Exp Biol

2007, 210:1607-1612.

125. Golden TR, Hubbard A, Dando C, Herren MA, Melov S: Age-related behaviors have

distinct transcriptional profiles in Caenorhabditis elegans. Aging Cell 2008, 7:850-

865.

126. Budovsky A, Abramovich A, Cohen R, Chalifa-Caspi V, Fraifeld V: Longevity

network: construction and implications. Mech Ageing Dev 2007, 128:117-124.

127. Promislow DE: Protein networks, pleiotropy and the evolution of senescence. Proc

Biol Sci 2004, 271:1225-1234.

128. Hwang T, Park T: Identification of differentially expressed subnetworks based on

multivariate ANOVA. BMC Bioinformatics 2009, 10:128.

129. Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S: Network-based

analysis of affected biological processes in type 2 diabetes models. PLoS Genet 2007,

3:e96.

130. Xue H, Xian B, Dong D, Xia K, Zhu S, Zhang Z, Hou L, Zhang Q, Zhang Y, Han JD: A

modular network model of aging. Mol Syst Biol 2007, 3:147.

131. Wang X, Dalkic E, Wu M, Chan C: Gene module level analysis: identification to

networks and dynamics. Curr Opin Biotechnol 2008, 19:482-491.

132. Ulitsky I, Shamir R: Identification of functional modules using network topology and

high-throughput data. BMC Syst Biol 2007, 1:8.

133. Budovskaya YV, Wu K, Southworth LK, Jiang M, Tedesco P, Johnson TE, Kim SK: An

elt-3/elt-5/elt-6 GATA transcription circuit guides aging in C. elegans. Cell 2008,

134:291-303.

134. Ulitsky I, Karp R, Shamir R: Detecting Disease-Specific Dysregulated Pathways Via

Analysis of Clinical Expression Profiles. In Proceedings of 12th Int'l Conf Research in

Comp Molecular Biology (RECOMB'08). 2008

135. Dittrich M, Klau G, Rosenwald A, Dandekar T, Müller T: Identifying functional

modules in protein-protein interaction networks: an integrated exact approach.

Bioinformatics 2008, 24:i223-231.

136. Marbach D, Schaffter T, Mattiussi C, Floreano D: Generating realistic in silico gene

networks for performance assessment of reverse engineering methods. J Comput Biol

2009, 16:229-239.

137. Clauset A: Finding local community structure in networks. Phys Rev E Stat Nonlin

Soft Matter Phys 2005, 72:026132.

138. Bair E, Tibshirani R: Semi-supervised methods to predict patient survival from gene

expression data. PLoS Biol 2004, 2:E108.

139. Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA

microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003,

95:14-18.

140. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and

Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B 1995,

57:289-300.

141. Brown KR, Otasek D, Ali M, McGuffin MJ, Xie W, Devani B, Toch IL, Jurisica I:

NAViGaTOR: Network Analysis, Visualization and Graphing Toronto.

Bioinformatics 2009, 25:3327-3329.

142. Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S,

Orchard S, Sarkans U, von Mering C, et al: The HUPO PSI's molecular interaction

format--a community standard for the representation of protein interaction data.

Nat Biotechnol 2004, 22:177-183.

143. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima

S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the

environment. Nucleic Acids Res 2008, 36:D480-484.

144. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS,

Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new

developments in phylogenetic classification of proteins from complete genomes.

Nucleic Acids Res 2001, 29:22-28.

145. Chang C-C, Lin C-J: LIBSVM: a library for support vector machines. 2001.

146. Alexa A, Rahnenfuhrer J, Lengauer T: Improved scoring of functional groups from

gene expression data by decorrelating GO graph structure. Bioinformatics 2006,

22:1600-1607.

147. Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression

and hybridization array data repository. Nucleic Acids Res 2002, 30:207-210.

148. Lee I, Lehner B, Crombie C, Wong W, Fraser AG, Marcotte EM: A single gene network

accurately predicts phenotypic effects of gene perturbation in Caenorhabditis

elegans. Nat Genet 2008, 40:181-188.

149. Smith ED, Tsuchiya M, Fox LA, Dang N, Hu D, Kerr EO, Johnston ED, Tchao BN, Pak

DN, Welton KL, et al: Quantitative evidence for conserved longevity pathways

between divergent eukaryotic species. Genome Res 2008, 18:564-570.

150. Smola A, Scholkopf B: A tutorial on support vector regression. Statistics and

Computing 2004, 14:199-222.

151. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy--analysis of Affymetrix GeneChip

data at the probe level. Bioinformatics 2004, 20:307-315.

152. Smyth GK: Limma: Linear Models for Microarray Data. In Bioinformatics and

Computational Biology Solutions Using R and Bioconductor. Edited by Gentleman R,

Carey V, Huber W, Irizarry R, Dudoit S. Heidelberg: Springer; 2005: 397-420

153. Kuhn A, Luthi-Carter R, Delorenzi M: Cross-species and cross-platform gene

expression studies with the Bioconductor-compliant R package 'annotationTools'.

BMC Bioinformatics 2008, 9:26.

154. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, Murino L,

Tagliaferri R, Brunetti-Pierri N, Isacchi A, di Bernardo D: Discovery of drug mode of

action and drug repositioning from transcriptional responses. Proc Natl Acad Sci U S

A 2010, 107:14621-14626.

155. Selman C, Kerrison ND, Cooray A, Piper MD, Lingard SJ, Barton RH, Schuster EF,

Blanc E, Gems D, Nicholson JK, et al: Coordinated multitissue transcriptional and

plasma metabonomic profiles following acute caloric restriction in mice. Physiol

Genomics 2006, 27:187-200.

156. Tsuchiya T, Dhahbi JM, Cui X, Mote PL, Bartke A, Spindler SR: Additive regulation of

hepatic gene expression by dwarfism and caloric restriction. Physiol Genomics 2004,

17:307-315.

157. Dhahbi JM, Mote PL, Fahy GM, Spindler SR: Identification of potential caloric

restriction mimetics by microarray profiling. Physiol Genomics 2005, 23:343-350.

158. Corton JC, Apte U, Anderson SP, Limaye P, Yoon L, Latendresse J, Dunn C, Everitt JI,

Voss KA, Swanson C, et al: Mimetics of caloric restriction include agonists of lipid-

activated nuclear receptors. J Biol Chem 2004, 279:46204-46212.

159. Fu C, Hickey M, Morrison M, McCarter R, Han ES: Tissue specific and non-specific

changes in gene expression by aging and by early stage CR. Mech Ageing Dev 2006,

127:905-916.

160. Miller RA, Chang Y, Galecki AT, Al-Regaiey K, Kopchick JJ, Bartke A: Gene

expression patterns in calorically restricted mice: partial overlap with long-lived

mutant mice. Mol Endocrinol 2002, 16:2657-2666.

161. Dhahbi JM, Kim HJ, Mote PL, Beaver RJ, Spindler SR: Temporal linkage between the

phenotypic and genomic responses to caloric restriction. Proc Natl Acad Sci U S A

2004, 101:5524-5529.

162. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of

Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31:e15.

163. Govindan J, Evans M: Pioglitazone in clinical practice: where are we now? Diabetes

Ther 2012, 3:1-8.

164. Terkeltaub RA, Furst DE, Bennett K, Kook KA, Crockett RS, Davis MW: High versus

low dosing of oral colchicine for early acute gout flare: Twenty-four-hour outcome

of the first multicenter, randomized, double-blind, placebo-controlled, parallel-

group, dose-comparison colchicine study. Arthritis Rheum 2010, 62:1060-1068.

165. Ludwig A, Fechner M, Wilck N, Meiners S, Grimbo N, Baumann G, Stangl V, Stangl K:

Potent anti-inflammatory effects of low-dose proteasome inhibition in the vascular

system. J Mol Med (Berl) 2009, 87:793-802.

166. Jagtap P, Soriano FG, Virág L, Liaudet L, Mabley J, Szabó É, Haskó G, Marton A,

Lorigados CB, Gallyas FJ, et al: Novel phenanthridinone inhibitors of poly(adenosine

5'-dipho... Critical Care Medicine 2002, 30:1071-1082.

167. Moskalev AA, Shaposhnikov MV: Pharmacological inhibition of phosphoinositide 3

and TOR kinases improves survival of Drosophila melanogaster. Rejuvenation Res

2010, 13:246-247.

168. Jafari M, Khodayari B, Felgner J, Bussel II, Rose MR, Mueller LD: Pioglitazone: an

anti-diabetic compound with anti-aging properties. Biogerontology 2007, 8:639-651.

169. Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S, Lucau-Danila A,

Anderson K, Andre B, et al: Functional profiling of the Saccharomyces cerevisiae

genome. Nature 2002, 418:387-391.

170. Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, et al: Functional

characterization of the S. cerevisiae genome by gene deletion and parallel analysis.

Science 1999, 285:901-906.

171. Ericson E, Gebbia M, Heisler LE, Wildenhain J, Tyers M, Giaever G, Nislow C: Off-

target effects of psychoactive drugs revealed by genome-wide assays in yeast. PLoS

Genet 2008, 4:e1000151.

172. Blackman RK, Cheung-Ong K, Gebbia M, Proia DA, He S, Kepros J, Jonneaux A,

Marchetti P, Kluza J, Rao PE, et al: Mitochondrial electron transport is the cellular

target of the oncology drug elesclomol. PLoS One 2012, 7:e29798.

173. Hillenmeyer ME, Fung E, Wildenhain J, Pierce SE, Hoon S, Lee W, Proctor M, St Onge

RP, Tyers M, Koller D, et al: The chemical genomic portrait of yeast: uncovering a

phenotype for all genes. Science 2008, 320:362-365.

174. Parsons AB, Lopez A, Givoni IE, Williams DE, Gray CA, Porter J, Chua G, Sopko R,

Brost RL, Ho CH, et al: Exploring the mode-of-action of bioactive compounds by

chemical-genetic profiling in yeast. Cell 2006, 126:611-625.

175. Hillenmeyer ME, Ericson E, Davis RW, Nislow C, Koller D, Giaever G: Systematic

analysis of genome-wide fitness data in yeast reveals novel gene function and drug

action. Genome Biol 2010, 11:R30.

176. Engel SR, Balakrishnan R, Binkley G, Christie KR, Costanzo MC, Dwight SS, Fisk DG,

Hirschman JE, Hitz BC, Hong EL, et al: Saccharomyces Genome Database provides

mutant phenotype data. Nucleic Acids Res 2010, 38:D433-436.

177. Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, Mira NP, Alenquer M,

Freitas AT, Oliveira AL, Sa-Correia I: The YEASTRACT database: a tool for the

analysis of transcription regulatory associations in Saccharomyces cerevisiae.

Nucleic Acids Res 2006, 34:D446-451.

178. Fabrizio P, Hoon S, Shamalnasab M, Galbani A, Wei M, Giaever G, Nislow C, Longo

VD: Genome-wide screen in Saccharomyces cerevisiae identifies vacuolar protein

sorting, autophagy, biosynthetic, and tRNA methylation genes involved in life span

regulation. PLoS Genet 2010, 6:e1001024.

179. Matecic M, Smith DL, Pan X, Maqani N, Bekiranov S, Boeke JD, Smith JS: A

microarray-based genetic screen for yeast chronological aging factors. PLoS Genet

2010, 6:e1000921.

180. Powers RW, 3rd, Kaeberlein M, Caldwell SD, Kennedy BK, Fields S: Extension of

chronological life span in yeast by decreased TOR pathway signaling. Genes Dev

2006, 20:174-184.

181. Langfelder P, Horvath S: WGCNA: an R package for weighted correlation network

analysis. BMC Bioinformatics 2008, 9:559.

182. Bolton EE, Wang Y, Thiessen PA, Bryant SH: Chapter 12 PubChem: Integrated

Platform of Small Molecules and Biological Activities. In Annual Reports in

Computational Chemistry. Volume 4. Edited by Wheeler RA, Spellmeyer DC: Elsevier;

2008: 217-241

183. Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR,

Costanzo MC, Dwight SS, Engel SR, et al: Saccharomyces Genome Database: the

genomics resource of budding yeast. Nucleic Acids Res 2012, 40:D700-705.

184. Brown KR, Otasek D, Ali M, McGuffin M, Xie W, Devani B, van Toch IL, Jurisica I:

NAViGaTOR: Network Analysis, Visualization and Graphing Toronto.

Bioinformatics 2009, 25:3327-3329.

185. Fontana L, Partridge L, Longo VD: Extending healthy life span--from yeast to

humans. Science 2010, 328:321-326.

186. Lee KS, Lee BS, Semnani S, Avanesian A, Um CY, Jeon HJ, Seong KM, Yu K, Min KJ,

Jafari M: Curcumin extends life span, improves health span, and modulates the

expression of age-associated aging genes in Drosophila melanogaster. Rejuvenation

Res 2010, 13:561-570.

187. Roemer T, Davies J, Giaever G, Nislow C: Bugs, drugs and chemical genomics. Nat

Chem Biol 2012, 8:46-56.

188. Smith AM, Ammar R, Nislow C, Giaever G: A survey of yeast genomic assays for

drug and target discovery. Pharmacol Ther 2010, 127:156-164.

189. O'Connell BC, Adamson B, Lydeard JR, Sowa ME, Ciccia A, Bredemeyer AL,

Schlabach M, Gygi SP, Elledge SJ, Harper JW: A genome-wide camptothecin

sensitivity screen identifies a mammalian MMS22L-NFKBIL2 complex required for

genomic stability. Mol Cell 2010, 40:645-657.

190. Schlabach MR, Luo J, Solimini NL, Hu G, Xu Q, Li MZ, Zhao Z, Smogorzewska A,

Sowa ME, Ang XL, et al: Cancer proliferation gene discovery through functional

genomics. Science 2008, 319:620-624.

191. Hayat MJ, Howlader N, Reichman ME, Edwards BK: Cancer statistics, trends, and

multiple primary cancer analyses from the Surveillance, Epidemiology, and End

Results (SEER) Program. Oncologist 2007, 12:20-37.

192. Fortney K, Jurisica I: Integrative computational biology for cancer research. Hum

Genet 2011, 130:465-481.

193. Chang M, Smith S, Thorpe A, Barratt MJ, Karim F: Evaluation of phenoxybenzamine

in the CFA model of pain following gene expression studies and connectivity

mapping. Mol Pain 2010, 6:56.

194. Kunkel SD, Suneja M, Ebert SM, Bongers KS, Fox DK, Malmberg SE, Alipour F,

Shields RK, Adams CM: mRNA expression signatures of human skeletal muscle

atrophy identify a natural compound that increases muscle mass. Cell Metab 2011,

13:627-638.

195. Wang G, Ye Y, Yang X, Liao H, Zhao C, Liang S: Expression-based in silico screening

of candidate therapeutic compounds for lung adenocarcinoma. PLoS One 2011,

6:e14573.

196. Ebi H, Tomida S, Takeuchi T, Arima C, Sato T, Mitsudomi T, Yatabe Y, Osada H,

Takahashi T: Relationship of deregulated signaling converging onto mTOR with

prognosis and classification of lung adenocarcinoma shown by two independent in

silico analyses. Cancer Res 2009, 69:4027-4035.

197. Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, Sage J, Butte

AJ: Discovery and preclinical validation of drug indications using compendia of

public gene expression data. Sci Transl Med 2011, 3:96ra77.

198. Zhang S-D, Gant TW: sscMap: An extensible Java application for connecting small-

molecule drugs using gene-expression signatures. BMC Bioinformatics 2009, 10:236.

199. McArt DG, Zhang SD: Identification of candidate small-molecule therapeutics to

cancer by gene-signature perturbation in connectivity mapping. PLoS One 2011,

6:e16382.

200. DTP: Developmental Therapeutics Program NCI/NIH. [http://dtp.nci.nih.gov/]

201. Gorzalczany Y, Gilad Y, Amihai D, Hammel I, Sagi-Eisenberg R, Merimsky O:

Combining an EGFR directed tyrosine kinase inhibitor with autophagy-inducing

drugs: a beneficial strategy to combat non-small cell lung cancer. Cancer Lett 2011,

310:207-215.

202. Ramalingam SS, Maitland ML, Frankel P, Argiris AE, Koczywas M, Gitlitz B, Thomas

S, Espinoza-Delgado I, Vokes EE, Gandara DR, Belani CP: Carboplatin and Paclitaxel

in combination with either vorinostat or placebo for first-line therapy of advanced

non-small-cell lung cancer. J Clin Oncol 2010, 28:56-62.

203. Suda K, Tomizawa K, Fujii M, Murakami H, Osada H, Maehara Y, Yatabe Y, Sekido Y,

Mitsudomi T: Epithelial to mesenchymal transition in an epidermal growth factor

receptor-mutant lung cancer cell line with acquired resistance to erlotinib. J Thorac

Oncol 2011, 6:1152-1161.

204. Willett P: Similarity searching using 2D structural fingerprints. Methods Mol Biol

2011, 672:133-158.

205. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, et

al: DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic

Acids Res 2011, 39:D1035-1041.

206. Seiler KP, George GA, Happ MP, Bodycombe NE, Carrinski HA, Norton S, Brudz S,

Sullivan JP, Muhlich J, Serrano M, et al: ChemBank: a small-molecule screening and

cheminformatics resource database. Nucleic Acids Res 2008, 36:D351-359.

207. Meyer AM, Dwyer-Nield LD, Hurteau GJ, Keith RL, O'Leary E, You M, Bonventre JV,

Nemenoff RA, Malkinson AM: Decreased lung tumorigenesis in mice genetically

deficient in cytosolic phospholipase A2. Carcinogenesis 2004, 25:1517-1524.

208. Weiser-Evans MC, Wang XQ, Amin J, Van Putten V, Choudhary R, Winn RA,

Scheinman R, Simpson P, Geraci MW, Nemenoff RA: Depletion of cytosolic

phospholipase A2 in bone marrow-derived macrophages protects against lung

cancer progression and metastasis. Cancer Res 2009, 69:1733-1738.

209. The PyMOL Molecular Graphics System, Version 1.3, Schrödinger, LLC.

210. Kotlyar M, Fortney K, Jurisica I: Network-based characterization of drug-regulated

genes, drug targets, and toxicity. (Submitted) 2012.

211. Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, Bernabe RR, Bhan MK, Calvo F,

Eerola I, Gerhard DS, et al: International network of cancer genome projects. Nature

2010, 464:993-998.

212. Ledford H: Big science: The cancer genome challenge. Nature 2010, 464:972-974.