Bioinformatics approaches to biomarker and drug
discovery in aging and disease
by
Kristen Fortney
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Medical Biophysics
University of Toronto
© Copyright by Kristen Fortney 2012
Bioinformatics approaches to biomarker and drug discovery in aging and disease
Kristen Fortney
Doctor of Philosophy
Graduate Department of Medical Biophysics
University of Toronto
2012
Abstract
Over the past two decades, high-throughput (HTP) technologies such as microarrays and mass
spectrometry have fundamentally changed the landscape of aging and disease biology. They have
revealed novel molecular markers of aging, disease state, and drug response. Some have been
translated into the clinic as tools for early disease diagnosis, prognosis, and individualized
treatment and response monitoring. Despite these successes, many challenges remain: HTP
platforms are often noisy and suffer from false positives and false negatives; optimal analysis
and successful validation require complex workflows; and the underlying biology of aging and
disease is heterogeneous and complex. Integrative computational biology can help
address these challenges by providing new analytical methods and software tools that leverage
the large and diverse body of publicly available HTP data.
In this thesis I report on four projects that develop and apply strategies from integrative
computational biology to identify improved biomarkers and therapeutics for aging and disease.
In Chapter 2, I proposed a new network analysis method to identify gene expression biomarkers
of aging, and applied it to study the pathway-level effects of aging and infer the functions of
poorly-characterized longevity genes. In Chapter 4, I adapted gene-level HTP chemogenomic
data to study drug response at the systems level; I connected drugs to pathways, phenotypes and
networks, and built the NetwoRx web portal to make these data publicly available. And in
Chapters 3 and 5, I developed a novel meta-analysis pipeline to identify new drugs that mimic
the beneficial gene expression changes seen with calorie restriction (Chapter 3), or that reverse
the pathological gene changes associated with lung cancer (Chapter 5).
The projects described in this thesis will help provide a systems-level understanding of the
causes and consequences of aging and disease, as well as new tools for diagnosis (biomarkers)
and treatment (therapeutics).
Acknowledgments
It is a pleasure to thank the many people who helped make this work possible.
Chief among these is my supervisor, Dr. Igor Jurisica. Throughout my degree, he has provided
exceptional guidance and support. The exciting and collaborative environment of the Jurisica lab
has been a great place to learn and develop as a scientist.
I would like to thank my Supervisory Committee members Drs. Elisabeth Tillier and Thomas
Kislinger for their advice and encouragement over the course of my graduate program.
I owe much to the help and support of my talented colleagues in the Jurisica lab and at the
University. Special thanks to Dr. Max Kotlyar for several years of fun and productive scientific
collaborations. I would also like to thank my other wonderful collaborators on the projects in this
thesis: Josh Griesman, Wing Xie, Dr. Eric Morgen, and Yulia Kotseruba. In addition, I’m
grateful to Abraham, Fiona, Kevin, and Marc for reading parts of this thesis and offering
valuable comments, Christian for keeping the cluster running, and Daniela and Sara for keeping
MaRS occupied late into the night. I thank the entire Jurisica lab for their great advice and
companionship over the years.
Finally, I would like to thank my family and friends for their support, optimism, and
encouragement throughout my studies.
Table of Contents
Acknowledgments
Table of Contents
Table of Figures
1 Introduction
1.1 High-throughput technologies for aging and disease
1.2 The challenges facing high-throughput biology
1.2.1 NOISE AND HETEROGENEITY
1.2.2 ANALYSIS
1.3 How integrative computational biology can address these challenges
1.3.1 DATA INTEGRATION
1.3.2 NETWORK ANALYSIS
1.4 Research contributions of this thesis
2 Inferring the functions of longevity genes with modular subnetwork biomarkers of Caenorhabditis elegans aging
2.1 Abstract
2.2 Introduction
2.2.1 Methods for extracting active subnetworks by integrating gene expression data, network connectivity, and supervised class labels
2.3 Results and Discussion
2.3.1 Identifying active subnetworks in aging by trading off network modularity and class relevance
2.3.2 Identifying modular subnetworks
2.3.3 Class relevance R
2.3.4 Network modularity M
2.3.5 Comparing regular and modular subnetworks
2.3.6 Modular subnetworks are more robust across studies than regular subnetworks
2.3.7 Modular subnetworks trained on aging gene expression data from wild-type worms successfully predict age in fer-15 worms
2.3.8 Subnetworks vs. genes
2.3.9 Modular vs. regular subnetworks
2.3.10 The role of the modularity coefficient in machine learning
2.3.11 Modular subnetworks predict wild-type worm age with low mean-squared error
2.3.12 Longevity genes play crucial roles in significant subnetworks
2.3.13 Significant subnetworks are enriched for known longevity genes
2.3.14 Examples of significant subnetworks containing known longevity genes
2.3.15 Modular subnetworks participate in many different age-related biological processes
2.3.16 Modular subnetworks can be used to annotate longevity genes with novel functions
2.4 Conclusions
2.5 Materials and Methods
2.5.1 Code
2.5.2 Data sets
2.5.3 Subnetwork analyses
2.5.4 Machine learning comparisons
2.5.5 GO and KEGG enrichment analyses
2.6 Abbreviations
2.7 Supplementary Materials
2.8 Acknowledgments
3 In silico drug screen in mouse liver identifies candidate calorie restriction mimetics
3.1 Abstract
3.2 Introduction
3.3 Materials and Methods
3.3.1 Code
3.3.2 Drug-drug interaction network
3.3.3 Acquiring transcriptional signatures of calorie restriction
3.3.4 Connectivity map analysis of CR signatures
3.3.5 Meta-analysis of drug-response data
3.4 Results
3.4.1 Transcriptional signatures of CR
3.4.2 Meta-analysis identified fourteen candidate CR mimetics
3.5 Discussion
3.6 Acknowledgments
4 NetwoRx: Connecting drugs to networks and phenotypes in S. cerevisiae
4.1 Abstract
4.2 Introduction
4.3 Materials and Methods
4.3.1 NetwoRx methods
4.3.2 Use-case methods
4.4 NetwoRx content and functionality
4.4.1 Database contents
4.4.2 Accessing data
4.5 NetwoRx use case examples
4.5.1 Retrieving drugs that perturb phenotypes: oxidative stress
4.5.2 Focused searches identify drugs with shared mode of action: drugs that target the same DNA damage pathways as Cisplatin
4.5.3 Bipartite networks reveal that some gene sets are druggable hubs
4.5.4 Clustering the drug-pathway matrix identifies drug modules that share modes of action
4.5.5 User-defined gene sets: identifying new drugs that modulate yeast chronological aging
4.6 Discussion
4.7 Acknowledgements
5 Computationally repurposing drugs for lung cancer with CMapBatch: candidate therapeutics from an integrative meta-analysis of cancer gene signatures and chemogenomic data
5.1 Abstract
5.2 Background
5.3 Results and discussion
5.3.1 CMapBatch meta-analysis strategy: From individual cancer gene signatures to candidate therapeutics
5.3.2 Candidate drugs identified via CMapBatch are more conserved across signature subsets than candidate drugs identified from single gene signatures
5.3.3 Characterizing and prioritizing candidate lung cancer therapeutics
5.3.4 Candidate therapeutics inhibit growth in nine NSCLC cell lines
5.3.5 Prioritizing drugs by structural similarity: eleven significant drugs are highly structurally similar to TOP drugs
5.3.6 Prioritizing drugs by shared target: twenty-eight significant drugs share a protein target with one or more TOP drugs
5.3.7 Common protein targets of significant drugs
5.3.8 Significant drugs are broad-acting: they affect more genes than other drugs
5.3.9 Many drugs are indicated for lung cancer independently of subtype
5.4 Conclusions
5.5 Methods
5.5.1 Code and software
5.5.2 Data sources
5.5.3 Connectivity map analysis of lung cancer signatures
5.5.4 Meta-analysis of drug-response data
5.5.5 NCI-60 analysis of significant drugs
5.6 Acknowledgements
5.7 Supplementary Material
6 General conclusions and significance
6.1 Conclusions
6.1.1 From genes to pathways, phenotypes, and networks
6.1.2 Integrating complementary HTP data sources
6.2 Open questions and future work
6.2.1 Limitations in HTP data
6.2.2 Future work
7 References
Table of Figures
Figure 2-1. High-scoring subnetworks fulfill two criteria: they are modular and related to aging.
Figure 2-2. Identifying modular subnetworks.
Figure 2-3. Modular subnetworks are highly conserved across studies.
Figure 2-4. Predicting worm age using machine learning.
Figure 2-5. Subnetworks and genes predict the age of fer-15 worms.
Figure 2-6. Modular subnetwork biomarkers of aging predict the age of individual wild-type worms.
Figure 2-7. Some examples of significant longevity subnetworks.
Figure 3-1. Fourteen drug treatments significantly mimic the effects of CR on hepatic gene expression.
Figure 4-1. Gene set analysis of chemogenomic data.
Figure 4-2. Drugs that perturb oxidative stress pathways.
Figure 4-3. Mode of action analysis of the chemotherapeutic cisplatin.
Figure 4-4. Bipartite network showing all connections between drugs and YEASTRACT targets of transcription factors.
Figure 4-5. Drug module identified by clustering the matrix of drug-drug similarity scores.
Figure 4-6. Drugs predicted by NetwoRx to modulate yeast chronological lifespan.
Figure 5-1. CMapBatch meta-analysis pipeline.
Figure 5-2. CMapBatch produces more stable lists of significant drugs than individual gene signatures.
Figure 5-3. Drug candidates inhibit growth in lung cancer cell lines more than other Connectivity Map drugs.
Figure 5-4. Prioritizing drug candidates with GI50 values and chemical structures.
Figure 5-5. Significant drugs share many protein targets.
Figure 5-6. Significant drugs affect more genes than other Connectivity Map drugs.
Abbreviations
CGH Comparative genomic hybridization
CR Calorie restriction
FDR False discovery rate
GSEA Gene set enrichment analysis
GSS Gene set score
HTP High-throughput
KS Kolmogorov-Smirnov
MoA Mode of action
MSE Mean-squared error
PCR Polymerase chain reaction
PPI Protein-protein interaction
SCC Squared correlation coefficient
SVR Support vector regression
RNAi RNA interference
Databases and resources used in research chapters
Biological pathway annotations and related data
Gene Ontology (GO)
http://www.geneontology.org/
Human Ageing Genomic Resources (HAGR)
http://genomics.senescence.info/
Kyoto Encyclopedia of Genes and Genomes (KEGG)
http://www.genome.jp/kegg/pathway.html
Saccharomyces Genome Database
http://www.yeastgenome.org/
Wormbase
http://www.wormbase.org/
YEASTRACT
http://www.yeastract.com/
Experimental data
Functional interactions
Wormnet
http://www.functionalnet.org/wormnet/
Gene expression
Gene Expression Omnibus (GEO)
http://www.ncbi.nlm.nih.gov/geo/
Cancer Data Integration Portal (CDIP)
http://ophid.utoronto.ca/cdip
Connectivity Map (CMap)
http://www.broadinstitute.org/cmap/
Oncomine
https://www.oncomine.org/
Network visualization software
NAViGaTOR
http://ophid.utoronto.ca/navigator/
1 Introduction
Parts of this chapter are based on:
Kristen Fortney and Igor Jurisica (2011). Integrative computational biology for cancer research.
Human Genetics 130(4):465-81.
1.1 High-throughput technologies for aging and disease
Since the commercialization of DNA microarray technology in the late 1990s, high-throughput
(HTP) data relevant to aging and disease have been accumulating at an increasing rate. For
example, there have been over 80 published large-scale gene expression studies tracking the
transcriptional changes that occur with aging in humans and model organisms [1]. These data
have led to crucial insights into aging biology relevant to human health, including tissue-specific
aging [2, 3] and mechanisms of lifespan extension through dietary restriction or pharmacological
intervention [4-6]. In cancer research, HTP technologies have successfully been applied to help
elucidate the mechanisms of tumorigenesis, metastasis, and drug resistance [7]. They have also
had enormous clinical impact, e.g., several cancers can now be split into therapeutic subsets with
unique prognostic outcomes based on their molecular phenotypes [8-14].
Despite these successes, many challenges remain. For example, in differential expression
analyses of aging microarray data, few if any genes or biological processes have been identified
as age-regulated among multiple species [3, 15]; in genome-wide association studies of human
longevity (e.g., comparing centenarians to normal individuals), most longevity SNPs identified
to date are not robust across studies or populations [16-18]. And in studies of disease, predictive
and prognostic biomarkers derived from HTP transcriptomic or proteomic data are notoriously
inconsistent from study to study (i.e., they show poor overlap), and often cannot be validated by
other methods or in new cohorts of patients [19-21].
Aging and disease are highly complex; more and higher-throughput data do not immediately
translate into a better understanding of their mechanisms. The challenges to HTP data
interpretation fall under two main categories. The first is noise: HTP platforms are inherently
noisy (experimental noise) – results vary substantially from run to run and from lab to lab, and
are prone to false positives and negatives – and there is substantial heterogeneity (biological
noise) in the systems we wish to model. The second challenge is analysis: simple methods – such
as univariate analyses of microarray data – often miss much of the signal in the data.
Integrative computational methods will continue to play a central role in addressing these
challenges. We need new analytical methods to identify complex signals in different data
sources, as well as to combine information from diverse experimental platforms and other
sources that offer different perspectives on the problem, e.g., gene and protein expression,
protein-protein interactions and pathways, chromosomal aberrations, mutation events, epigenetic
changes, and clinical information from drug trials and the bedside. These methods will depend
on advances in many areas such as statistics, knowledge representation and ontology, machine
learning, data mining, graph theory, and visualization.
Below I review some of the major challenges to the interpretation of HTP data relevant to aging
and disease and give examples of integrative computational methods and strategies that have
helped to meet them. Next I outline how my four research projects, Chapters 2 through 5,
address some of these challenges. My projects focus on identifying biomarkers and therapeutics
for aging and disease, two tasks of particular concern to translational medicine. For these
projects, I apply and integrate multiple HTP data sources, including RNA microarray data of
aging, disease state, and drug response; protein-protein and functional interaction networks;
RNAi assays; DNA barcode chemogenomic screens; and large-scale in vitro growth inhibition
drug screens performed on human cell lines.
1.2 The challenges facing high-throughput biology
A wealth of genomic and proteomic data is now available from HTP screens. While these data
have improved our understanding of aging and disease and some have even translated into
improved patient diagnosis and treatment, significant challenges remain. In this section I briefly
review some of the major obstacles to progress.
1.2.1 NOISE AND HETEROGENEITY
Aging and disease cause heterogeneous changes, and HTP platforms are noisy. Consequently,
there can be large variation in results from lab to lab [22, 23]; methods are needed to circumvent
this noise and to integrate different data sources to make them more reliable. The main noise
issues that plague HTP research are false negatives and false positives, biological heterogeneity,
and platform bias.
1.2.1.1 False negatives and false positives in HTP data
Many HTP screens suffer from noise – leading to both false positives and false negatives –
which must be resolved through complementary experiments and computational analysis [24,
25]. For example, while HTP protein-protein interaction screens can identify thousands of
protein interactions at once, they do so at the cost of either high false discovery rates or poor
sensitivity. The false discovery rate is the proportion of detected interactions that are false, and
sensitivity is the proportion of true interactions that are successfully detected. When interactions
detected by two HTP studies were tested in small-scale screens, they were found to have false
discovery rates of 22% [26] and 38% [27]. An evaluation of five HTP methods found that their
sensitivity rates ranged from only 21% to 36% at false discovery rates of 0-11% [28]. Similarly,
mass spectrometry analyses of human serum typically produce many false negatives [29]. One
problem is that human serum has a high dynamic range – protein concentrations are estimated to
vary over 10 orders of magnitude [29]. Mass spectrometers have a much smaller range of
detection, leading to false negatives: low abundance proteins are not detected. This challenge
may be somewhat diminished by extensive sample fractionation [30].
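As a small worked illustration of the two definitions above, the following Python sketch computes the false discovery rate and the sensitivity of a screen from its counts of true positives, false positives, and false negatives. The function names and the counts are hypothetical, chosen only so that the results fall within the ranges quoted above; they are not taken from the cited studies.

```python
def false_discovery_rate(true_positives: int, false_positives: int) -> float:
    """Proportion of detected interactions that are false."""
    detected = true_positives + false_positives
    if detected == 0:
        raise ValueError("no interactions were detected")
    return false_positives / detected


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Proportion of true interactions that are successfully detected."""
    actual = true_positives + false_negatives
    if actual == 0:
        raise ValueError("no true interactions exist")
    return true_positives / actual


# Hypothetical screen (made-up counts): 1000 true interactions exist,
# and the screen reports 300 interactions, of which 270 are real.
tp, fp, fn = 270, 30, 730
fdr = false_discovery_rate(tp, fp)
sens = sensitivity(tp, fn)
print(f"FDR = {fdr:.0%}, sensitivity = {sens:.0%}")  # FDR = 10%, sensitivity = 27%
```

In this made-up case the screen is fairly precise (one in ten reported interactions is false) yet it still misses almost three quarters of the true interactions, which is the trade-off described above.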
1.2.1.2 Biological heterogeneity
Tumor heterogeneity. Often only a single sample from each tumor is available for analysis. But
tumors are highly heterogeneous, so these samples may not be representative of the whole tumor
[31-33]. Tumors comprise cells belonging to several distinct subpopulations – for example,
tumor regions can be hypoxic to different extents, or be made up of different proportions of
tumor initiating cells (cancer stem cells) – and these differences have consequences for
predicting drug response and prognosis [32-34]. Intra-tumor heterogeneity can lead to different
cell populations expressing different levels of protein, or having different mutations and copy
number alterations, which complicates analyses. For example, variable tumor epithelial and
stromal cell content in breast tumor samples can significantly affect gene expression profiles and
signature accuracy [35, 36]. Some of these difficulties can be alleviated with techniques such as
laser-capture micro-dissection [37], which can isolate regions of a sample that contain a more
uniform population of cells, but these techniques remain costly and slow. In addition, they may
not result in a sufficient amount of material for follow-up experiments. Though single samples
from tumors can suffice for population-level studies [33], the fact that two samples from the
same patient can be quite different means that this heterogeneity poses a significant challenge to
personalized medicine. On top of that, in cancer studies there are issues with the control samples
available for analysis: often these “normal” samples come from tissue directly adjacent to the
tumor. The properties of these samples, such as gene expression profiles, may be quite different
from those of more distant tissue or of a healthy patient [38].
Transcriptional noise in aging. Variability in the expression of single genes – also called gene
expression noise – increases with age. Cell-level expression variability increases with age in
mouse heart [39] (assessed using single cell PCR), and population-level variability increases in
rat retina [40] and human kidney, skeletal muscle, and brain [41] (assessed using microarrays),
though not for every single gene. Aging-induced transcriptional noise is thought to result from
the accumulation of random nuclear DNA damage with age [39]. As a consequence of this noise,
the signal in HTP aging data can be difficult to detect by conventional analysis methods [42].
1.2.1.3 Different experimental platforms can disagree
Experimental platforms produced by different companies can yield conflicting results [43-45].
For example, different microarray platforms use different probe design and labeling and have
different dynamic ranges. A gene may be overexpressed in disease on one platform, yet under-
expressed on another, simply because the two platforms use different DNA sequences to “probe”
the same gene. On a given platform, some genes are represented by many probes while others are
represented by one or none, and this representation differs between platforms. Genes that are expressed
at low levels are particularly problematic for concordance across array platforms [46]. Another
problem is that genome annotations continue to grow and change [47]. Updated probe set
definitions can substantially affect the number and the identity of differentially expressed genes
[48].
1.2.2 ANALYSIS
A primary goal of integrative computational biology analysis in aging and disease is to identify
small groups of genes/proteins/microRNAs etc. that can be used to improve diagnosis, predict
survival or predict treatment response, i.e., to identify prognostic or predictive biomarkers. Their
identification in HTP data is challenging because basic analysis methods fail to capture the entire
signal in the data, and good signatures do not consist solely of the most differentially expressed
molecules. In this section we focus on gene expression microarray studies
(they have been the most extensively studied), but the criticisms apply equally well to similar
experimental designs, such as those using HTP protein or microRNA assays.
1.2.2.1 Lists of differentially expressed genes show poor overlap across studies
The most popular way of analyzing microarray data is to identify individual genes that are
differentially expressed in one condition vs. in another, e.g., in non-responders vs. responders to
some drug treatment. Differentially expressed genes are widely used in aging and disease
research, e.g. as biomarkers for diagnosis, prognosis, and drug response. Unfortunately, there are
major challenges with their identification and interpretation. Previous work has shown that lists
of differentially expressed genes are poorly reproduced across studies [49, 50]; even random
subsets of samples from one experiment can yield widely divergent gene lists. This problem is
caused by high dimensionality, small number of samples, and noise (biological and technical
variability), but can be exacerbated by the analysis method. For example, analyses that quantify
the differential expression of gene groups rather than individual genes show higher conservation
across platforms and studies [51].
1.2.2.2 The most differentially expressed genes do not yield the best signatures
Very often in microarray experiments, the most differentially expressed genes are used to
construct a prognostic or predictive signature: machine learning methods are trained to use the
expression levels of those genes to predict, for example, the disease state of a patient or the
probability of survival. The problem with this approach is that single-gene analyses overlook
multivariate effects. More sophisticated analyses are needed to identify sets of genes that
complement one another, i.e., ones whose combined expression levels yield the best-performing
signatures. Such analyses show that genes essential to a good prognostic signature are often not
highly differentially expressed on their own [52, 53].
1.2.2.3 Signatures validate poorly on other datasets
One of the most important contributions of HTP biology to disease and aging research has been
to develop prognostic and predictive signatures. Unfortunately, as with differentially expressed
genes, many signatures have failed to validate by other methods or in new cohorts of patients
[19, 54]. Existing prognostic and predictive biomarkers for the same condition overlap only
partially, and the set of biomarker genes identified depends strongly on the subset of patients
used to generate it [55]. One study estimated that to achieve 50% overlap in prognostic gene sets
for breast cancer patients would require several thousand samples [56]. Several factors contribute
to this problem, including: 1) diverse patients and heterogeneous samples, 2) different profiling
platforms, 3) diverse statistical and bioinformatics approaches to biomarker identification [57],
4) an insufficient number of samples [56], and 5) the existence of multiple equivalent signatures
[55, 58].
1.3 How integrative computational biology can address these
challenges
The field of integrative computational biology uses techniques from computer science,
mathematics, physics and engineering to comprehensively analyze and interpret biological data.
Through the creation of new analysis and visualization methods, software tools and databases, it
can help diminish the challenges to HTP biology. Here we present some successful applications
of integrative computational biology to understanding aging and treating disease, drawing
examples from the databases and algorithms central to the work in this thesis. These applications
fall under two main categories: data integration and network analysis.
1.3.1 DATA INTEGRATION
As we have seen, noise in HTP studies can arise both from biological and technological
variability. One of the most effective strategies for reducing both types of noise is data
integration. The idea is simple: we can be more confident about the result of an experiment if
similar experiments yielded similar results. We can integrate different experiments that measure
the same biological entity, such as microarray studies measuring tumor vs. normal gene
expression differences on different experimental platforms. We can also integrate different data
types, such as mutation, expression, and proteomic data. Clearly, data integration can increase
our confidence in results that are consistent across multiple studies and experimental modalities.
But data integration can also increase sensitivity, since different platforms and methods exhibit
different biases – e.g., protein interactions may be undetectable by some methods.
1.3.1.1 Integrating the same type of data across multiple platforms and studies
With microarray and similar data, small sample numbers and different experimental platforms
can lead to highly variable results. These problems can be addressed by combining data from
different studies and platforms, which increases the effective number of samples and helps
control for inter-platform heterogeneity.
Most approaches to integrating microarray data can be divided into two general classes, pooling
and meta-analyses. In pooling, multiple expression datasets are merged into a single dataset [59,
60]; typically, gene measurements from each separate study are transformed before pooling to
make the experiments more comparable [61]. Previous work found that pooling six breast cancer
datasets (over 900 samples in total) yielded better-performing signatures [59]. In contrast, for a
meta-analysis, statistics are computed for each dataset separately and then combined. Meta-
analyses identify gene changes that are seen consistently across many studies [7]. Meta-analyses
have recently identified core sets of genes commonly regulated with age [42] or with calorie
restriction [62] across multiple species. Several databases gather information from multiple
studies to facilitate meta-analysis, including, for aging research, Gene Aging Nexus [63] and the
Human Ageing Genomic Resources [64], and for cancer research, Oncomine [65], GeneSigDB
[66], and the Cancer Data Integration Portal (CDIP; http://ophid.utoronto.ca/cdip).
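The pooling strategy described above can be sketched as follows. This is a minimal illustration, not the procedure of the cited studies: we assume a per-gene z-score as the transformation applied within each study, and the function names `zscore` and `pool_studies` are hypothetical.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize one gene's measurements within a single study."""
    m, s = mean(values), pstdev(values)
    # Assumption: the gene is not constant within the study (s > 0).
    return [(v - m) / s for v in values]

def pool_studies(studies):
    """Merge studies into one dataset after per-study standardization.

    studies: list of dicts mapping gene -> list of per-sample values.
    Only genes measured in every study are kept.
    """
    shared = set.intersection(*(set(study) for study in studies))
    return {
        gene: [z for study in studies for z in zscore(study[gene])]
        for gene in sorted(shared)
    }
```

Standardizing within each study before merging removes study-specific offsets and scales, which is one simple way to make measurements from different platforms more comparable.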
One example of a successful meta-analytic strategy is the Rank Product method [67-69], which
we apply in Chapters 3 and 5. In this method, to identify (for example) genes that are
consistently up-regulated across studies in disease vs. normal samples, the approach is: 1) Within
each study, rank all genes from most to least up-regulated (using some criterion, such as the t-
statistic) 2) Compute products of ranks for all genes across all studies; 3) Generate a background
distribution of randomized Rank Products by permuting expression values for genes in each
array and repeating (1)-(2) many times.
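Formally, say that there are G genes, K studies, and B randomizations. Let $r_{gk}$ be the rank of
gene g in the kth study. Then the Rank Product of gene g is:
$$RP_g = \left( \prod_{k=1}^{K} r_{gk} \right)^{1/K}$$
Permuting expression values and repeating calculations yields randomized Rank Products $RP_g^{*b}$;
we can then assign a P value to each gene g as:
$$P_g = \frac{1}{BG} \sum_{b=1}^{B} \sum_{g'=1}^{G} I\left( RP_{g'}^{*b} \le RP_g \right)$$
where $I(\cdot)$ is the indicator function.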
We can then correct for multiple testing using, for example, the False Discovery Rate (FDR)
procedure.
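The Rank Product procedure can be sketched in a few lines. This is an illustrative simplification under stated assumptions: each study is already summarized as one score per gene (for example, a t-statistic), and the null distribution is generated by shuffling those per-study scores rather than by permuting expression values within arrays and recomputing the scores. The function names are hypothetical.

```python
import random
from bisect import bisect_right
from math import prod

def rank_genes(scores):
    """Rank genes within one study; rank 1 is the most up-regulated (largest score)."""
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    ranks = [0] * len(scores)
    for position, g in enumerate(order, start=1):
        ranks[g] = position
    return ranks

def rank_products(score_table):
    """score_table[k][g] is the score of gene g in study k.

    Returns the geometric mean of each gene's ranks across the K studies.
    """
    ranks = [rank_genes(study) for study in score_table]
    n_studies = len(score_table)
    n_genes = len(score_table[0])
    return [prod(r[g] for r in ranks) ** (1.0 / n_studies) for g in range(n_genes)]

def rank_product_pvalues(score_table, n_perm=1000, seed=0):
    """Permutation P value: fraction of randomized Rank Products <= the observed one."""
    rng = random.Random(seed)
    observed = rank_products(score_table)
    null = []
    for _ in range(n_perm):
        # Simplifying assumption: shuffle gene scores independently in each study.
        shuffled = [rng.sample(study, len(study)) for study in score_table]
        null.extend(rank_products(shuffled))
    null.sort()
    return [bisect_right(null, rp) / len(null) for rp in observed]
```

Small Rank Products indicate genes that rank near the top in every study; the resulting P values would still need a multiple-testing correction.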
1.3.1.2 Integrating different types of data
Integrating complementary data from different sources is helpful for reducing noise and
prioritizing targets [70, 71]. For example, Gortzak-Uzan et al. combined proteins identified in
ovarian cancer ascites with differentially expressed genes from CDIP and protein-protein
interactions from the Interologous Interaction Database (I2D) [72] to identify putative
biomarkers for early ovarian cancer detection in serum [71]. Methods of gene set analysis [73]
can be used to combine experimental data with annotation and pathway databases such as the
Gene Ontology [74], KEGG, and Reactome; these allow us to convert lists of differentially
expressed genes into lists of differentially expressed gene groups, which are more stable across
studies [51]. Gene set analysis is a crucial tool for most bioinformatics investigations, and some
form of it is used in each of Chapters 2-5. Gene set analysis methods differ in their choice of
gene-level statistics, the way they combine these into gene-set statistics, and their methods for
assigning significance [73].
One popular algorithm for analyzing gene sets is Gene Set Enrichment Analysis (GSEA) [51].
GSEA is the core algorithm used to combine drug response data with gene expression signatures
of disease in the Broad Institute’s Connectivity Map (CMap;
http://www.broadinstitute.org/cmap/) [75], a resource that we make extensive use of in Chapters
3 and 5; we also implement the same procedure as the first step of our CMapBatch algorithm
(Chapter 5). CMap (build 02) contains data on the gene expression responses of human cell lines
to 6100 drug treatments (corresponding to 1309 unique drugs; for some drugs, drug dosage or
cell line was varied). For each drug treatment, probe sets are ranked in order of most up-
regulated to most down-regulated in response to the treatment. As previously described [75], the
CMap online tool takes as input a gene signature of disease from a separate experiment – for
example, a set of genes up-regulated and a set of genes down-regulated with diabetes – and
applies GSEA to identify drugs that reverse these harmful gene expression changes.
Formally, for each drug treatment i, and for each of the two sets of signature genes, called tag
lists (the up-regulated genes and the down-regulated genes), the CMap resource computes a
Kolmogorov-Smirnov statistic, $ks^{up}_i$ and $ks^{down}_i$, and then combines these statistics into a
“connectivity score”. These statistics are calculated as follows.
Let n be the number of probe sets for which drug-treatment response was measured (in CMap, n
= 22,283), t the number of probe sets in the tag list, and V a vector $(v_1, \ldots, v_t)$ of the rank of each
probe set in the tag list for treatment instance i (so each $v_j \in \{1, \ldots, 22283\}$), sorted in ascending
order.
Define a and b as:
$$a = \max_{j=1}^{t} \left[ \frac{j}{t} - \frac{v_j}{n} \right], \qquad b = \max_{j=1}^{t} \left[ \frac{v_j}{n} - \frac{j-1}{t} \right]$$
Then the Kolmogorov-Smirnov statistic for the tag list and instance is calculated as:
$$ks_i = \begin{cases} a & \text{if } a > b \\ -b & \text{if } b > a \end{cases}$$
Repeating these calculations using the set of up-regulated genes and the set of down-regulated
genes from the query signature yields $ks^{up}_i$ and $ks^{down}_i$, respectively. Then if
$\mathrm{sgn}(ks^{up}_i) = \mathrm{sgn}(ks^{down}_i)$, set the connectivity score $S_i = 0$. Otherwise, set
$s_i = ks^{up}_i - ks^{down}_i$, $p = \max_i(s_i)$, and $q = \min_i(s_i)$, and calculate the connectivity score as:
$$S_i = \begin{cases} s_i / p & \text{if } s_i > 0 \\ -(s_i / q) & \text{if } s_i < 0 \end{cases}$$
For a given input signature, the connectivity scores will range from -1 to 1; large negative scores
correspond to drugs that reverse the gene expression changes in the query signature. The
Connectivity Map has been applied to identify new therapeutics for a wide range of diseases
including various cancers (e.g., [76, 77]).
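The connectivity score calculation can be sketched as below. This is a minimal illustration under stated assumptions, not the CMap implementation: ranks are 1-based positions in each treatment's ordered probe-set list, ties between a and b (not covered by the definition) are arbitrarily assigned to -b, and the function names are hypothetical.

```python
def ks_statistic(tag_ranks, n):
    """Signed Kolmogorov-Smirnov-style statistic for one tag list and one instance.

    tag_ranks: 1-based ranks of the tag-list probe sets in the instance's ordering.
    n: total number of ranked probe sets.
    """
    v = sorted(tag_ranks)
    t = len(v)
    a = max(j / t - v[j - 1] / n for j in range(1, t + 1))
    b = max(v[j - 1] / n - (j - 1) / t for j in range(1, t + 1))
    return a if a > b else -b  # assumption: a tie is assigned to -b

def raw_scores(up_ranks, down_ranks, n):
    """One raw score s_i per instance; 0 when ks_up and ks_down share a sign."""
    scores = []
    for up, down in zip(up_ranks, down_ranks):
        ks_up, ks_down = ks_statistic(up, n), ks_statistic(down, n)
        same_sign = (ks_up > 0) == (ks_down > 0)
        scores.append(0.0 if same_sign else ks_up - ks_down)
    return scores

def connectivity_scores(raw):
    """Scale raw scores by the largest positive and most negative values across instances."""
    p, q = max(raw), min(raw)
    scaled = []
    for s in raw:
        if s > 0:
            scaled.append(s / p)
        elif s < 0:
            scaled.append(-(s / q))
        else:
            scaled.append(0.0)
    return scaled
```

In this sketch, an instance that places the up-regulated tag list near the bottom of its ranking and the down-regulated tag list near the top receives a score close to -1, i.e., it is a candidate for reversing the query signature.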
1.3.1.3 Integrating data to predict gene function
Though many disease-related genes remain uncharacterized, HTP data and in silico methods can
be applied to assign them putative functions [78]. For example, in the successful MouseFunc
prediction challenge [79], teams of scientists competed to predict mouse gene function (Gene
Ontology categories) on the basis of several different data sources, including expression,
sequence, interactions, phenotype annotations, disease associations and phylogenetic profiles.
Several groups have integrated data from a variety of independent sources to create functional
interaction networks and predict new gene functions. For example, Wormnet [80] combines co-
expression, protein interaction, genetic interaction, and co-citation data to predict functional
relations between pairs of C. elegans genes. Individual data sources were weighted by a log-
likelihood score designed to reflect their ability to recover shared Gene Ontology functions, and
then combined to create weighted edges between genes. Wormnet edges were then applied to
identify new gene functions using a simple guilt-by-association approach. For example, the
authors demonstrated that Wormnet neighbors of genes that increase lifespan (identified from an
RNAi screen) were themselves highly enriched for lifespan-increasing genes (identified in
independent RNAi screens).
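A guilt-by-association ranking of this kind can be sketched as follows. This is an assumption-laden illustration rather than the Wormnet procedure itself: each candidate gene is scored by the summed weight of its edges to genes already known to have the function of interest, and the function name and data layout are hypothetical.

```python
def guilt_by_association(network, seeds):
    """Rank candidate genes by their total edge weight to known (seed) genes.

    network: dict mapping gene -> dict of neighbour -> edge weight.
    seeds: set of genes already annotated with the function of interest.
    Returns (gene, score) pairs, highest score first; unconnected genes are omitted.
    """
    scores = {}
    for gene, neighbours in network.items():
        if gene in seeds:
            continue
        total = sum(w for nb, w in neighbours.items() if nb in seeds)
        if total > 0:
            scores[gene] = total
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Genes at the top of the returned list are the strongest candidates for sharing the seed genes' function, because they are most heavily connected to them.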
1.3.2 NETWORK ANALYSIS
Network approaches have been successful in addressing the analysis-based challenges of HTP
biology. Genes do not act in isolation; they form highly complex and interlinked molecular
networks. Examining genes in the context of these networks can yield valuable clues about their
function and relations, and expand our knowledge of individual pathways and their interactions
[81, 82]. For example, despite the noise in current protein interaction data sets, network analysis
can uncover biologically relevant information, such as lethality [83, 84], functional organization
[85-87], hierarchical structure [88, 89], modularity [90] and network-building motifs [91-93].
Three important applications of networks to aging and disease research include signature
generation, signature interpretation, and disease gene prediction.
1.3.2.1 Networks for signature generation
By integrating network information with gene expression data, we can identify predictive
signatures that perform better and are more conserved across studies than signatures based on
gene expression data alone. There are many ways of using networks to create improved gene
signatures. One class of methods that has proven very successful is score-based subnetwork
biomarkers [52, 94-97]. In these approaches, genes are aggregated around an initial “seed” gene
in a network to generate subnetworks whose pooled activity levels can be used to predict the
value of some response variable, such as disease status or survival time. For example, Chuang et
al. [52] calculated subnetwork scores as the mutual information [98] between subnetwork
activity (the mean normalized expression of subnetwork genes) and the class label (cancer vs.
normal). Subnetworks were grown outward iteratively from a seed node using a greedy search
procedure to maximize subnetwork score: at every step, the network neighbor of the current
subnetwork yielding the largest score increase was added to the subnetwork. Subnetworks
identified using this approach were shown to be highly conserved across studies, and to perform
better than individual genes or pre-defined gene groups at predicting breast cancer metastasis
[52]. Importantly, many crucial genes belonging to subnetwork biomarkers were not
differentially expressed on their own, demonstrating the added value of a network approach.
Related approaches have been used to develop subnetwork biomarkers for colon cancer using a
combination of proteome and transcriptome data [99, 100].
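The greedy growth procedure can be sketched as below. To keep the example short, we substitute a simpler score for mutual information: the absolute difference in mean subnetwork activity between the two classes. This stand-in score, the stopping threshold `min_gain`, and all function names are assumptions made for illustration.

```python
def activity(genes, expr):
    """Mean expression of the subnetwork genes in each sample."""
    n_samples = len(next(iter(expr.values())))
    return [sum(expr[g][s] for g in genes) / len(genes) for s in range(n_samples)]

def score(genes, expr, labels):
    """Stand-in score: absolute difference in mean activity between classes 1 and 0."""
    act = activity(genes, expr)
    pos = [a for a, y in zip(act, labels) if y == 1]
    neg = [a for a, y in zip(act, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def grow_subnetwork(seed, network, expr, labels, min_gain=1e-9):
    """Greedily add the neighbour giving the largest score increase until none helps.

    network: dict mapping gene -> set of neighbouring genes.
    expr: dict mapping gene -> list of normalized expression values per sample.
    labels: class label (0 or 1) of each sample.
    """
    sub = {seed}
    best = score(sub, expr, labels)
    while True:
        neighbours = {nb for g in sub for nb in network[g]} - sub
        candidates = [(score(sub | {nb}, expr, labels), nb) for nb in sorted(neighbours)]
        if not candidates:
            break
        new_score, gene = max(candidates)
        if new_score - best <= min_gain:
            break
        sub.add(gene)
        best = new_score
    return sub, best
```

Because each candidate is evaluated by the score of the whole subnetwork, a gene with weak differential expression on its own can still be added when it improves the pooled activity.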
1.3.2.2 Networks for signature interpretation
Many genes that play a role in predictive signatures for a given disease have not been previously
linked to that disease, and thus can be considered novel candidate disease genes. Networks can
be used to link these genes with known disease mechanisms and pathways [101-104]. Gene
signatures mapped to protein interactions can be further annotated with other profiles (including
proteomic, CGH, and miRNA studies), and with network structures, such as graphlets [93, 105].
Networks can also reveal new connections between different signatures. For example, though a
recently-identified 15-gene prognostic and predictive signature in lung cancer [106] did not
directly overlap with previously published ones, network analysis revealed that they were highly
related: there were direct interactions between the protein products of genes from the new
signature and others. Similar results have been shown in other studies [14, 107].
1.3.2.3 Networks for identifying new disease genes
Analyses of the network connectivity of disease genes have shown that they can be characterized
by several topological properties. For example, proteins encoded by cancer genes tend to be
central in interaction networks (they have high degree and betweenness centrality [108-110]),
have high clustering coefficients [111], and are overrepresented in network motifs [110].
Several methods use the topological characteristics of known disease genes, in combination with
other features (such as Gene Ontology (GO) categories, protein domains, biological pathways,
and sequence features), to predict new disease genes [110, 112] or functional SNPs [113]. Many
algorithms identify modules in interaction networks, or groups of densely interconnected genes
that can be highly functionally related [114-117]. Module-finding algorithms can also be applied
to predict new disease genes [52, 118, 119]. In these approaches, graph modules are first
determined from network topology; next, a statistical test such as the hypergeometric test is
applied to identify functional categories that individual modules are enriched for; and finally, all
genes in a module are annotated with its enriched functions.
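The enrichment step can be illustrated with a one-sided hypergeometric test. The sketch below assumes the usual setup (a background of all annotated genes, a functional category, and a module drawn from that background); the function name and argument names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_category, n_module, n_overlap):
    """P(X >= n_overlap) when n_module genes are drawn without replacement
    from n_background genes, of which n_category belong to the category."""
    upper = min(n_category, n_module)
    total = comb(n_background, n_module)
    tail = sum(
        comb(n_category, k) * comb(n_background - n_category, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

A small P value indicates that the module contains more genes from the category than expected by chance; in practice the test is repeated over many categories and modules, so the P values should be corrected for multiple testing.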
There are many distinct ways to use network topology to identify modules. Some methods,
like Restricted Neighborhood Search Clustering (RNSC) [114], partition a network into disjoint
modules. In RNSC, network nodes are first randomly assigned to modules, and then at every
iteration of the algorithm one node is moved to a different module to reduce a cost function that
depends on the number of intramodule and intermodule edges. Other methods, such as that
proposed by Lancichinetti and Fortunato [120], use only local network topology and allow for
the possibility of overlapping modules. In that algorithm, modules are iteratively grown out from
individual seed nodes to greedily maximize a measure of modularity that is a function of the
number of edges internal to the module (i.e., edges connecting two internal nodes) and the total
number of edges connected to module nodes.
1.4 Research contributions of this thesis
The four projects constituting the research contributions of this thesis are concerned with
identifying biomarkers and therapeutics for aging and disease. Each project uses several of the
techniques described above to help deal with the challenges of HTP biology. In Chapter 2, I
proposed a new network analysis method for identifying biomarkers of aging that uses gene
modules rather than individual genes, and showed that it improved biomarker robustness and
classification performance. In Chapter 4, I adapted gene-level HTP chemogenomic data to study
drug response at the systems level; I connected drugs to pathways, phenotypes and networks, and
built the NetwoRx web portal (http://ophid.utoronto.ca/networx/) to make these data publicly
available. And in Chapters 3 and 5, I developed a novel meta-analysis pipeline to identify new
drugs that mimic the beneficial gene expression changes seen with calorie restriction (Chapter 3),
or that reverse the pathological gene changes associated with lung cancer (Chapter 5). In these
applications, noise and biological heterogeneity are mitigated by conducting meta-analyses using
large numbers of gene signatures (Chapters 3, 5), and modeling drug and disease responses at the
level of gene groups (Chapter 4) or subnetwork modules (Chapter 2) rather than individual
genes; and also by integrating multiple complementary HTP data sources, e.g., gene expression
data with genome-wide RNAi phenotypes (Chapter 2) or large scale drug-induced growth-
inhibition data (Chapter 5).
Chapter 2: Inferring the functions of longevity genes with modular subnetwork biomarkers
of Caenorhabditis elegans aging.
Kristen Fortney, Max Kotlyar, and Igor Jurisica (2010). Genome Biology 11(2):R13. [97]
In this study we developed a new method to identify gene expression biomarkers of aging. We
overlaid expression data from two worm aging studies onto a functional interaction network, and
identified modular subnetwork biomarkers using a new performance criterion that trades off
modularity – internal cohesiveness at the network level – with relatedness to the class label (here,
worm age). We found that our method outperformed previous ones on key measures, yielding
biomarkers that were more conserved across studies and performed better on a difficult machine
learning task: predicting age based on expression data. We analyzed modular subnetwork
biomarkers to determine their relation to known mechanisms of aging, and found that they play
central roles in metabolic and DNA repair pathways, and are significantly enriched for longevity
genes. Finally, we applied them to assign putative aging-related functions to poorly characterized
longevity genes.
Chapter 3: In silico drug screen in mouse liver identifies candidate calorie restriction
mimetics.
Kristen Fortney, Eric Morgen, Max Kotlyar, and Igor Jurisica (2012). Rejuvenation Research
15(2). [121]
In this study we conducted a meta-analysis using multiple gene signatures to identify drugs that
mimic calorie restriction. Calorie restriction (CR) extends lifespan in mammals and can delay the
onset of age-related diseases, including cancer and diabetes. Drugs that target the same genes and
pathways as CR may have enormous therapeutic potential. We collected nine previously
published gene expression signatures of CR and screened them against HTP drug-response data
for over 1000 drugs (from the Connectivity Map) to obtain sets of drugs that mimic CR at the
transcriptional level. We implemented a novel meta-analysis method and identified 14 drugs that
consistently mimic CR across signatures. We characterized these drugs by relating them to
known lifespan-extending drugs and analyzed them using a mode-of-action network.
Chapter 4: NetwoRx: Connecting drugs to networks and phenotypes in S. cerevisiae.
Kristen Fortney, Wing Xie, Max Kotlyar, Joshua Griesman, Yulia Kotseruba and Igor Jurisica
(2012). Submitted. [122]
In this study we adapted gene-level HTP chemogenomic data to study drug response at the
systems level. We integrated the three largest S. cerevisiae chemogenomic experiments, which
together comprise the responses of thousands of gene knockout strains to 466 drugs, and applied
data-mining approaches to investigate drug effects at the system level. We identified yeast
pathways, functions, and phenotypes that are targeted by particular drugs, computed measures of
drug-drug similarity, and constructed drug-phenotype networks. We created the NetwoRx web
portal (http://ophid.utoronto.ca/networx/) to make the results of our analyses fully available and
to facilitate new systems-level analyses of drug response. We demonstrated with use case
examples how NetwoRx can be applied to target specific phenotypes, repurpose drugs using
mode-of-action analysis, investigate bipartite networks, and predict new drugs that affect yeast
aging.
Chapter 5: Computationally repurposing drugs for lung cancer with CMapBatch:
candidate therapeutics from an integrative meta-analysis of cancer gene signatures and
chemogenomic data.
Kristen Fortney, Joshua Griesman, Max Kotlyar, and Igor Jurisica (2012). In Preparation. [123]
Though existing methods that use signatures to repurpose drugs are based on the analysis of
individual signatures, for many diseases dozens of gene signatures are in the public domain. We
developed a new meta-analysis method, CMapBatch, to exploit these data, and made it publicly
available at http://ophid.utoronto.ca/cmapbatch. CMapBatch is a computational meta-analysis
pipeline that takes as input a collection of gene signatures of disease and outputs a list of drugs
predicted to consistently reverse pathological gene changes. We applied CMapBatch to a
collection of 21 gene expression signatures of lung cancer. We demonstrated that, while drug
candidates identified by Connectivity Map analysis of individual gene signatures are highly
variable, CMapBatch returns very stable sets of top drug candidates. Our meta-analysis of all 21
signatures revealed that 247 drugs consistently reversed lung cancer gene changes. In silico
validation on the NCI-60 collection showed that drug candidates significantly inhibit growth in
nine lung cancer cell lines.
2 Inferring the functions of longevity genes with
modular subnetwork biomarkers of Caenorhabditis
elegans aging
This chapter is based on:
Kristen Fortney, Max Kotlyar, and Igor Jurisica (2010). Inferring the functions of longevity
genes with modular subnetwork biomarkers of Caenorhabditis elegans aging. Genome Biology
11(2):R13.
2.1 Abstract
A central goal of biogerontology is to identify robust gene-expression biomarkers of aging. Here
we develop a method where the biomarkers are networks of genes selected based on age-
dependent activity and a graph-theoretic property called modularity. Tested on C. elegans, our
algorithm yields better biomarkers than previous methods — they are more conserved across
studies and better predictors of age. We apply these modular biomarkers to assign novel aging-
related functions to poorly characterized longevity genes.
2.2 Introduction
Aging is a highly complex biological process involving an elaborate series of transcriptional
changes. These changes can vary substantially in different species, in different individuals of the
same species, and even in different cells of the same individual [39, 124, 125]. Because of this
complexity, transcriptional signatures of aging are often subtle, making microarray data difficult
to interpret – more so than for many diseases [42, 63]. Interaction networks represent prior
biological knowledge about gene connectivity that can be exploited to help interpret complex
phenotypes like aging [126, 127]. Here for the first time, we integrate networks with gene
expression data to identify modular subnetwork biomarkers of chronological age.
With few exceptions, previous analyses of aging microarray data have been limited to studying
the differential expression of individual genes. However, single-gene analyses have been
criticized for several reasons. Briefly, they are insensitive to multivariate effects and often lead to
poor reproducibility across studies [49, 50, 58] – even random subsets of data from the same
experiment can produce widely divergent lists of significant genes. Recent studies have shown
that examining gene expression data at a systems level – in terms of appropriately chosen groups
of genes, rather than single genes – offers several advantages. Compared to significant genes,
significant gene groups are more replicable across different studies, lead to higher performance
in classification tasks, and are more biologically interpretable [49, 52].
Many complementary approaches to the systems-level analysis of microarray data have been
proposed. These range from methods like Gene Set Enrichment Analysis [51], which determines
whether members of pre-defined groups of biologically related genes (such as those supplied by
the Gene Ontology [74]) share significantly coordinated patterns of expression, to machine
learning methods that consider all possible combinations of genes and identify groups whose
combined expression pattern can distinguish between different phenotypes – with no constraint
that the genes in a group must be biologically related.
Network methods for interpreting gene expression data [52, 95, 128-132] fall in between these
two extremes: they incorporate prior biological knowledge in the form of an interaction network
– so that genes in a significant group are likely to participate in shared functions – but they
consider many different combinations of genes, and so are more flexible than methods using pre-
defined gene groups. Gene groups identified by these methods constitute novel biological
hypotheses about which genes participate together in common functions related to the class
variable.
Here, we propose a novel strategy for identifying subnetwork biomarkers: we incorporate a
measure of topological modularity into the expression for subnetwork score. This yields
subnetwork biomarkers that are biologically cohesive and that have different activity levels at
different ages. Using two aging microarray datasets, we show that our method improves on
previous approaches, yielding subnetworks that are more conserved across studies, and that
perform better in a machine learning task. We identify the subnetworks that play a role in worm
aging, and then explore their connection with known longevity genes. Finally, we apply them to
assign putative aging-related functions to longevity genes (genes that affect lifespan when
deleted or perturbed). Worm is the ideal model organism for studying these questions, since it
has the largest number of characterized longevity genes [64], and microarray datasets using
worms of 4 or more ages are publicly available [125, 133]. Our work builds on a family of
successful algorithms that incorporate supervised information to find subnetworks with
phenotype-dependent activity, which we discuss below.
2.2.1 Methods for extracting active subnetworks by integrating gene expression
data, network connectivity, and supervised class labels
To date, some of the most successful network-based methods of gene group identification for
class prediction have been the score-based subnetwork markers originally proposed in Ideker et
al. [94] and developed and expanded in later works, e.g. [52, 95, 128, 129, 134, 135].
Subnetworks identified using these approaches were recently shown to be highly conserved
across studies and to perform better than individual genes or pre-defined gene groups at
predicting breast cancer metastasis [52].
Most of these methods share the same basic architecture. Each algorithm aggregates genes
around a seed node in a way that maximizes some measure of performance. In previous
implementations, the score is a function of the subnetwork activity (often calculated as the mean
expression value of the genes in the subnetwork) and the class label – i.e. subnetworks get high
scores if their activity is different for different classes. Subnetworks are grown outward
iteratively from a seed node, typically using a greedy search procedure to maximize subnetwork
score: at every step, the network neighbour of the current subnetwork yielding the largest score
increase is added to the subnetwork.
Subnetwork scores are calculated differently in individual implementations (e.g. [95] uses the t-
statistic and [52] uses mutual information) but are always solely a function of what we refer to as
class relevance, i.e. of expression data and class labels. In particular, in all previous
implementations the subnetwork score is insensitive to network topology – the only topological
constraint is that subnetwork members must form a connected component.
However, a large body of work in network theory has demonstrated the value of more
sophisticated topological measures of network cohesiveness, or modularity [116, 117]. In fact,
many algorithms successfully identify groups of functionally related genes on the basis of
network topology alone. The simple intuition behind these algorithms is that genes that are
members of a highly interconnected group (that is only sparsely connected to the rest of the
network) are more likely to participate in the same biological function or process. In biological
networks, genes belonging to the same topological module are more likely to share functional
annotations or belong to the same protein complex [114, 115, 136].
No score-based subnetwork method proposed to date takes advantage of the rich modular
structure of biological interaction networks. Here, we propose incorporating topological
modularity into the expression for subnetwork score, and show that this approach offers
important advantages – increased conservation across studies, and improved performance on a
learning task. For the remainder of the paper, we refer to subnetworks grown using scores that
are a function of class relevance alone as regular subnetworks, and to those grown using our new
scoring criterion as modular subnetworks.
2.3 Results and Discussion
2.3.1 Identifying active subnetworks in aging by trading off network modularity
and class relevance
Here, we give a basic outline of our method for identifying subnetworks that are both highly
modular and relevant to the class variable (Fig. 2-1), and then we discuss the novel aspect, the
subnetwork scoring method, in detail; other algorithm parameters are listed in Materials and
Methods. We compared the performance characteristics of modular and regular subnetworks
using two microarray studies of worm aging [125, 133].
Figure 2-1. High-scoring subnetworks fulfill two criteria: they are modular and related to
aging.
A. High-scoring subnetworks have high modularity, i.e., they are highly interconnected, and sparsely
connected to the rest of the network. B. High-scoring subnetworks have high class relevance, i.e. they
have activity levels that increase or decrease as a function of worm age.
2.3.2 Identifying modular subnetworks
Our method is summarized in Fig. 2-2. First, we assign a weight to every edge in the interaction
network that reflects the strength of the relation between the two genes that flank it (quantified
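using Spearman correlation). For genes i and j with normalized expression vectors $z_i$ and $z_j$,
the weight $w_{ij}$ is defined as:
$$w_{ij} = a_{ij}\,\lvert \mathrm{corr}(z_i, z_j) \rvert, \quad \text{where } a_{ij} = \begin{cases} 1 & \text{if there is a network edge between nodes } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$$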
Next, we grow subnetworks starting at particular seed genes in the network (see Materials and
Methods). At each stage of the network growth procedure, the algorithm considers all network
neighbours of the current subnetwork N. For each neighbour, the algorithm calculates the change
in subnetwork score that would result if that neighbour were added to N. Here, we define the
subnetwork score S as a weighted sum of class relevance R and modularity M, where R captures
how related subnetwork activity is to age and M measures subnetwork cohesiveness:
$$S = R + \beta M \quad \text{for some } \beta \geq 0$$
At every stage, the neighbour that leads to the highest score increase (without reducing either
class relevance or modularity) is added to the subnetwork.
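The growth step described above can be sketched as follows. This is a minimal illustration, assuming the score takes the form S = R + βM and that growth stops when no admissible neighbour raises the score; the function name `grow_subnetwork`, the `max_size` cap, and the toy scoring functions are hypothetical, and the actual stopping rules are those given in Materials and Methods.

```python
def grow_subnetwork(seed, neighbours, relevance, modularity, beta, max_size=50):
    """Greedy growth from a seed gene, maximizing S = R + beta * M.

    neighbours: dict mapping gene -> set of adjacent genes
    relevance, modularity: functions taking a frozenset of genes -> float
    max_size: hypothetical safety cap, not taken from the text
    """
    current = frozenset([seed])
    while len(current) < max_size:
        r_now, m_now = relevance(current), modularity(current)
        s_now = r_now + beta * m_now
        best, best_gain = None, 0.0
        # Candidates: genes adjacent to the subnetwork but not already in it
        frontier = set().union(*(neighbours[g] for g in current)) - current
        for gene in sorted(frontier):  # sorted for deterministic tie-breaking
            trial = current | {gene}
            r_new, m_new = relevance(trial), modularity(trial)
            if r_new < r_now or m_new < m_now:
                continue  # adding this gene would reduce R or M
            gain = (r_new + beta * m_new) - s_now
            if gain > best_gain:
                best, best_gain = gene, gain
        if best is None:
            break  # no admissible neighbour increases the score
        current = current | {best}
    return current


# Toy network and toy scores (all names and values are illustrative only)
neighbours = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b"}, "d": {"a"}}
aging_genes = {"a", "b", "c"}
toy_relevance = lambda sub: len(sub & aging_genes) / 10
toy_modularity = lambda sub: 0.0
grown = grow_subnetwork("a", neighbours, toy_relevance, toy_modularity, beta=100)
```

In this toy run the subnetwork seeded at `a` absorbs `b` and then `c`, and leaves out `d` because adding it does not raise the score.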
The intuition behind the modularity coefficient β is that it allows us to trade off the information
in gene expression data with the prior knowledge about gene connectivity encoded in the
functional interaction network: for noisy microarray studies, or ones with few samples, we
should place a greater emphasis on prior knowledge by choosing higher values for β. Previous
subnetwork scoring algorithms effectively assume that β = 0, or S = R.
Figure 2-2. Identifying modular subnetworks.
A. Start with the largest connected component of the functional interaction network representing all
genes whose expression has been measured. B. Weight every edge of the network with the absolute
value of the Spearman correlation between the two genes flanking it. C. Identify age-related
subnetworks by growing subnetworks iteratively out from seed nodes.
2.3.3 Class relevance R
We measure class relevance as the Spearman correlation between subnetwork activity and age,
so that a subnetwork is considered age-related to the extent that its activity level either increases
or decreases monotonically with increasing age (Fig. 2-1B). Subnetwork activity is calculated as
the mean expression level of subnetwork genes. Thus, if the genes in subnetwork N have
normalized expression vectors $\{z_1, \ldots, z_n\}$, and c is the vector of ages for each sample, then the
activity is $a = \frac{1}{n}\sum_{i=1}^{n} z_i$, and the class relevance is $R = \lvert \mathrm{corr}(a, c) \rvert$.
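Both the edge weights of the previous section and the class relevance defined here rest on Spearman correlation, so the two can be sketched together. The code below is an illustration only: it assumes the absolute value of the correlation is used in both cases (so that increasing and decreasing trends count equally), and the helper names and toy expression values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def edge_weight(z_i, z_j):
    """Weight of an existing network edge: |corr(z_i, z_j)|."""
    return abs(spearman(z_i, z_j))


def activity(expression_vectors):
    """Mean expression of the subnetwork genes, computed per sample."""
    n = len(expression_vectors)
    return [sum(sample) / n for sample in zip(*expression_vectors)]


def class_relevance(expression_vectors, ages):
    """R = |corr(a, c)| between subnetwork activity a and sample ages c."""
    return abs(spearman(activity(expression_vectors), ages))


# Toy data: two genes measured in four samples, plus a third falling gene
g1 = [1.0, 2.0, 3.0, 5.0]
g2 = [2.0, 2.0, 4.0, 6.0]
g3 = [9.0, 7.0, 4.0, 1.0]
ages = [1, 1, 5, 9]          # tied ages are handled by average ranks

a = activity([g1, g2])
R = class_relevance([g1, g2], ages)
w13 = edge_weight(g1, g3)    # perfectly anti-correlated ranks
```

Because of the absolute value, the falling gene `g3` receives the same edge weight with `g1` as a perfectly co-rising gene would.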
2.3.4 Network modularity M
To define the modularity of a connected set of genes in a network, we use a weighted
generalization of the local measure proposed in Lancichinetti and Fortunato [120]. We calculate
the modularity for a subnetwork as the edge weight internal to the subnetwork divided by the
total edge weight of all subnetwork nodes, squared. For subnetwork N, we define the internal,
external, and total weight:
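$$w_{\mathrm{int}} = \frac{1}{2}\sum_{i,j \in N} w_{ij}$$
$$w_{\mathrm{ext}} = \sum_{i \in N,\; j \notin N} w_{ij}$$
$$w_{\mathrm{tot}} = w_{\mathrm{int}} + w_{\mathrm{ext}}$$
Then the modularity of N can be written as $M = \frac{w_{\mathrm{int}}}{(1 + w_{\mathrm{tot}})^{2}}$. For all subnetworks, M lies between 0
and 1.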
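A small sketch of the modularity calculation may help. It assumes the form M = w_int / (1 + w_tot)^2 and stores each undirected edge once, which is equivalent to summing over ordered pairs and halving; the function name and the toy weights are hypothetical.

```python
def modularity(subnetwork, weights):
    """Local weighted modularity of a set of genes.

    subnetwork: set of gene names
    weights: dict mapping frozenset({i, j}) -> edge weight w_ij,
             with each undirected edge stored exactly once
    """
    w_int = 0.0  # weight of edges with both endpoints inside the subnetwork
    w_ext = 0.0  # weight of edges with exactly one endpoint inside
    for edge, w in weights.items():
        inside = len(edge & subnetwork)
        if inside == 2:
            w_int += w
        elif inside == 1:
            w_ext += w
    w_tot = w_int + w_ext
    return w_int / (1.0 + w_tot) ** 2


# Toy weighted network: a chain a - b - c - d
weights = {
    frozenset({"a", "b"}): 0.8,
    frozenset({"b", "c"}): 0.6,
    frozenset({"c", "d"}): 0.2,
}
M = modularity({"a", "b", "c"}, weights)  # w_int = 1.4, w_ext = 0.2
```

Since w_int can never exceed w_tot, and w_tot is smaller than (1 + w_tot)^2, this quantity stays below 1 for any subnetwork.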
2.3.5 Comparing regular and modular subnetworks
To compare the performance of regular and modular subnetworks, we generated several
subnetworks of each type by adjusting algorithm parameters. For modular subnetworks, we set
the modularity coefficient β = 50, 100, 250, 500, or 1000 (significant subnetworks generated
using these parameters are called m1, m2, m3, m4 and m5). For regular networks we set β = 0,
and halted subnetwork growth at different score cut-off thresholds r = 0.01, 0.02, 0.05, 0.1 or 0.2
(groups of significant subnetworks are called r1, r2, r3, r4, and r5).
We generated modular subnetworks m1-m5 and regular subnetworks r1-r5 separately for
two different C. elegans aging microarray datasets: 104 microarrays of individual wild-type (N2)
worms over 7 ages (9-17 microarrays per age) [125], and 16 microarrays of pooled sterile (fer-
15) worms over 4 ages (4 microarrays per age) [133]. For each study, we grew subnetworks
seeded at every node in the functional interaction network, so that corresponding subnetworks
grown using different expression datasets could be directly compared. We used randomization
tests to determine which subnetworks were significantly associated with age in each study. For
further details, see Materials and Methods. Below, we compare these regular and modular
subnetworks in terms of their robustness across studies and performance on a machine learning
task.
2.3.6 Modular subnetworks are more robust across studies than regular
subnetworks
Comparing the modular subnetworks m1-m5 and the regular subnetworks r1-r5 derived from
both studies, we found that modular subnetworks identified as significant in one study were
highly likely to be significant in the other study (i.e., seed genes of significant modular
subnetworks were highly conserved across studies). Fig. 2-3 shows that 15-18% of significant
modular subnetworks were identified in both studies; in contrast, only 3-5% of significant
regular ones were.
For each modular and regular network type, we also calculated the significance of the overlap
between sets of significant seed genes using the hypergeometric test, and these values showed
the same trend (Fig. 2-3). While all subnetwork types were more conserved across studies than
would be expected by chance ($p < 10^{-3}$), modular subnetworks were much more conserved than
regular ones: they had enrichment p-values ranging from $10^{-84}$ to $10^{-137}$, while regular
subnetworks had p-values from $10^{-3}$ to $10^{-38}$.
While substantially more modular than regular subnetworks were conserved across studies, many
subnetworks were identified in only one study; this can be partially accounted for by noise in the
individual microarray studies, the fact that the two studies used different microarray platforms
and different strains of worm, and the fact that the current functional interaction network is not
complete and contains some errors.
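The percentage overlap reported for each subnetwork type is the size of the intersection of the two sets of significant seed genes divided by the size of their union. A minimal sketch, with a hypothetical function name and made-up seed gene labels:

```python
def seed_overlap(significant_a, significant_b):
    """Percentage overlap of two sets of significant seed genes:
    100 * |A intersect B| / |A union B|."""
    a, b = set(significant_a), set(significant_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return 100.0 * len(a & b) / len(union)


# Toy example: 2 shared seeds out of 5 distinct seeds overall
study_1 = {"g1", "g2", "g3", "g4"}
study_2 = {"g3", "g4", "g5"}
overlap = seed_overlap(study_1, study_2)  # → 40.0
```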
Figure 2-3. Modular subnetworks are highly conserved across studies.
Modular subnetworks m1-m5 are shown in green and regular subnetworks r1-r5 in blue. Bar height
shows the percentage overlap across studies for seed genes of significant modular and regular
subnetworks derived from the data in Golden et al. and Budovskaya et al.; this is calculated as the size of
the intersection of sets of significant seed genes from both studies, divided by the union. P-values above
each bar show the significance of the overlap calculated using the hypergeometric test.
2.3.7 Modular subnetworks trained on aging gene expression data from wild-type
worms successfully predict age in fer-15 worms
We compared the performance of single genes, regular subnetworks, and modular subnetworks
on a machine learning task: predicting worm age on the basis of gene expression levels (Fig. 2-
4). We acquired sets of significant genes from [125]; g1 is made up of all the genes considered
significant in that study, and g2 is the aging gene signature used for machine learning in [125]
(i.e., g2 is the 100 most significant genes from g1). Using machine learning features drawn from
gene sets g1-g2, regular subnetworks r1-r5, or modular subnetworks m1-m5 derived from the
larger microarray study [125], we trained support vector regression (SVR) algorithms to predict
the age of wild-type worms on the basis of gene expression (for details, see Materials and
Methods). We then tested the performance of the learned feature weights on an independent data
set in a different strain of worm (fer-15) [133]. Performance on the test set was quantified as the
squared correlation coefficient (SCC) between worm ages predicted by the SVR and true worm
ages (measuring performance in terms of mean-squared error would be inappropriate here,
because the worms in the training and test sets had different lifespans). All p values reported in
this section were calculated using the Wilcoxon rank-sum test.
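The following sketch illustrates why a correlation-based metric suits a test set whose time scale differs from the training set. It assumes the squared correlation coefficient is the squared Pearson correlation between true and predicted ages; the ages are invented for illustration.

```python
def mse(true, pred):
    """Mean-squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def scc(true, pred):
    """Squared Pearson correlation between true and predicted values."""
    n = len(true)
    mt, mp = sum(true) / n, sum(pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(true, pred))
    vt = sum((t - mt) ** 2 for t in true)
    vp = sum((p - mp) ** 2 for p in pred)
    return cov * cov / (vt * vp)


# Hypothetical ages (days): predictions are on a stretched time scale,
# as could happen when training and test strains have different lifespans.
true_age = [2.0, 5.0, 8.0, 11.0]
pred_age = [4.0, 10.0, 16.0, 22.0]  # exactly twice the true ages

error = mse(true_age, pred_age)     # → 53.5
agreement = scc(true_age, pred_age) # → 1.0
```

The predictions order and space the samples perfectly, which SCC rewards, while the mean-squared error penalizes the change of scale.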
To capture the typical performance of machine learners that used either genes or subnetworks as
features, we considered four different sizes of feature set (5, 10, 25, or 50 features). Then, for
each size of feature set, and for each set of genes (g1-g2) or subnetworks (r1-r5, m1-m5), we
performed 1000 tests. For example, for the 25-feature SVRs, and for the m1 significant
subnetworks, we randomly drew 25 subnetworks from m1, trained them on the wild-type worm
data, and then tested them on the fer-15 data – and repeated that process of drawing, training, and
testing 1000 times. Fig. 2-5 summarizes test results at each feature level, showing the typical
performance of the best sets of genes, regular subnetworks, and modular subnetworks. Full
results for every parameter setting are available in Figure 2-S1, and p-value comparisons in
Table 2-S1.
Over all tests, the SVRs using 25 or 50 modular subnetwork features (of the m1 and m3 types)
achieved the highest typical performance, with a median SCC of 0.91 between predicted and true
worm age; this is a statistically significant 7% and 26% improvement over the best performances
of regular subnetworks ($p < 10^{-83}$) and genes ($p < 10^{-202}$), respectively (Fig. 2-5).
Figure 2-4. Predicting worm age using machine learning.
The activities of genes or subnetworks (subnetwork activity is calculated as the mean activity of its
member genes) are used by Support Vector Regression algorithms to predict age on the basis of gene
expression. Performance is typically measured using both the mean-squared error of the difference
between true and predicted ages, and the squared correlation coefficient between true and predicted
ages.
2.3.8 Subnetworks vs. genes
Modular and regular subnetworks dramatically outperform significant genes across a range of
parameters. For example, using 25 features (Fig. 2-5), the best modular subnetworks have a
median SCC of 0.91 and the best regular subnetworks of 0.85, versus 0.70 for the 100-gene
signature. This result was consistent across feature levels and parameter settings, and is highly
significant for all tests: i.e., for every comparison between modular subnetwork features and
gene features, we have $p < 10^{-15}$. For all sizes of feature set, the best-performing subnetworks
(m3) always showed a median SCC at least 0.16 higher than the best-performing genes (g2), i.e.
at least a 24% improvement.
2.3.9 Modular vs. regular subnetworks
For all sizes of feature set, the median SCC of the best modular subnetwork type always
exceeded that of the best regular subnetwork type by 0.05-0.08, corresponding to a 6-10%
performance improvement (Fig. 2-5). The performance difference between the best modular
subnetworks and the best regular subnetworks is highly significant at all feature levels ($p < 10^{-32}$).
It was not only the best modular subnetworks that outperformed the best regular subnetworks; in
fact, modular subnetworks significantly outperformed the best regular subnetworks for most
parameter settings. With the exception of m5 (β = 1000), each modular subnetwork type
significantly outperforms the best regular subnetwork type at all feature levels. For three types of
modular subnetwork (m1-m3), the performance difference between them and the best regular
subnetworks is highly significant (ranksum $p < 10^{-26}$ for every comparison); m4 outperforms the
best regular subnetworks at $p < 10^{-5}$ for three feature levels, and at $p < 10^{-2}$ for 5 features; for m5,
there is no consistent trend (Fig. 2-S1). All pairwise comparisons (p-values) between regular and
modular subnetworks are available in Table 2-S1.
Figure 2-5. Subnetworks and genes predict the age of fer-15 worms.
Modular subnetworks are shown in green, regular subnetworks in blue, and gene sets in gray. This figure
shows the best-performing type of modular subnetworks, regular subnetworks, and genes at each
feature level. For modular subnetworks, this is type m3 at every feature level; for regular subnetworks,
type r3 at 5 and 10 features, r2 at 25 features, and r4 at 50 features; for genes, g2 at all feature levels.
Support Vector Regression algorithms using 5, 10, 25, or 50 features were trained to predict age on the
data from Golden et al. [125] and tested on Budovskaya et al. [133]. For each size of feature set, 1000
different Support Vector Regression learners were computed; curves show their median performance
(quantified using the squared correlation coefficient between true and predicted age in the bottom
panel), and error bars indicate the 95% confidence intervals for the medians (calculated using a
bootstrap estimate).
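The error bars described in this caption can be approximated with a percentile bootstrap. The sketch below is one common way to do it and is an assumption, since the text specifies only "a bootstrap estimate"; the function name, the number of resamples, and the SCC values are hypothetical.

```python
import random
from statistics import median


def bootstrap_median_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the median of each resample
    medians = sorted(median(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return medians[lo_idx], medians[hi_idx]


# Hypothetical SCC values from repeated train/test runs
scc_values = [0.88, 0.90, 0.91, 0.93, 0.89, 0.92, 0.91, 0.90, 0.94, 0.87, 0.91]
low, high = bootstrap_median_ci(scc_values)
```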
2.3.10 The role of the modularity coefficient in machine learning
Different values of β correspond to giving different proportional weights to the information in
gene expression data and to the prior knowledge about gene connectivity encoded in the
functional interaction network: for noisy microarray studies, or ones with few samples, we might
want to depend more on prior knowledge by choosing a high value for β.
For the Golden et al. dataset [125] that we used for training, we found that a value of β = 100
corresponds roughly to treating class relevance and modularity as equally important in the
expression for subnetwork score: in simulations where we generated subnetworks using either
modularity or class relevance alone as the scoring criterion (i.e. S = M or S = R), the median
modularity of the S = M subnetworks was two orders of magnitude smaller than the median class
relevance of the S = R ones, i.e., ‘good’ values for modularity are roughly 100 times smaller than
‘good’ values for class relevance.
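The scale argument can be written out in one line. The magnitudes below are purely illustrative and are not values reported in the text; only their ratio of roughly 100 is taken from the paragraph above, and the score is assumed to have the form $S = R + \beta M$.

$$R \sim 10^{-1},\quad M \sim 10^{-3} \quad\Longrightarrow\quad \beta M \approx R \iff \beta \approx \frac{R}{M} = \frac{10^{-1}}{10^{-3}} = 10^{2}$$

With β well below 100 the score is dominated by class relevance, and with β well above 100 it is dominated by modularity.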
As β becomes larger, the proportional contribution of class relevance to the expression for
subnetwork score becomes smaller – and so for large enough values of β, the algorithm will
behave essentially like other purely unsupervised network clustering algorithms that greedily
aggregate nodes around a seed to maximize modularity [120, 136, 137]. In our tests, subnetworks
generated using β = 50, 100, or 250 behaved virtually identically on the learning task; the
performance of β = 500 subnetworks was typically a bit lower; and that of β = 1000 ones lower
still. For large enough values of β, we would expect the typical performance of modular
subnetworks to fall below that of regular subnetworks, because supervised feature selection is
superior to unsupervised feature selection [138].
In the previous two sections, we established that modular subnetworks are more robust across
studies than regular subnetworks and perform better in a worm age prediction task. Modular
subnetworks grown using the coefficient β = 250 showed both the highest robustness across
studies and the best performance on the test set, so we chose to analyze them in greater detail.
For the remainder of the paper, we will explore the relation between these subnetwork
biomarkers (generated from the larger microarray study [125]) and worm aging. The full set of
these subnetworks is available in Table S2.
2.3.11 Modular subnetworks predict wild-type worm age with low mean-squared
error
Here, we show using 5-fold cross-validation that modular subnetworks grown using β = 250 can
predict the age of individual wild-type worms in the original dataset (104 worm microarrays over
7 ages) with low mean-squared error and a high squared correlation coefficient. Again, we used
support vector regression algorithms (SVRs) for all learning tasks.
Because it would be circular to predict age on the same dataset that was used to determine the
features [139], we first divided the wild-type worm aging dataset into 5 stratified folds for cross-
validation. We repeated the search for significant subnetworks 5 times, each time using 4/5 of
the data to select significant subnetworks and train SVRs, and then the remaining 1/5 as a test set
to evaluate the learned feature weights. We compared the performance of modular subnetworks
with that of the top 100 differentially expressed genes reported in [125]. To construct SVRs
using genes as features, we used the same 5 stratified folds – i.e., we used 4/5 of the data to
select the top 100 most significant genes and learn feature weights, and the remaining 1/5 as test
data, and repeated this process for each of the 5 folds. As in the original study [125], for each
fold we selected the top 100 significant genes by performing an F-test and applying a False
Discovery Rate [140] (FDR) correction.
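The key point of this protocol is that feature selection is repeated inside every fold, so the held-out samples never influence which features are chosen. A minimal sketch of that structure follows; the round-robin stratification and the callback names `select_features`, `train`, and `evaluate` are hypothetical stand-ins for the subnetwork search, SVR training, and scoring steps.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds, spreading each age group evenly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        for idx in members:
            folds[position % k].append(idx)  # deal samples out round-robin
            position += 1
    return folds


def cross_validate(labels, select_features, train, evaluate, k=5):
    """Feature selection and training see only the k-1 training folds."""
    folds = stratified_folds(labels, k)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        features = select_features(train_idx)   # e.g. significant subnetworks
        model = train(features, train_idx)      # e.g. fit a regression model
        scores.append(evaluate(model, features, test_idx))
    return scores


# Toy labels: 7 ages with 10 samples each (illustrative, not the real design)
ages = [age for age in range(7) for _ in range(10)]
folds = stratified_folds(ages, k=5)
```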
For four different sizes of feature set (5, 10, 25 or 50), we generated 1000 different SVRs using
either modular subnetworks or genes as features to capture their typical performance. All p-
values reported here were computed using the Wilcoxon ranksum test.
At every size of feature set (5, 10, 25 or 50), modular subnetworks significantly outperform
differentially expressed genes ($p < 10^{-28}$) according to the metrics of mean-squared error (MSE)
and squared correlation coefficient (SCC) between predicted age and true age. For example,
using feature sets of size 50, we obtained a median MSE of 7.9 for subnetworks vs. 11.2 for
genes ($p < 10^{-98}$), and a median SCC of 0.77 for subnetworks vs. 0.69 for genes ($p < 10^{-65}$). Fig.
2-6A shows the median performance of modular subnetworks and genes across all tests, and Fig.
2-6B shows the predictions of a typical SVR learner built using 50 modular subnetworks as
features. At every size of feature set, the MSE for genes was at least 1.76 higher than the
corresponding MSE for subnetworks (i.e., at least 22% higher) ($p < 10^{-28}$), and the SCC for
subnetworks was at least 0.05 higher ($p < 10^{-28}$).
Over all tests, the modular SVRs with 50 features achieved the best performance: a median SCC
of 0.77 and a median MSE of 7.9. This SCC is substantially lower than the highest one achieved
on the test set of pooled fer-15 worms in the last section (0.91) because predicting the age of an
individual worm is more difficult than predicting the age of a large pooled group of age-matched
worms (pooling removes individual variability).
Figure 2-6. Modular subnetwork biomarkers of aging predict the age of individual wild-type
worms.
A. Machine learners built from modular subnetworks or genes, predicting worm age in a cross-
validation task on the data from Golden et al. using 5, 10, 25, or 50 features. For each size of feature set,
1000 different Support Vector Regression learners were computed; curves show their median
performance (quantified using mean-squared error in the top panel, and the squared correlation
coefficient between true and predicted age in the bottom panel), and error bars indicate the 95%
confidence intervals for the medians (calculated using a bootstrap estimate). B. The performance of a
typical Support Vector Regression learner built using 50 modular subnetworks as features; true worm
age is shown on the x-axis, and predicted age on the y-axis.
2.3.12 Longevity genes play crucial roles in significant subnetworks
For these analyses, we compiled two sets of known longevity genes (see Materials and Methods,
Table S3): L1, a set of 233 genes that extend lifespan when perturbed, and L2, a larger set of 494
genes that either shorten or extend lifespan when perturbed.
2.3.13 Significant subnetworks are enriched for known longevity genes
We found that significant subnetworks derived using both C. elegans aging microarray studies
[125, 133] were significantly enriched for both sets of longevity genes, relative to the
background set of 12808 genes represented in the functional interaction network. All p-values
reported here were calculated using the hypergeometric test. For the Golden et al. [125] data, of
the 1957 genes that play a role in significant subnetworks, 65 are in L1 ($p < 10^{-6}$) and 124 are in
L2 ($p < 10^{-8}$), and of the 535 seed genes that produce significant subnetworks, 27 are in L1
($p < 10^{-5}$) and 45 are in L2 ($p < 10^{-6}$). For the Budovskaya et al. [133] study, subnetwork seeds were
highly enriched for known longevity genes, and the set of all subnetwork genes was slightly
enriched for them. Of the 1559 seed genes of significant subnetworks, 43 are in L1 (p = 0.003)
and 90 are in L2 ($p < 10^{-4}$), and of the 4158 genes represented in some subnetwork, 88 are in L1
(p = 0.048) and 181 are in L2 (p = 0.025).
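An enrichment p-value of this kind can be sketched with exact integer arithmetic. The code assumes the test is the usual one-sided (upper-tail) hypergeometric test, which the text does not state explicitly; the function name is hypothetical, and the counts in the usage line are the ones quoted above for L1 in the Golden et al. data.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from a
    background of N genes, K of which belong to the category of interest."""
    total = comb(N, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(n, K) + 1))
    return tail / total  # Python divides arbitrarily large integers exactly


# 65 of the 1957 subnetwork genes are among the 233 L1 longevity genes,
# against a background of 12808 network genes.
p_l1 = hypergeom_upper_tail(65, 1957, 233, 12808)

# Small case that can be checked by hand: (4*15 + 1*6) / 252
p_small = hypergeom_upper_tail(3, 5, 4, 10)
```

For comparison, about 36 L1 genes would be expected among 1957 genes drawn at random from this background (1957 × 233 / 12808).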
2.3.14 Examples of significant subnetworks containing known longevity genes
While HTP experimental methods have helped to identify hundreds of worm longevity genes
[64], their aging-related functions remain poorly understood. We found that subnetwork
biomarkers are highly enriched for longevity genes. Thus, subnetworks can provide a molecular
context for these genes in aging: they can be applied to uncover new connections between
different longevity genes, or to assign putative aging-related functions to them.
In Figure 2-7, we show several representative examples of significant subnetworks
derived from the Golden et al. [125] data that involve multiple known longevity genes. The
complete list is given in Table S2; individual NAViGaTOR XML [141] and PSI-MI XML [142]
files for each subnetwork are available from the website
http://www.cs.utoronto.ca/~juris/data/GB10/. Subnetwork A involves longevity genes vit-2
and vit-5. B has known longevity genes age-1, daf-18, and vit-2; previous work has uncovered
that a mutation in daf-18 will suppress the lifespan-extending effect of an age-1 mutation [35]. C
contains longevity genes rps-3 and skr-1, which are involved in protein anabolic and catabolic
processes, respectively. Subnetwork D contains longevity genes unc-60 and tag-300, which are
both involved in locomotion. E contains longevity genes fat-7 and elo-5, which are involved in
fatty acid desaturation and elongation. Subnetwork F has longevity genes rps-22 and rha-2, and
G has longevity genes blmp-1, his-71, and Y42G9A.4. Both blmp-1 and his-71 are involved in
DNA binding.
Figure 2-7. Some examples of significant longevity subnetworks.
Examples of significant modular subnetworks from Golden et al. [125] containing multiple known
longevity genes (from L2, see Materials and Methods). Edge width is proportional to gene-gene co-
expression, node size is proportional to the Spearman correlation between gene expression and age,
and known longevity genes are indicated by green circles.
2.3.15 Modular subnetworks participate in many different age-related biological
processes
Aging is highly stochastic and affects many distinct biochemical pathways. We analyzed the
union of all genes in significant modular subnetworks using biological process categories from
the Gene Ontology [74] (GO) and pathways from the Kyoto Encyclopedia of Genes and
Genomes [143] (KEGG) databases to determine their relation to known mechanisms of aging.
Full results are given in Tables 2-1 and 2-2; all functions and pathways shown in the table and
discussed below are significant at p < 0.05 after an FDR correction.
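A multiple-testing correction of this kind can be sketched as below. The code assumes the Benjamini-Hochberg procedure, which is the most common FDR method but is not named explicitly in the text; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for four categories
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
significant = [a < 0.05 for a in adj]
```

In this toy case only the first category remains significant at 0.05 after correction, although three of the four raw p-values were below 0.05.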
In total, we identified 27 KEGG pathways and 37 non-redundant GO biological processes
(see Materials and Methods) that were significantly enriched for subnetwork genes. To test
whether these pathways and processes were also related to aging, we calculated the significance
of their overlap with the set of experimentally determined longevity genes (Table S3). We found
that roughly one third of the GO biological processes (12 of 37) and KEGG pathways (10 of 27)
associated with subnetworks were significantly enriched for longevity genes (p < 0.05). Aging-
associated GO categories enriched for subnetwork genes include ‘locomotory behaviour,’ which
has recently been proposed as a biomarker of physiological aging [125], and ‘determination of
adult life span’; KEGG pathways include ‘cell cycle’ and several metabolic pathways (including
‘citrate cycle,’ ‘glycolysis’).
Table 2-1. Gene Ontology biological process categories enriched in the set of genes
represented in modular subnetworks.
All categories shown are significant at p <0.05 after an FDR correction for multiple testing. GO
categories written in italics are also enriched for known longevity genes (Table S3).
Gene Ontology biological process P-Value
Translation 6.45E-17
Hermaphrodite genitalia development 1.20E-16
Embryonic cleavage 1.37E-15
Germline cell cycle switching, mitotic to meiotic cell cycle 8.32E-14
Locomotory behaviour 1.84E-13
Meiosis 1.10E-11
Positive regulation of multicellular organism growth 4.25E-11
Morphogenesis of an epithelium 3.85E-06
Protein catabolic process 1.13E-05
Phosphate transport 4.99E-04
Negative regulation of multicellular organism growth 8.07E-04
Ubiquitin-dependent protein catabolic process 1.94E-03
Nucleosome assembly 1.97E-03
Establishment of nucleus localization 2.37E-03
Tricarboxylic acid cycle 3.26E-03
DNA replication 4.64E-03
Protein transport 5.01E-03
Energy coupled proton transport, against electrochemical gradient 5.02E-03
Leucyl-tRNA aminoacylation 5.02E-03
Collagen and cuticulin-based cuticle development 5.12E-03
Organelle organization and biogenesis 5.19E-03
Chromosome segregation 7.48E-03
mRNA metabolic process 8.44E-03
Protein import into nucleus 1.15E-02
Purine base biosynthetic process 1.15E-02
Sulfur compound biosynthetic process 1.40E-02
DNA repair 1.45E-02
Determination of adult life span 1.74E-02
Threonine metabolic process 1.75E-02
Water-soluble vitamin biosynthetic process 1.78E-02
ATP synthesis coupled proton transport 3.14E-02
rRNA processing 3.85E-02
Isoleucyl-tRNA aminoacylation 4.02E-02
Methionyl-tRNA aminoacylation 4.02E-02
Valyl-tRNA aminoacylation 4.02E-02
Embryonic pattern specification 4.04E-02
Regulation of cell cycle 4.04E-02
Table 2-2. KEGG pathways enriched in the set of genes represented in modular
subnetworks.
All categories shown are significant at p <0.05 after an FDR correction for multiple testing. KEGG
pathways written in italics are also enriched for known longevity genes (Table S3).
KEGG Pathway P-Value
Ribosome 2.17E-27
Metabolic pathways 2.70E-15
Proteasome 2.33E-10
Pyrimidine metabolism 1.34E-09
Purine metabolism 7.08E-07
DNA replication 1.54E-06
Nucleotide excision repair 1.81E-05
Aminoacyl-tRNA biosynthesis 2.80E-05
Cell cycle 4.37E-05
Glutamate metabolism 1.54E-04
Glycolysis / Gluconeogenesis 2.97E-04
Citrate cycle (TCA cycle) 5.41E-04
Methionine metabolism 1.25E-03
Ubiquitin mediated proteolysis 7.19E-03
Pyruvate metabolism 7.27E-03
Base excision repair 7.38E-03
Glyoxylate and dicarboxylate metabolism 7.39E-03
Arginine and proline metabolism 8.35E-03
Glycine, serine and threonine metabolism 8.38E-03
Pentose phosphate pathway 1.23E-02
Valine, leucine and isoleucine biosynthesis 1.30E-02
One carbon pool by folate 1.30E-02
RNA polymerase 1.76E-02
Alanine and aspartate metabolism 1.76E-02
Non-homologous end-joining 2.15E-02
Selenoamino acid metabolism 2.17E-02
Mismatch repair 2.20E-02
2.3.16 Modular subnetworks can be used to annotate longevity genes with novel
functions
An important advantage of subnetwork over single-gene biomarkers is that they can be applied to
infer novel functions for subnetwork members [119]. Most worm longevity genes were identified
in HTP RNA interference screens, and thus many remain poorly characterized. And though
several longevity genes do have some previously known functions, their aging-related function is
still unknown.
We used modular subnetworks (derived from the expression data in [125]) to assign putative
functions in aging to known longevity genes by annotating them with the Gene Ontology (GO)
Biological Process categories that their associated subnetworks were significantly enriched for.
In total, we provided 49 longevity genes with novel annotations; nine of these genes had no
previous Gene Ontology biological process annotations (apart from those electronically inferred)
or well-characterized orthologs (named NCBI KOGs [144]). The most significant novel
annotation for each longevity gene is given in Table 2-3, as an example of our approach (poorly
characterized genes are indicated with an asterisk). The full list of all longevity gene GO
categories inferred by subnetwork annotations is available in Table S4, and on the website
http://www.cs.utoronto.ca/~juris/data/GB10/. All GO categories in the tables are significant
with p < 0.05 (after an FDR correction), and annotated to at least 25% of subnetwork genes.
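The annotation step can be sketched as a filter over the subnetworks that contain a given longevity gene. This is an illustration under stated assumptions: adjusted enrichment p-values are taken as precomputed inputs, and the function name, data layout, and toy identifiers are all hypothetical.

```python
def novel_annotations(gene, subnetworks, go_members, adjusted_p, known,
                      alpha=0.05, min_fraction=0.25):
    """Return {GO term: best adjusted p-value} for terms newly assigned to gene.

    subnetworks: dict subnetwork name -> set of member genes
    go_members:  dict GO term -> set of genes annotated with that term
    adjusted_p:  dict (subnetwork name, GO term) -> FDR-adjusted p-value
    known:       set of GO terms already annotated to the gene
    """
    result = {}
    for name, members in subnetworks.items():
        if gene not in members:
            continue  # only subnetworks containing the gene can annotate it
        for term, annotated in go_members.items():
            p = adjusted_p.get((name, term), 1.0)
            coverage = len(members & annotated) / len(members)
            if p < alpha and coverage >= min_fraction and term not in known:
                result[term] = min(p, result.get(term, 1.0))
    return result


# Toy inputs (all identifiers are made up)
subnetworks = {"s1": {"lg1", "x", "y", "z"}, "s2": {"x", "y"}}
go_members = {"termA": {"x", "y"}, "termB": {"z"}}
adjusted_p = {("s1", "termA"): 0.01, ("s1", "termB"): 0.2}
inferred = novel_annotations("lg1", subnetworks, go_members, adjusted_p, known=set())
```

Here `termA` is assigned to `lg1` because it covers half of subnetwork `s1` and is significantly enriched, whereas `termB` meets the coverage threshold but not the significance threshold.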
Table 2-3. Assigning putative functions to longevity genes.
The first column lists longevity genes, column 2 shows the most highly enriched Gene Ontology
biological process in subnetworks containing that gene, and the p-value of the enrichment
(hypergeometric test with FDR correction) is shown in column 3. Genes with no previously known
manual GO BP annotation are indicated with an asterisk.
Gene GO biological process P-Value
rpl-4 cellular macromolecular complex assembly 2.16E-02
vit-5 phosphate transport 3.70E-05
rha-2 cellular macromolecular complex assembly 2.16E-02
C06E7.1 protein complex assembly 2.26E-02
C25H3.6* transcription from RNA polymerase II promoter 4.87E-02
pat-4 chromatin assembly or disassembly 4.92E-03
C33H5.18 chromatin assembly or disassembly 3.02E-03
unc-60 protein complex assembly 2.26E-02
vit-2 phosphate transport 3.70E-05
ril-1* cell adhesion 3.57E-02
CD4.4* ribosome biogenesis 1.85E-02
eif-3.F organelle organization and biogenesis 3.75E-03
F09F7.5* pigment metabolic process 5.01E-03
pab-2 chromatin assembly or disassembly 8.99E-05
hpk-1 growth 2.78E-02
mdh-1 lipid metabolic process 3.36E-02
blmp-1 chromatin assembly 7.22E-04
daf-3 protein complex assembly 2.26E-02
F28B3.5* amine metabolic process 3.04E-03
rps-23 tRNA aminoacylation for protein translation 1.04E-03
F30A10.10 chromatin assembly or disassembly 4.95E-02
dlk-1 transcription from RNA polymerase II promoter 4.87E-02
F40F8.5* nucleobase metabolic process 5.08E-05
elo-5 lipid metabolic process 4.34E-02
F43G9.3 water-soluble vitamin metabolic process 2.04E-03
ife-1 organelle organization and biogenesis 3.75E-03
spt-4 chromatin assembly or disassembly 8.40E-05
aakb-1 nucleobase, nucleoside and nucleotide metabolic process 1.45E-03
dod-22* gene expression 1.85E-02
F57B9.3 amine metabolic process 2.83E-02
cdc-25.1 amine metabolic process 1.90E-02
nac-3 cellular macromolecular complex assembly 2.16E-02
lin-23 cytoskeleton organization and biogenesis 2.59E-02
K10D2.2 anion transport 5.54E-04
ifg-1 organelle organization and biogenesis 3.75E-03
sir-2.1 lipid transport 2.44E-04
wip-1* chromatin assembly or disassembly 1.99E-02
skn-1 chromatin assembly or disassembly 3.56E-04
vha-6 regulation of metabolic process 3.84E-02
W01B11.3 establishment of protein localization 1.93E-04
W06B11.3* fatty acid metabolic process 6.78E-03
rpl-30 chromatin assembly or disassembly 3.02E-03
tag-300 cytoskeleton organization and biogenesis 2.59E-02
Y42G9A.4 chromatin assembly or disassembly 3.32E-02
gdi-1 secondary metabolic process 1.98E-02
spl-1 sulfur metabolic process 2.33E-02
pod-1 intracellular protein transport 2.04E-02
lrs-2 intracellular protein transport 2.04E-02
let-60 nucleotide-excision repair 1.11E-02
2.4 Conclusions
Aging results not from individual genes acting in isolation from one another, but from the combined
activity of sets of associated genes representing a multiplicity of different biological pathways.
For the most part, the organization and function of these aging-related pathways remain poorly
understood. In particular, the role of most longevity genes in aging is still unknown.
In this work, we showed that high-throughput information about which genes are
likely associated with which other genes – in the form of a functional interaction network – can
yield new insights into the transcriptional programs of aging. We identified modular
subnetworks associated with worm aging – highly interconnected groups of genes that change
activity with age – and showed that they are effective biomarkers for predicting worm age on the
basis of gene expression. In particular, they outperform biomarkers of aging based on the activity
of single genes or regular subnetworks. Furthermore, we found that modular subnetwork
biomarkers were significantly enriched for known longevity genes. Thus, modular subnetwork
biomarkers can provide a molecular context for each longevity gene in aging – in effect, each
longevity subnetwork constitutes a biological hypothesis as to which genes interact with known
longevity genes in some common age-related function.
This work is the first to use a new subnetwork performance criterion that incorporates modularity
into the expression for subnetwork score, and the first to integrate network information with gene
expression data to identify biomarkers of aging. The subnetwork biomarkers identified by our
method are highly conserved across studies, and this opens the door to studying longevity genes
– or indeed, any age-related gene set of interest – over a range of different health and disease
conditions. In particular, we are interested in investigating the different subnetworks associated
with longevity genes in diseases like cancer, and in aging across species.
2.5 Materials and Methods
2.5.1 Code
Code for most simulations was written in Matlab R2008b and is available on the website,
http://www.cs.utoronto.ca/~juris/data/GB10/. For support vector regression experiments, we
used the Matlab wrapper to LIBSVM [145]. We analyzed gene sets for enriched gene ontology
using the topGO package (ver. 1.10.1, [146]) in R 2.8.0. Subnetworks were visualized using
NAViGaTOR ver. 2.1.7 (http://ophid.utoronto.ca/navigator; [141]).
2.5.2 Data sets
Microarray experiments. Aging expression datasets for two recent studies were downloaded
from GEO [147]. From Golden et al. [125], we obtained data for 104 microarrays of individual
wild-type (N2) worms over 7 ages (9-17 microarrays per age). From Budovskaya et al. [133],
we obtained 16 microarrays of pooled sterile (fer-15) worms over 4 ages (4 microarrays per age).
For both studies, we discarded probesets containing more than 30% missing values in at least one
age group.
Interaction network. Functional interactions for C. elegans ORFs were downloaded from
WormNet [148]. The network used in our analyses consists of the largest connected component
of the network formed from all WormNet ORFs represented by some probeset in two separate
worm aging microarray studies [125, 133], and represents 12808 distinct C. elegans ORFs and
275525 interactions.
Longevity genes. We obtained L1, our high confidence set of genes that extend lifespan when
perturbed or knocked out, from the recent list compiled in [149]. In total, 233 genetic
perturbations that extend lifespan belonged to the largest connected component of WormNet
made up of genes covered by both expression studies. We constructed L2, our larger set of
longevity genes, by taking the union of L1 and the set of mutations that affect worm lifespan
downloaded from the GenAge database [64]. This yielded 494 genes that either shorten or extend
lifespan when perturbed (and are annotated to the network we use). Both gene lists are available
in Supplementary Table S3.
2.5.3 Subnetwork analyses
Subnetwork search parameters
Seed genes: Previous methods [52, 95] seed the subnetwork search process at a random subset of
genes on the network; a problem with this approach is that different choices of seed genes might
yield substantially different significant subnetworks. To avoid this bias, we grew subnetworks
seeded from every node of the interaction network. For all machine learning tests, the total set of
significant subnetworks was reduced to a non-redundant set, i.e. if two significant subnetworks
shared more than 25% overlap (as measured with the Jaccard index), the lower-scoring
subnetwork was deleted from the set of candidate features.
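As an illustration, the redundancy filter can be sketched in Python (the original analysis used Matlab). This is a minimal sketch under one reading of the rule: subnetworks are visited from highest to lowest score, and a subnetwork is kept only if its Jaccard index with every subnetwork already kept is at most 0.25. The function names and the toy gene sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two gene sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def non_redundant(subnetworks, max_overlap=0.25):
    """Greedy filter over (score, genes) pairs.

    Assumption: subnetworks are visited in order of decreasing score, and one
    is kept only if it overlaps every previously kept subnetwork by at most
    max_overlap (Jaccard index).
    """
    kept = []
    for score, genes in sorted(subnetworks, key=lambda x: x[0], reverse=True):
        if all(jaccard(genes, k_genes) <= max_overlap for _, k_genes in kept):
            kept.append((score, set(genes)))
    return kept


# Toy example with made-up gene names.
candidates = [
    (0.9, {"a", "b", "c", "d"}),
    (0.7, {"a", "b", "c", "e"}),  # Jaccard 3/5 with the first, so dropped
    (0.5, {"x", "y", "z"}),       # disjoint from the first, so kept
]
filtered = non_redundant(candidates)
```

Under this greedy reading, a subnetwork that overlaps only with an already-deleted subnetwork is retained; other readings of the rule are possible.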
Stopping criteria: For modular subnetworks grown iteratively out from a seed node, the search
was halted when there were no nodes that would increase both subnetwork modularity and class
relevance. For regular subnetworks, the search was halted when there were no nodes that would
increase the subnetwork score (class relevance) past some threshold r (r = 0.01, 0.02, 0.05, 0.1
and 0.2 for regular subnetworks r1-r5), or when there were no remaining local nodes (i.e., nodes
at most two edges away from the seed).
Identifying significant subnetworks
We calculated subnetwork significance using both self-contained and competitive gene set tests
[73]. Our competitive test is identical to that used in [52], and our self-contained test is more
stringent – we use the method suggested in [95].
For the self-contained test, we randomized the assignment of ages to worms (samples), and then
repeated the search for subnetworks starting from each network node. The subnetwork score of
the original subnetwork determined from the true data was then ranked against the corresponding
subnetworks determined from the artificial data that were seeded from the same gene. This process
was repeated 1000 times.
For the competitive test, we generated 100 artificial interactomes by randomizing the assignment
of gene names to nodes on the functional interaction network and recalculating the weight for
each network edge based on the new genes that flanked it (only for modular networks – regular
networks do not use edge information). We repeated the search for significant subnetworks on
each artificial interactome. Scores for subnetworks determined from the true interactome were
ranked against the scores of all subnetworks generated from the artificial interactomes.
Subnetworks were considered significant if they achieved p < 0.001 on the local self-contained
test and p < 0.05 on the global competitive test.
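The ranking step shared by both tests can be sketched as an empirical p-value. The sketch below is in Python rather than the Matlab used for the original analysis, and it assumes the common (exceedances + 1) / (permutations + 1) estimator, which the text does not specify. It does not reproduce the subnetwork search itself; `null_scores` stands for scores obtained from the randomized data.

```python
def empirical_p(observed, null_scores):
    """Empirical p-value: share of null scores at least as large as observed.

    Assumption: the +1 correction is used so that p is never exactly zero.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def is_significant(p_self, p_comp, alpha_self=0.001, alpha_comp=0.05):
    """Apply both thresholds from the text: local self-contained test and
    global competitive test."""
    return p_self < alpha_self and p_comp < alpha_comp
```

With 1000 permutations and this estimator, a subnetwork passes the self-contained threshold only if no permuted score reaches the observed score.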
2.5.4 Machine learning comparisons
We used ε-insensitive support vector regression (SVR) algorithms [150] to learn worm age as a
function of the activity of regular subnetworks, modular subnetworks or differentially expressed
genes. All SVRs were trained using a linear kernel and the default parameters provided by
LIBSVM [145]. For SVR features made up of subnetworks, subnetwork activity for a sample
was calculated as the mean activity of all the genes in the subnetwork.
2.5.5 GO and KEGG enrichment analyses
The union of all genes present in some significant modular subnetwork (β = 250; derived using
data from [125]) was compared with the background network, i.e. the set of 12808 genes present
in the largest connected component of the network formed from all WormNet ORFs represented
by some probeset in both microarray studies [125, 133].
Because there is a lot of redundancy in the Gene Ontology tree, we used the ‘elim’ method [146]
to determine the most specific significant biological process categories (i.e., those at the deepest
level of the tree), and then controlled for multiple testing using an FDR [140] cut-off of 0.05. For
KEGG, we calculated an enrichment p-value for each term using the hypergeometric test, and
again controlled for multiple testing using an FDR cut-off of 0.05.
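For illustration, the KEGG enrichment calculation can be sketched in Python using only the standard library. The hypergeometric tail is computed exactly from binomial coefficients. The FDR adjustment shown is the Benjamini-Hochberg step-up procedure, which is an assumption here since the text only cites an FDR method; the function names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, K, n, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated to the term."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A term would then be reported as enriched when its adjusted p-value is at most 0.05, with N equal to the 12808 background genes.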
2.6 Abbreviations
SVR: Support Vector Regression; SCC: Squared correlation coefficient; MSE: Mean-squared
error.
2.7 Supplementary Materials
Supplementary figures and tables can be accessed online at
http://genomebiology.com/2010/11/2/R13.
2.8 Acknowledgments
This work was supported by Genome Canada via the Ontario Genomics Institute, the Canada
Foundation for Innovation (Grant Nos. 12301 and 203383), and IBM to IJ. We thank K. Brown
and D. Tweed for their helpful comments.
3 In silico drug screen in mouse liver identifies
candidate calorie restriction mimetics
This chapter is based on:
Kristen Fortney, Eric Morgen, Max Kotlyar, and Igor Jurisica (2012). In silico drug screen in
mouse liver identifies candidate calorie restriction mimetics. Rejuvenation Research 15(2).
3.1 Abstract
Calorie restriction (CR) extends lifespan in mammals and delays the onset of age-related
diseases, including cancer and diabetes. Drugs that target the same genes and pathways as CR
may have enormous therapeutic potential. Recently, genome-scale data on the responses of
human cell lines to over 1000 drug treatments have become available. Here we integrate these
data with gene expression signatures of CR in mouse liver to generate a prioritized list of
candidate CR mimetics. We identify 14 drugs that reproduce the effects of CR at the
transcriptional level.
3.2 Introduction
Direct screens for testing the effect of drugs on mouse lifespan – in which mice are treated with a
drug, tracked for several years, and their lifespan curve compared with that of control animals –
are time-consuming, expensive, and methodologically challenging [1]. As a consequence, few
drugs have been screened in this way. For example, only 17 compounds have been (or are being)
evaluated as part of the National Institute on Aging’s Intervention Testing Program [2].
Because of these limitations, there is a pressing need for faster and higher-throughput surrogate
assays to help identify and prioritize new drug candidates for healthspan extension. A promising
alternative to direct screening of lifespan is expression-based screening for calorie restriction
(CR) mimetics [3-6]. CR is one of the most reproducible and effective lifespan interventions; it
extends lifespan in model organisms from yeast to mammals, and delays the age of onset for
many diseases of aging, including cancer and diabetes [7]. Thus, drugs that mimic CR at the
transcriptional level may be of great therapeutic value [8]. In an expression-based screen, mice
are treated with drug, tissue samples are taken, and gene expression changes measured using
microarrays and compared to those induced by CR. Recent screens found that the drugs
metformin and resveratrol reproduce many gene changes seen with CR [4,9,10].
In this work, we develop an in silico version of expression-based drug screening for CR
mimetics that allows us to test hundreds of drugs at once. We collect nine previously published
transcriptional signatures of CR in mouse liver and screen each one against the Connectivity
Map [11], a public resource containing genome-scale data on the responses of human cell lines to
over 1000 drug treatments. We then conduct a meta-analysis and identify 14 drugs that
consistently rank among the top CR mimics across multiple studies.
3.3 Materials and Methods
3.3.1 Code
Code for all analyses was written in R 2.13.0. Several Bioconductor 2.8 packages were used; we
normalized raw Affymetrix CEL files with affy [151], used limma [152] to identify differentially
expressed probesets, and converted mouse IDs (array-specific, GenBank, or MGI) to human HG-
U133A probeset IDs for Connectivity Map analysis using annotationTools [153]. The drug-drug
network was visualized using NAViGaTOR 2.2.1 [141].
3.3.2 Drug-drug interaction network
We downloaded the DN drug-drug interaction network – where two drugs share an edge if they
share a common mode of action – from MANTRA [154].
3.3.3 Acquiring transcriptional signatures of calorie restriction
We collected gene signatures of CR from published studies [5, 155-161]. Our analysis pipeline
differed depending on whether the source publications made their raw data available, as
described below.
Identifying differentially expressed genes from raw microarray data. For publications where
Affymetrix CEL files were available [155-157], we re-analyzed the raw data to derive lists of
genes significantly up- or down-regulated following CR. We normalized CEL files using
the RMA method [162] implemented in affy [151], and identified differentially expressed probe
sets using the empirical Bayes method in limma [152].
Curating differentially expressed gene lists from published papers. For publications where no
raw Affymetrix data were available [5, 158-161], we downloaded lists of genes reported by the
authors to be differentially expressed (in paper text, tables, or supplementary materials). Where
fold change and FDR-corrected p-values were available, we used these data to filter gene lists.
For both types of signature, we removed genes with FDR values greater than 0.05 and (positive
or negative) fold change less than 1.25, sorted the remaining genes by FDR, and retained only
the top 250 up-regulated and the top 250 down-regulated genes for further analysis.
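The filtering rule can be sketched as follows, in Python rather than the R used for the original analysis. The sketch assumes a signed fold-change convention (for example, -1.5 for a 1.5-fold decrease) and assumes that a gene is retained only when it passes both the FDR and the fold-change threshold; the record layout and function name are hypothetical.

```python
def build_signature(genes, fdr_max=0.05, fc_min=1.25, top_n=250):
    """Return (up, down) gene ID lists for one CR signature.

    genes: list of dicts with keys 'id', 'fc' (signed fold change; negative
    values denote down-regulation) and 'fdr'.
    """
    # Keep genes passing both thresholds (an assumed reading of the rule).
    passed = [g for g in genes if g["fdr"] <= fdr_max and abs(g["fc"]) >= fc_min]
    # Most significant genes first.
    passed.sort(key=lambda g: g["fdr"])
    up = [g["id"] for g in passed if g["fc"] > 0][:top_n]
    down = [g["id"] for g in passed if g["fc"] < 0][:top_n]
    return up, down


# Toy records with made-up values.
toy = [
    {"id": "A", "fc": 2.0, "fdr": 0.01},
    {"id": "B", "fc": -1.5, "fdr": 0.001},
    {"id": "C", "fc": 1.1, "fdr": 0.001},  # fold change too small
    {"id": "D", "fc": 3.0, "fdr": 0.2},    # FDR too large
    {"id": "E", "fc": 1.3, "fdr": 0.005},
]
up, down = build_signature(toy)
```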
3.3.4 Connectivity map analysis of CR signatures
Mapping mouse CR signatures to human probeset IDs. We mapped mouse gene IDs to human
Affymetrix HG-U133A IDs for connectivity map analysis following previously established
protocols [75]. First, mouse CR signature genes were mapped to Entrez Gene IDs using either
the org.Mm.eg.db Bioconductor 2.8 library (for GenBank or MGI identifiers) or Affymetrix
annotation files Mouse430_2.na31.annot.csv or MG_U74Av2.na32.annot.csv (for probeset IDs).
Mouse Entrez Gene IDs were then mapped to human Entrez Gene IDs using homologene.data
(release 65; www.ncbi.nlm.nih.gov/homologene), and finally to HGU133A IDs using the
Affymetrix annotation file HG-U133A.na31.annot.csv.
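The three-step mapping amounts to chaining lookup tables. A minimal Python sketch follows; it assumes each annotation source has already been loaded into a dictionary from one identifier to a list of identifiers. The dictionary contents below are placeholders, not real annotations, and the function name is hypothetical.

```python
def map_signature(mouse_ids, to_mouse_entrez, mouse_to_human, human_to_probesets):
    """Map mouse signature IDs to human probeset IDs through Entrez Gene IDs.

    Identifiers with no entry at any step are silently dropped, and
    one-to-many mappings are expanded.
    """
    probesets = set()
    for mouse_id in mouse_ids:
        for m_entrez in to_mouse_entrez.get(mouse_id, ()):
            for h_entrez in mouse_to_human.get(m_entrez, ()):
                probesets.update(human_to_probesets.get(h_entrez, ()))
    return sorted(probesets)


# Placeholder tables standing in for the annotation files and HomoloGene.
to_mouse_entrez = {"mouse_probe_1": ["m1"], "mouse_probe_2": ["m2"],
                   "mouse_probe_3": ["m3"]}
mouse_to_human = {"m1": ["h1"], "m2": ["h2"]}  # m3 has no human homolog here
human_to_probesets = {"h1": ["hs_probe_a", "hs_probe_b"], "h2": ["hs_probe_b"]}

mapped = map_signature(
    ["mouse_probe_1", "mouse_probe_2", "mouse_probe_3", "unknown"],
    to_mouse_entrez, mouse_to_human, human_to_probesets,
)
```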
Acquiring drug connectivity scores for CR signatures. For each CR signature, mean connectivity
scores for 1309 drugs were calculated as previously described [75] using data on 6100 drug
treatments downloaded from Connectivity Map build 02 at
http://www.broadinstitute.org/cmap/. Drug mean connectivity scores were then converted to
ranks. The connectivity score quantifies the extent to which a drug treatment mimics the query
signature and is based on the Kolmogorov-Smirnov statistic.
3.3.5 Meta-analysis of drug-response data
Combining ranked lists of drugs to identify CR mimetics. We adapted the Rank Product method [67]
to the ranked connectivity scores to identify drugs that consistently mimic CR at the
transcriptional level. For each drug, we calculated the product of its ranks in all CR signatures.
Computing p-values. We randomly permuted the assignment of connectivity scores to drugs for
the 6100 instances (drug treatments), recalculated mean connectivity scores and drug ranks for
1309 drugs in each signature, and re-calculated randomized rank products 10000 times to
estimate p-values and false discovery rates.
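A simplified Python sketch of the rank product and its permutation null is given below. It is not the procedure described above in full: instead of permuting connectivity scores across the 6100 instances and recomputing drug means, it shuffles the final drug ranks within each signature, which is an assumption made to keep the example short. Function names are hypothetical.

```python
import random
from math import prod


def rank_products(rank_lists):
    """rank_lists: one dict per signature mapping drug -> rank (1 = top mimic)."""
    drugs = rank_lists[0].keys()
    return {d: prod(ranks[d] for ranks in rank_lists) for d in drugs}


def permutation_pvalues(rank_lists, n_perm=1000, seed=0):
    """Estimate, for each drug, how often a shuffled rank product is at least
    as small as the observed one (simplified null, see lead-in)."""
    rng = random.Random(seed)
    observed = rank_products(rank_lists)
    drugs = list(observed)
    counts = dict.fromkeys(drugs, 0)
    for _ in range(n_perm):
        shuffled = []
        for ranks in rank_lists:
            values = [ranks[d] for d in drugs]
            rng.shuffle(values)
            shuffled.append(dict(zip(drugs, values)))
        null = rank_products(shuffled)
        for d in drugs:
            if null[d] <= observed[d]:
                counts[d] += 1
    return {d: (counts[d] + 1) / (n_perm + 1) for d in drugs}


# Toy ranks for three drugs in three signatures.
toy_ranks = [
    {"x": 1, "y": 2, "z": 3},
    {"x": 1, "y": 3, "z": 2},
    {"x": 2, "y": 1, "z": 3},
]
products = rank_products(toy_ranks)
```

Small rank products indicate drugs that rank near the top in most signatures; false discovery rates would then be derived from the same permuted rank products.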
3.4 Results
3.4.1 Transcriptional signatures of CR
We obtained nine transcriptional signatures of CR in mouse liver by collecting and analyzing
data from eight previous publications (we collected two signatures from Tsuchiya et al. [156],
one for wild-type and one for dwarf mice). The mice used to generate the CR signatures came
from both sexes and a variety of ages and genetic backgrounds, and CR mice consumed between
56% and 70% of the calories of the matched control group, depending on the study. For each mouse
CR signature, we constructed an orthologous human signature made up of Affymetrix HG-U133A
probe IDs for Connectivity Map analysis (see Materials and Methods). For each human
CR signature, we calculated mean connectivity scores for the 1309 drugs in the Connectivity
Map collection [75]. Connectivity scores range between -1 and 1; a high, positive mean
connectivity score indicates that drug treatment reproduces many of the gene changes with CR.
For each human signature, we then constructed a ranked list of drugs based on the connectivity
scores.
3.4.2 Meta-analysis identified fourteen candidate CR mimetics
We combined the nine ranked lists of drugs into a single matrix, and identified drugs that were
consistently highly ranked across signatures using the Rank Product method [67] (see Materials
and Methods). At a false-discovery rate cut-off of 25% (corresponding to unadjusted p-values ≤
0.0026), we found that 14 drugs significantly mimic the CR response in mouse liver (Figure 3-1A).
While most signatures were consistent, i.e. gave high ranks to most of the significant drugs
(Figure 3-1A), two signatures stood out (the two leftmost columns); these correspond to the
earliest signature included in this study [160], and the signature derived from dwarf mice on a
CR diet [156]. These two relative outliers highlight the usefulness of meta-analyses that identify
consistent trends.
Many of the significant drugs show overlapping modes of action (MoA); 10 of 14 form a
connected component in the MoA drug-drug interaction network (downloaded from MANTRA;
[154]) where two drugs were joined by an edge if both drug treatments induced significantly
similar gene changes (Figure 3-1B). We also queried the network with the three best-known
longevity therapeutics (metformin, rapamycin, and resveratrol), and found that one of the drugs
identified in our screen has MoA similar to resveratrol, and two to rapamycin (Figure 3-1B).
Significant drugs (Figure 3-1A) are indicated for a wide variety of diseases. For example,
pioglitazone is a prescription drug used to treat type 2 diabetes [163]; colchicine is used to treat
gout [164]; MG-262 is a proteasome inhibitor with anti-inflammatory effects in the heart [165];
and Gly-His-Lys can activate wound repair [166]. Three of the drugs have been previously
linked to aging: the PI3K inhibitors wortmannin and LY-294002 and the anti-diabetes drug
pioglitazone increase lifespan in Drosophila [167, 168]. Other drugs, such as the Chembridge
compounds 5155877 and 5224221, are not yet well characterized in terms of their biological
effects.
To our knowledge, none of the significant drugs identified in this study has yet been evaluated as
a CR mimetic; they should be prioritized for further analyses and biological validation.
Figure 3-1. Fourteen drug treatments significantly mimic the effects of CR on hepatic gene
expression.
A. Significant drugs, and a heatmap showing the rank of each drug in each CR signature queried (top
ranks are shown in red). Source publications for signatures are shown on top and ordered from least to
most recent; drugs are shown on the right and ordered from most to least significant. B. Drug-drug
interaction network showing links between significant drugs from our screen (grey) and known longevity
drugs resveratrol and rapamycin (green). Two drugs are linked if they show similar modes of action
[154].
3.5 Discussion
While few drugs have been directly tested for their effect on mouse lifespan, over a thousand
drugs have been characterized in terms of their effects on gene expression, and these data are in
the public domain. We have applied this resource to identify fourteen drugs that have similar
transcriptional signatures to CR in mouse liver.
Several dozen other transcriptional signatures of CR are publicly available – mostly in mouse
and rat, but some in primates, including a few in humans. We plan to follow up our pilot study in
mouse liver by conducting in silico expression-based screening on the full set. Expression-based
screening can also be applied to other life-extending treatments for which microarray response
data are available, to see for example which compounds induce the gene changes seen in long-
lived Ames or Snell dwarf mice (versus wild-type).
Longevity drugs have great potential to help treat the diseases of aging, yet few such drugs are
known. Ours and similar approaches that leverage the large quantity of public data on drugs and
mammalian aging can accelerate the identification and development of new longevity
therapeutics.
3.6 Acknowledgments
This work was supported in part by Ontario Research Fund (GL2-01-030), Canadian Institutes of
Health Research (BIO-99745), the Canada Foundation for Innovation (CFI #12301 and
#203383), the Canada Research Chair Program, and IBM to IJ, and the Ontario Ministry of
4 NetwoRx: Connecting drugs to networks and
phenotypes in S. cerevisiae
This chapter is based on:
Kristen Fortney, Wing Xie, Max Kotlyar, Joshua Griesman, Yulia Kotseruba and Igor Jurisica
(2012). NetwoRx: Connecting drugs to networks and phenotypes in S. cerevisiae. Submitted.
4.1 Abstract
Motivation: Drug modes of action are complex and still poorly understood. The set of known
drug targets is widely acknowledged to be biased and incomplete, and so gives only limited
insight into the system-wide effects of drugs. But a high-throughput assay unique to yeast –
barcode-based chemogenomic screens – can measure the individual drug response of every yeast
deletion mutant in parallel.
Results: We integrated the three largest S. cerevisiae chemogenomic experiments, which
together comprise the responses of thousands of gene knockout strains to 466 drugs, and applied
data mining approaches to investigate drug effects at the system level. We identified yeast
pathways, functions, and phenotypes that are targeted by particular drugs, computed measures of
drug-drug similarity, and constructed drug-phenotype networks. We built the NetwoRx web
portal to make the results of these analyses freely available. NetwoRx also implements
automated analysis routines; users can query new gene groups against the entire collection of
drug profiles and NetwoRx will calculate which drugs target them. We demonstrate with
example use cases how NetwoRx can be applied to target specific phenotypes, repurpose drugs
using mode-of-action analysis, investigate bipartite networks, and predict new drugs that affect
yeast aging.
Availability: NetwoRx is freely available on the web at http://ophid.utoronto.ca/networx.
4.2 Introduction
The modes of action of many FDA-approved drugs remain poorly characterized: drugs have off-
target effects and these cause unanticipated side effects. In recent years, HTP experiments have
begun to provide crucial clues to the global cellular response to drugs. Computational
interrogation of these data has many important applications relevant to human health, including
target and side-effect prediction, drug repurposing, and mode-of-action analysis.
Chemogenomic barcode screens are particularly valuable HTP drug assays that are unique to
yeast – comparable data are not yet available for any mammalian model organism. These screens
report the change in colony growth in response to drug treatment for every one of the ~6000
deletion strains in the yeast deletion collection [169, 170]. Deletion strains in the yeast deletion
collection are each tagged with unique bar codes, permitting the growth response of every
deletion strain to be measured in parallel (bar codes are hybridized to microarrays). Previous
studies have demonstrated the relevance of these yeast data to human disease. For example, [171]
tested 81 psychoactive drugs in yeast and identified secondary drug targets that help explain side
effects in human patients, and [172] applied the screen to identify the molecular targets of
elesclomol, a promising chemotherapy adjuvant. These unique chemogenomic data can
complement other HTP measures of drug effects such as gene expression [75].
Bioinformatics analyses of individual chemogenomics datasets have provided valuable insights
into drug mode of action. These analyses have included the unsupervised clustering of drug
fitness profiles (growth responses) to identify groups of drugs that affect genes in the same way
[173, 174], and calculating gene co-fitness and using it to predict gene function [175].
Here we integrate the three largest chemogenomic experiments, covering several thousand yeast
genes and 466 drugs, and investigate drug effects at the systems level. We apply gene set
analysis to identify pathways and phenotypes targeted by drugs, compute drug-drug similarity
metrics for mode-of-action analysis, and build drug-phenotype networks. We applied our
methods to four gene set collections of high biological relevance: Gene Ontology categories [74],
KEGG pathways [143], SGD mutant phenotypes [176], and YEASTRACT targets of
transcription factors [177]. We make the full results of our analyses available through NetwoRx,
a web database linking drugs to networks and phenotypes. We also set up automated analysis
routines in NetwoRx; users can query new gene lists against the entire collection of drug profiles
and NetwoRx will retrieve the drugs that target them.
We demonstrate with example use cases how NetwoRx can be applied to (1) identify drugs that
modulate the oxidative stress response; (2) repurpose drugs for cancer by examining pathways
involved in DNA damage; (3) investigate the druggability of transcription factor targets with a
bipartite network; (4) cluster the drug-pathway network to identify drugs with shared modes of
action; and (5) predict new drugs that modulate yeast aging.
4.3 Materials and Methods
4.3.1 NetwoRx methods
4.3.1.1 Data sets
Chemogenomic data. Log-ratio data of control to drug treatment strain abundance, and P values
for individual drug-gene associations, were obtained from the three largest previously published
yeast chemogenomic studies [171, 173, 174]. The union of these datasets comprised 5924 genes
and 466 drugs. [173] and [171] used the diploid yeast deletion collection, including both
homozygous and heterozygous deletion strains [169, 170]. [174] used the haploid yeast deletion
collection [169]. NetwoRx treats the experimental data in the same manner as in the original
publications. The data of [173] are treated as two separate experiments, an experiment with
homozygous deletion strains (4742 distinct ORFs, 132 drugs) and an experiment with
heterozygous deletion strains (5272 ORFs, 318 drugs). The data of [171] are treated as a single
experiment with a mix of heterozygous and homozygous deletion strains (5200 ORFs, 81 drugs).
The data of [174] are treated as a single experiment with the haploid deletion collection (4111
ORFs, 82 drugs).
Gene sets. KEGG Pathways [143] and Gene Ontology categories [74] were obtained from the
Bioconductor 2.8 package org.Sc.sgd.db; mutant phenotypes were downloaded from SGD [176];
transcription factor targets were obtained from the YEASTRACT database [177].
4.3.1.2 Gene set scores
Gene-level scores. Scores for individual gene-drug relations were calculated as log strain
abundance ratios (of control to drug treatment); these data were downloaded from the individual
publications used in NetwoRx. If a gene was represented more than once in a dataset, for each
drug treatment we selected the gene’s largest score.
Gene set score (GSS). The statistic for a set of genes was calculated as the mean of the gene-
level scores for set genes, adjusted for set size (Figure 4-1). For a drug treatment with mean µ
and standard deviation σ, and a gene set P of size n, if S(P) was the average of the gene-level
scores si (for genes i in P), then the GSS was S’(P) = (S(P) - µ)/(σ/sqrt(n)). For each drug
treatment, we calculated scores for those gene sets where gene-level scores were available for at
least 5 and no more than 500 genes; other gene sets were assigned a value of NA.
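The GSS formula can be written out directly. The Python sketch below (the original analysis used R) follows the definition above, S'(P) = (S(P) - µ)/(σ/sqrt(n)), and returns None where the text assigns NA. It assumes that µ and σ are taken over all gene-level scores of the treatment and that σ is the population standard deviation, neither of which the text states explicitly; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def gene_set_score(treatment_scores, gene_set, min_size=5, max_size=500):
    """Size-adjusted gene set score for one drug treatment.

    treatment_scores: dict mapping gene -> gene-level score.
    Returns None (NA) when fewer than min_size or more than max_size genes
    of the set have a score.
    """
    values = list(treatment_scores.values())
    mu, sigma = mean(values), pstdev(values)  # assumption: population SD
    members = [treatment_scores[g] for g in set(gene_set) if g in treatment_scores]
    n = len(members)
    if n < min_size or n > max_size or sigma == 0:
        return None
    return (mean(members) - mu) / (sigma / sqrt(n))


# Toy treatment with ten genes scored 0..9.
toy_scores = {f"g{i}": float(i) for i in range(10)}
gss = gene_set_score(toy_scores, ["g5", "g6", "g7", "g8", "g9"])
```

Dividing by σ/sqrt(n) puts sets of different sizes on a common Z-score scale, so a large set with a modest mean shift can score as highly as a small set with a large shift.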
P-values. For a GSS corresponding to a given gene set and drug treatment, we calculated two P
values, P1 and P2. For P1, we computed the one-sided P-value corresponding to the Z-score
defined by the GSS. For P2, we considered the matrix of GSS values for all drug treatments in a
dataset and all gene sets of a given type (e.g. all GO categories); we calculated P2 as the fraction
of these values equal to or exceeding the GSS. The P value for the gene set under the drug
treatment was then reported as P = max(P1,P2). For user queries of new gene sets we report only
P1 (as there is no appropriate background gene set collection to use for P2).
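The two P values and their combination can be sketched in Python with the standard library. The sketch assumes the upper tail of the standard normal distribution for P1, consistent with P2 counting values equal to or exceeding the GSS; function names are hypothetical.

```python
from math import erfc, sqrt


def p1_from_z(z):
    """One-sided (upper-tail) P value of a standard normal Z-score."""
    return 0.5 * erfc(z / sqrt(2))


def p2_empirical(gss, background):
    """Fraction of background GSS values equal to or exceeding gss.

    background: GSS values for all drug treatments in a dataset and all gene
    sets of one type; None entries (NA) are ignored.
    """
    values = [v for v in background if v is not None]
    return sum(1 for v in values if v >= gss) / len(values)


def combined_p(gss, background):
    """Reported P value: the more conservative of P1 and P2."""
    return max(p1_from_z(gss), p2_empirical(gss, background))
```

Taking the maximum means that a gene set must be extreme both in absolute terms (P1) and relative to the other gene sets in the same collection (P2) to be reported as significant.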
4.3.1.3 Drug-drug similarity
For each chemogenomic dataset, we calculated two measures of drug-drug similarity for all pairs
of drugs, S1 and S2.
For S1, we took the matrix of gene-level scores (genes vs. drugs), eliminated columns or rows
where more than half of values were NA, and then calculated the Pearson correlation between all
pairs of columns (drugs). For drugs represented more than once in a data set, we merged
replicates by calculating average correlations. For each drug-drug similarity score, we calculated
its associated P value as the fraction of other drug-drug similarity scores equal to or exceeding it.
For S2, we repeated the same filtering, calculations, and merging on the matrix of GSS scores
(gene sets vs. drugs); all gene set types (GO, KEGG, YEASTRACT, SGD phenotype) were
included in the GSS matrix.
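A minimal Python sketch of the S1 calculation is shown below. It makes two assumptions not stated in the text: columns (drugs) are filtered before rows (genes), and each correlation uses only the genes for which both drugs have a value. Replicate merging and the P value step are omitted, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def mostly_present(values):
    """True unless more than half of the values are missing (None)."""
    return sum(v is None for v in values) <= len(values) / 2


def drug_similarity(profiles):
    """profiles: dict drug -> list of scores (None = NA), same gene order."""
    drugs = [d for d, col in profiles.items() if mostly_present(col)]
    n_genes = len(next(iter(profiles.values())))
    rows = [i for i in range(n_genes)
            if mostly_present([profiles[d][i] for d in drugs])]
    cols = {d: [profiles[d][i] for i in rows] for d in drugs}
    return {(a, b): pearson(cols[a], cols[b])
            for i, a in enumerate(drugs) for b in drugs[i + 1:]}


# Toy profiles over four genes; d4 is mostly missing and is filtered out.
toy_profiles = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.0],
    "d3": [4.0, 3.0, 2.0, 1.0],
    "d4": [None, None, None, 1.0],
}
similarity = drug_similarity(toy_profiles)
```

The same function applied to a matrix of GSS values (gene sets vs. drugs) would give the analogue of S2.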
4.3.1.4 Bipartite interaction networks
For a given gene set collection and chemogenomic dataset, a drug and a gene-set were
considered to interact if the GSS had an associated P value ≤ 0.05 for at least one treatment with
that drug. For each drug/gene-set association we report the lowest P value observed over all
treatments of the same drug.
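The edge rule can be sketched in a few lines of Python. Because an edge exists when at least one treatment reaches P ≤ 0.05 and the reported value is the lowest P over treatments, it suffices to track the minimum P per drug and gene-set pair. The record layout, drug names, and function name are hypothetical.

```python
def bipartite_edges(records, alpha=0.05):
    """Build drug/gene-set edges from per-treatment P values.

    records: iterable of (drug, gene_set, p_value), one per treatment;
    p_value may be None (NA). Returns a dict mapping (drug, gene_set) to the
    lowest P value observed, restricted to pairs with P <= alpha.
    """
    best = {}
    for drug, gene_set, p in records:
        if p is None:
            continue
        key = (drug, gene_set)
        best[key] = min(p, best.get(key, 1.0))
    return {key: p for key, p in best.items() if p <= alpha}


# Toy records: drugA has two treatments against the same gene set.
toy_records = [
    ("drugA", "GO:0006979", 0.20),
    ("drugA", "GO:0006979", 0.01),
    ("drugB", "GO:0006979", 0.30),
    ("drugB", "sce03410", None),
]
edges = bipartite_edges(toy_records)
```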
4.3.1.5 Code
Code for all analyses was written in R 2.13.0; we also used the Bioconductor 2.8 GSEABase and
org.Sc.sgd.db.
4.3.1.6 Database implementation
The NetwoRx portal was written in Java and runs on the WebSphere 6.1 application server on an
IBM P595 server with a secondary P595 backup server. The database runs on DB2 9.5 on an
IBM P570 server with a mirror running on P595 for redundancy and workload balancing.
Figure 4-1. Gene set analysis of chemogenomic data.
The score S of a pathway P is calculated as a function f of the gene-level scores s for genes in P.
4.3.2 Use-case methods
4.3.2.1 Data sets
Chronological aging. Three sets of genes that extend yeast chronological lifespan were obtained
from previously published genome-wide experiments [178-180].
Drugs that modulate aging. Drugs known to modulate aging in S. cerevisiae were downloaded
from the Lifespan Observation Database at http://lifespandb.sageweb.org/.
4.3.2.2 Code and software
Code for all analyses was written in R 2.13.0. We used the WGCNA R package for drug-drug
similarity network analysis [181]. Networks were visualized with NAViGaTOR 2.2.1
(http://ophid.utoronto.ca/navigator/).
4.4 NetwoRx content and functionality
Here we briefly describe basic database content and functionality.
4.4.1 Database contents
The NetwoRx web portal contains drug-response data calculated for 466 drugs and thousands of
S. cerevisiae genes. Drugs are linked to their PubChem Compound IDs [182], yeast genes to
their SGD entries [183], and gene sets to their relevant databases (GO, KEGG, YEASTRACT, or
SGD phenotype).
Drug-gene associations. P values for associations between drugs and individual genes. Link:
http://ophid.utoronto.ca/networx/singleid
Drug-pathway associations. P values and gene-set scores for associations between drugs and
KEGG pathways [143], GO categories, YEASTRACT targets of transcription factors [177], and
SGD mutant phenotypes [176]. Link: http://ophid.utoronto.ca/networx/drug2pathway
Drug-drug similarity metrics. Similarity values S1 and S2 (between -1 and 1) and associated P
values for all pairs of drugs, quantifying the extent to which drugs affect genes (S1) or pathways
(S2) in the same way. Link: http://ophid.utoronto.ca/networx/drug2drug
Drug-pathway networks. Bipartite networks of significant drug-pathway associations are
available as tab-delimited text files or NAViGaTOR 2.2.1 files for network visualization [184]. Link:
http://ophid.utoronto.ca/networx/drugnetworks
New pathway search. Users can specify a new set of genes, and NetwoRx will calculate which
drugs interact with it. Link: http://ophid.utoronto.ca/networx/newmodule
4.4.2 Accessing data
Search by drug. Users can search for drugs by their name or by their PubChem Compound ID
(e.g., rapamycin or 5284616).
Search by gene or list of genes. Users can search for yeast genes by their systematic names (e.g.,
YKL203C).
Search by gene set identifier. Users can search for gene sets by their set-specific identifiers (e.g.,
GO:0006979).
4.5 NetwoRx use case examples
Here we provide several NetwoRx use cases, using NetwoRx data alone (cases 1-4) or in
combination with data from other high-throughput experiments (case 5).
4.5.1 Retrieving drugs that perturb phenotypes: oxidative stress
Querying NetwoRx with gene sets related to oxidative stress, taken either from the Gene Ontology
(‘response to oxidative stress’, GO:0006979) or from SGD mutant phenotypes (‘oxidative stress
resistance’), returns drugs that perturb these pathways. The results include both compounds known
to cause oxidative stress (e.g., hydrogen peroxide, paraquat) and compounds known to protect from
it (e.g., allyl disulfide, rapamycin). Other significant drugs have not yet been tested for their
impact on oxidative stress (Figure 4-2).
Figure 4-2. Drugs that perturb oxidative stress pathways.
Drugs are shown in order of increasing P value; some drugs (green) are known to ameliorate the effects
of oxidative stress while other drugs (red) induce it. Data set: homozygous collection of [173].
4.5.2 Focused searches identify drugs with shared mode of action: drugs that
target the same DNA damage pathways as Cisplatin
Querying NetwoRx with the chemotherapeutic agent Cisplatin (CID 441203) to identify its mode
of action returns four significant KEGG pathways related to DNA damage: base excision repair
(sce03410), nucleotide excision repair (sce03420), DNA mismatch repair (sce03430), and
homologous recombination (sce03440). Querying NetwoRx with these four DNA damage
pathways and extracting the drug-pathway network reveals that many significant drugs are
known cancer drugs that are connected to multiple pathways (Figure 4-3). Other significant
drugs have not yet been tested for cancer and should be prioritized for further study.
Figure 4-3. Mode of action analysis of the chemotherapeutic cisplatin.
Node size is proportional to degree. Known cancer drugs are indicated in green. Data set: homozygous
collection of [173].
4.5.3 Bipartite networks reveal that some gene sets are druggable hubs
NetwoRx users can choose to download the entire collection of significant drug–pathway
connections for a given gene set type, either as a tab-delimited text file or as a graph that can be
visualized in NAViGaTOR 2.2.1 [184]. Downloading the entire set of associations for
YEASTRACT transcription factors reveals that while most targets of TFs are affected by only a
few drugs, some (e.g., GCR1, IFH1) are perturbed by very many (Figure 4-4).
Figure 4-4. Bipartite network showing all connections between drugs and YEASTRACT
targets of transcription factors.
Node size is proportional to degree. Data set: [174]. We highlight the high degree nodes and their
connectivity.
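As a rough illustration of how druggable hubs could be read off a downloaded association file, the sketch below counts the degree of each gene set in a bipartite drug/gene-set edge list. The column names (`drug`, `gene_set`) and the example rows are hypothetical; the actual layout of the NetwoRx tab-delimited export may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-delimited export: one significant drug/gene-set pair per row.
# Column names and rows are invented for illustration.
EXAMPLE = (
    "drug\tgene_set\n"
    "drugA\tGCR1\n"
    "drugB\tGCR1\n"
    "drugC\tGCR1\n"
    "drugA\tIFH1\n"
    "drugB\tYAP1\n"
)

def gene_set_degrees(handle):
    """Count how many distinct drugs are connected to each gene set."""
    pairs = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        pairs.add((row["drug"], row["gene_set"]))
    return Counter(gene_set for _, gene_set in pairs)

degrees = gene_set_degrees(io.StringIO(EXAMPLE))
for gene_set, degree in degrees.most_common():
    print(gene_set, degree)
```

Gene sets at the top of this listing correspond to the large nodes in a degree-scaled network view.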
4.5.4 Clustering the drug-pathway matrix identifies drug modules that share
modes of action
NetwoRx provides measures of drug-drug similarity that quantify the extent to which pairs of
drugs impact pathways in the same way. NetwoRx users can search these data by drug name or
download them in bulk. We downloaded the entire matrix of drug-drug similarities from
NetwoRx for the heterozygous experiments of [173]. We then used the R package WGCNA
[181] to cluster drugs into modules sharing mode-of-action. These modules can be applied for
drug repurposing. For example, one module was highly enriched for psychoactive drugs (Figure
4-5). Five of the six drugs in the module are used as sedatives and antipsychotics. The last drug,
hexestrol, is a synthetic estrogen that NetwoRx predicts to be psychoactive.
Figure 4-5. Drug module identified by clustering the matrix of drug-drug similarity scores.
5 of 6 drugs in this module are known to be psychoactive (indicated in bold). Data set: heterozygous
collection of [173].
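The idea of grouping drugs into modules from a similarity matrix can be sketched in a few lines of Python. This is deliberately not WGCNA: as a simplified stand-in, it links drug pairs whose similarity is at least a chosen threshold and reports the connected components as modules. The threshold, drug names, and similarity values are all invented for illustration.

```python
def similarity_modules(similarity, threshold):
    """Group drugs into connected components of the graph whose edges are
    drug pairs with similarity >= threshold (a stand-in for WGCNA modules)."""
    drugs, neighbours = set(), {}
    for (a, b), s in similarity.items():
        drugs.update((a, b))
        if s >= threshold:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    modules, visited = [], set()
    for start in sorted(drugs):
        if start in visited:
            continue
        stack, module = [start], set()
        while stack:  # depth-first search from an unvisited drug
            drug = stack.pop()
            if drug in module:
                continue
            module.add(drug)
            stack.extend(neighbours.get(drug, ()))
        visited |= module
        modules.append(module)
    return modules

# Invented pairwise similarities in [-1, 1].
sim = {("d1", "d2"): 0.9, ("d2", "d3"): 0.8, ("d1", "d3"): 0.1,
       ("d3", "d4"): -0.2, ("d4", "d5"): 0.7}
print(similarity_modules(sim, threshold=0.6))
```

Single-linkage components of this kind are much cruder than WGCNA's topological-overlap clustering, but they show how a drug of unknown activity can end up in a module dominated by drugs with a shared known activity.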
4.5.5 User-defined gene sets: identifying new drugs that modulate yeast
chronological aging
NetwoRx can perform gene set analysis of new gene sets specified by the user. Here we apply
this functionality to identify new drugs that may modulate yeast aging. Three previous studies
have conducted genome-wide assays in yeast to identify gene deletions that lead to increased
survival in prolonged stationary phase [178-180]. We obtained sets of longevity genes from each
publication (42, 57, and 90 genes, respectively). Notably, the overlap among these sets was very
poor (Figure 4-6, bottom right). There were 3 genes common to [179] and [180], and one gene
common to [178] and [180]. No gene was common to all three studies; furthermore, no gene was
common to the two most recent studies, despite the fact that they shared a very similar
experimental methodology.
Querying these gene sets against all datasets in the NetwoRx collection revealed that these gene
sets share many targeting drugs. In total, 125 drugs target at least one gene set, 29 target at least
two sets, and 8 target all three. We downloaded the set of drugs previously shown to extend yeast
chronological lifespan from the Lifespan Observation Database at http://lifespandb.sageweb.org/.
Three of these drugs (rapamycin, caffeine, and sodium chloride) are included in the NetwoRx
collection, and our analysis identified all three as significantly associated with one or more aging
gene sets (Figure 4-6, green nodes). Rapamycin, a well-known anti-aging drug that has been
shown to extend lifespan in multiple species [185], targets all three gene sets; NaCl targets two
gene sets; caffeine targets one. Other drugs in the network have been reported to extend life in
other species, e.g., curcumin and wortmannin extend life in Drosophila [167, 186].
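Counts of the form "at least one, at least two, all three" can be derived from the per-gene-set hit lists as sketched below. The set names, drug names, and memberships are placeholders, not NetwoRx results.

```python
from collections import Counter

def coverage_counts(hits_by_set):
    """Map k -> number of drugs significantly associated with at least k gene sets."""
    per_drug = Counter(drug for hits in hits_by_set.values() for drug in set(hits))
    n_sets = len(hits_by_set)
    return {k: sum(1 for count in per_drug.values() if count >= k)
            for k in range(1, n_sets + 1)}

# Placeholder hit lists for three gene sets (invented memberships).
hits_by_set = {
    "set1": {"drug1", "drug2", "drug3"},
    "set2": {"drug1", "drug2"},
    "set3": {"drug1", "drug4"},
}
print(coverage_counts(hits_by_set))
```

The same per-drug counts give the node degrees used to size drug nodes in a drug/gene-set network.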
Other NetwoRx functionalities can be applied to narrow down a list of interesting candidates
from the set of 125 significant drugs. For example, we retrieved from NetwoRx a list of the top
10 drugs most similar in terms of their pathway-based mode of action to the anti-aging drug
rapamycin, from the heterozygous experiment of [173]. Six of these ten drugs also target at least
one aging gene set, i.e. are represented in the aging-drug network (Figure 4-6): allyl disulfide,
allyl sulfide, CDL 14A, CDL 3F2, CID 688028, and CID 697443.
Figure 4-6. Drugs predicted by NetwoRx to modulate yeast chronological lifespan.
Drugs known to increase yeast lifespan are indicated in green. Node size is proportional to degree, and
edge width is proportional to the statistical significance of the drug/gene-set connection (for all
connections P <= 0.05). Diagram at bottom right indicates the overlap between the genes identified as
significant in each aging study. Data set: union of all 3.
4.6 Discussion
The NetwoRx web portal brings together data from the major S. cerevisiae barcode
chemogenomics experiments and facilitates their systems-level analysis. These unique
chemogenomic data can help shed light on the genome-wide effects of drug treatment,
accelerating the identification and development of new therapeutics.
As with any assay, yeast barcode chemogenomic screens have several limitations; these have
been discussed elsewhere, e.g., [187, 188]. Importantly, these screens can be used only with
those compounds bioactive in yeast; can capture only those drug-gene interactions that affect
growth; and are relevant to disease only for those human proteins that have yeast homologs.
Integrative computational methods that mine chemogenomic data are fast and cheap, and can
complement traditional methods of drug screening. We illustrated with examples how NetwoRx
can be applied to analyze mode-of-action of cancer drugs, repurpose psychoactive drugs, and
predict new drugs that modulate yeast aging.
In the future, developments in RNAi technology will allow experiments of comparable
throughput to be conducted in mammalian cell lines, and we will expand NetwoRx to include
these data. Pooled shRNA screens have already helped elucidate the mode-of-action of
individual cancer drugs and show enormous promise for speeding drug development [189, 190].
4.7 Acknowledgements
We thank Marc Angeli and Abraham Heifets for their helpful comments on the manuscript.
Funding: This work was supported in part by the Ontario Research Fund (GL2-01-030 and
RE-03-020), the Canadian Institutes of Health Research (BIO-99745), the Canada Foundation for
Innovation (CFI #12301 and #203383), the Canada Research Chair Program, and IBM to IJ, and by the
Ontario Ministry of Health and Long Term Care. The views expressed do not necessarily reflect those of
the OMOHLTC.
5 Computationally repurposing drugs for lung
cancer with CMapBatch: candidate therapeutics
from an integrative meta-analysis of cancer gene
signatures and chemogenomic data
This chapter is based on:
Kristen Fortney, Joshua Griesman, Max Kotlyar, and Igor Jurisica (2012). Computationally
repurposing drugs for lung cancer with CMapBatch: candidate therapeutics from an integrative
meta-analysis of cancer gene signatures and chemogenomic data. In Preparation.
5.1 Abstract
Background
Using gene signatures to computationally repurpose FDA-approved drugs can accelerate the
development of new therapeutics. Though existing methods for signature-based repurposing are
based on the analysis of individual signatures, for many diseases dozens of gene signatures are in
the public domain. We develop CMapBatch to exploit these data. CMapBatch is a
computational meta-analysis pipeline that takes as input a collection of gene signatures of
disease and outputs a list of drugs predicted to consistently reverse pathological gene changes.
We apply CMapBatch to identify new therapeutics for lung cancer.
Results
We applied CMapBatch to a collection of 21 gene expression signatures of lung cancer. We
demonstrate that, while drug candidates identified by CMap analysis of individual gene
signatures (http://www.broadinstitute.org/cmap/) are highly variable, CMapBatch returns very
stable sets of top drug candidates. Our meta-analysis of all 21 signatures revealed that 247 drugs
consistently reversed lung cancer gene changes. In silico validation on the NCI-60 collection
showed that drug candidates significantly inhibit growth in nine lung cancer cell lines (of nine
tested). Common protein targets of drug candidates included CALM1 and PLA2G4A. We
characterized these drugs’ chemical properties and drug-target network, and applied multiple
criteria to rank them in terms of therapeutic promise.
Conclusions
CMapBatch can improve signature-based drug repurposing by leveraging the large number of
disease signatures; we have made this method publicly available at
http://ophid.utoronto.ca/cmapbatch. We applied CMapBatch to identify a prioritized list of new
candidate drugs for lung cancer.
5.2 Background
Lung cancer accounts for the largest number of cancer-related deaths, and the 5-year survival
rate (across all stages) is only 16% [191]; there is an urgent need for new therapeutics to help
treat it. Over the past two decades, the application of HTP technologies has led to the rapid
accumulation of comprehensive and diverse public datasets cataloguing genome-wide molecular
alterations seen with lung cancer or with drug administration. Integrative computational methods
that mine these data are fast and cheap, and can complement traditional methods of drug screening;
complementary information in these distinct resources can be leveraged to develop
comprehensive in silico screens for novel cancer therapeutics [192].
One such resource, the Connectivity Map (CMap), which is the focus of our analyses, catalogues
the transcriptional responses to drug treatment in human cell lines for over a thousand small
molecules [75]. CMap has been successfully applied to identify novel therapeutics for a diverse
set of indications including various cancers (e.g., [76, 77]), and most recently osteoarthritic pain
[193] and muscle atrophy [194].
CMap was applied in two earlier studies to identify novel therapeutics for lung cancer. Wang et
al. [195] combined two microarray data sets to create a single transcriptional signature of lung
adenocarcinoma and screened it against CMap. They tested one of their drug hits (17-AAG) in
vitro and found that it inhibited growth in two lung adenocarcinoma cell lines. Ebi et al. [196]
constructed a transcriptional signature of survival in patients with lung adenocarcinoma; CMap
analysis identified several drugs that might improve outcome. The authors experimentally
confirmed the growth inhibitory activity of several drug hits, including rapamycin, LY-294002,
prochlorperazine, and resveratrol.
Nearly every previous analysis using Connectivity Map data to link drugs to diseases has done so
with the CMap online tool (http://broadinstitute.org/cmap/). The CMap tool takes as input a set
of up-regulated probe sets and a set of down-regulated probe sets, and returns a list of drugs that
revert or mimic those gene expression changes. However, for most diseases, not one but many
– often dozens – of distinct gene signatures are available. For example, the cancer-specific
database Oncomine (version 4.4) stores mRNA data from 566 different studies [65]. As the
CMap tool only deals with one gene signature at a time, the question of how best to take
advantage of the information in a large collection of disease signatures remains an important
open problem.
A few studies have used multiple disease signatures in CMap analysis, e.g., [194, 195], although
with one exception [197] they used only two or three signatures per disease. All of them relied
on essentially the same strategy: collapsing the disease signatures into a single meta-signature
(e.g., by intersecting lists of significant genes from different studies, as in [194]) and
querying the CMap data with this signature. Since each of the individual disease signatures was
constructed using dozens or even hundreds of microarrays, there is fairly strong evidence for
every gene in each signature. In comparison, the drug response data in CMap are noisy: each of
the 1,309 drugs has been tested a median of only 4 times (4 treatment microarrays). This noise has
consequences: previous work has shown that even small changes in the input gene signature can
lead to large changes in the list of drugs identified as significant by CMap analysis (with the
sscMap program) [198, 199].
Here we propose an alternative strategy for connecting a set of disease gene signatures to drugs,
CMapBatch. Rather than collapsing all the gene signatures in the set into a single gene signature,
we propose to screen each disease signature separately against CMap to produce a set of ranked
lists of drug candidates. Next, we apply a rank-based meta-analysis method to identify which
drugs are consistently ranked as the best candidates across all disease signatures. That is, we perform
the meta-analysis at a later step: our method combines lists of drugs rather than lists of genes.
We show that this strategy returns more stable sets of top drug candidates.
Next, we applied CMapBatch to lung cancer. We used three steps to identify and prioritize new
lung cancer therapeutics. First, we conducted a meta-analysis using CMapBatch to identify drugs
that reverse the transcriptional changes seen with lung cancer across 21 gene signatures. We
found that 247 CMap drugs consistently counter the gene changes that occur with lung cancer.
Second, we performed in silico validation of drug candidates with the NCI-60 growth inhibition
data. We found that drug candidates identified by CMapBatch were significantly more likely to
slow growth in nine lung cancer cell lines than other CMap drugs (P < 0.01). Third, we
implemented data integration for drug prioritization. We identified common protein targets of
significant drugs, and used chemical structure similarity and drug-target relationships to
prioritize candidate therapeutics.
5.3 Results and discussion
5.3.1 CMapBatch meta-analysis strategy: From individual cancer gene signatures
to candidate therapeutics
Our CMapBatch meta-analysis pipeline comprises the following steps (Figure 5-1): For each
individual lung cancer signature (tumour vs. normal comparison), we calculate mean
connectivity scores for 1309 small molecules (as previously described [75]). Connectivity scores
range between -1 and 1; a large, negative mean connectivity score indicates that drug treatment
reverses many of the gene changes with lung cancer. We use the mean connectivity score to
construct a ranked list of drugs for each signature. We combine the ranked lists of drugs into a
single matrix, and identify drugs that were consistently highly ranked across signatures using the
Rank Product method [67] (see Materials and Methods).
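The ranking and combination steps described above can be sketched as follows. This is a minimal illustration, not the CMapBatch implementation: the connectivity scores are invented, the significance estimate uses a simple null model in which each rank is drawn uniformly at random, and the FDR correction is omitted. All function names are hypothetical.

```python
import math
import random

def ranks_from_scores(scores):
    """Rank drugs by mean connectivity score; rank 1 is the most negative score."""
    ordered = sorted(scores, key=scores.get)
    return {drug: position + 1 for position, drug in enumerate(ordered)}

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def rank_product(rank_lists):
    """Geometric mean of each drug's rank across all signatures."""
    return {drug: geometric_mean([ranks[drug] for ranks in rank_lists])
            for drug in rank_lists[0]}

def permutation_p_values(rank_lists, n_perm=2000, seed=0):
    """Estimate how often random ranks give a rank product as small as observed."""
    rng = random.Random(seed)
    n_drugs, n_lists = len(rank_lists[0]), len(rank_lists)
    # Null model (an assumption): a drug's rank in each list is uniform on 1..n_drugs.
    null = [geometric_mean([rng.randint(1, n_drugs) for _ in range(n_lists)])
            for _ in range(n_perm)]
    observed = rank_product(rank_lists)
    return {drug: (1 + sum(value <= rp for value in null)) / (1 + n_perm)
            for drug, rp in observed.items()}

# Invented mean connectivity scores for four drugs in three signatures.
signature_scores = [
    {"drugA": -0.9, "drugB": -0.2, "drugC": 0.1, "drugD": 0.8},
    {"drugA": -0.7, "drugB": 0.3, "drugC": -0.1, "drugD": 0.6},
    {"drugA": -0.8, "drugB": -0.4, "drugC": 0.2, "drugD": 0.5},
]
rank_lists = [ranks_from_scores(s) for s in signature_scores]
rp = rank_product(rank_lists)
p_values = permutation_p_values(rank_lists)
print(sorted(rp, key=rp.get))
```

A drug that is ranked first in every signature has a rank product of 1, the smallest possible value, so consistently top-ranked drugs receive the smallest estimated P values.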
Our analyses are based on 21 previously published gene expression signatures of lung cancer
obtained from Oncomine [65] and CDIP, the Cancer Data Integration Portal
(http://ophid.utoronto.ca/cdip/). The samples used to derive each signature have diverse
histologies, and mRNA levels were measured on various commercial platforms (Table 5-S1).
Figure 5-1. CMapBatch meta-analysis pipeline.
Given a set of disease signatures, CMapBatch calculates mean connectivity scores for 1,309 drugs and
converts them to ranks. Next, CMapBatch applies the Rank Product method to identify drugs that are
consistently highly ranked across signatures. On a set of 21 transcriptional signatures of lung cancer, we
identified 247 drugs that significantly reverse these pathological gene expression changes (at FDR < 1%).
5.3.2 Candidate drugs identified via CMapBatch are more conserved across
signature subsets than candidate drugs identified from single gene signatures
Previous work has shown that CMap analysis of different gene signatures for the same disease
can return very different lists of drug candidates [199]. This is undesirable, if perhaps
unsurprising, as gene signatures themselves can be highly variable [192]. Consistent with
previous work, when we retrieved lists of the top 50 drugs for each of the 21 different gene
signatures of lung cancer (using the CMap online tool), overlap was poor. The median number of
drug candidates present in top 50 drug candidate lists from two different signatures was only 22
(Figure 5-2 in blue). Repeating the same test using lung cancer signatures of the same type – 10
adenocarcinoma signatures – did not lead to much improvement. For adenocarcinoma, the
median number of drugs identified by two signatures was 26 (Figure 5-2 in gray), but the
difference is not statistically significant.
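The pairwise comparison of top drug lists reduces to a short calculation, sketched here in Python with placeholder lists. With 21 lists there are 21 × 20 / 2 = 210 pairs to compare; the function name and example data are illustrative only.

```python
from itertools import combinations
from statistics import median

def median_pairwise_overlap(top_lists):
    """Median size of the intersection over all pairs of top-drug lists."""
    return median(len(set(a) & set(b)) for a, b in combinations(top_lists, 2))

# Three placeholder "top 3" lists (invented drug names).
top_lists = [["a", "b", "c"], ["b", "c", "d"], ["c", "d", "e"]]
print(median_pairwise_overlap(top_lists))
```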
Next, we sought to determine whether using a large set of signatures with CMapBatch would
lead to a more stable list of top drug candidates. For this test, we randomly assigned the 21 lung
cancer gene signatures to two groups, one with 10 and the other with 11 signatures. We ran
CMapBatch separately on the two disjoint sets of signatures, and compared lists of the top 50
drugs identified for each set. We repeated this test 100 times. We found that CMapBatch
consistently identifies the same drugs as combatting lung cancer, even when it is trained on
completely different sets of lung cancer signatures. A median of 39 drugs were common to both
lists of top 50 drugs identified from two disjoint sets of signatures (Figure
5-2 in green), significantly more than are found with individual gene signatures (Wilcoxon test P
<< 0.01).
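The split-half stability test can be sketched as below. As a simplifying assumption, the two groups of signatures are combined by mean rank rather than by the Rank Product method, and the input is a list of per-signature rank dictionaries; the function names and toy data are hypothetical.

```python
import random
from statistics import mean, median

def combine_by_mean_rank(rank_lists):
    """Stand-in combiner: average each drug's rank over a group of signatures."""
    return {drug: mean(ranks[drug] for ranks in rank_lists) for drug in rank_lists[0]}

def top_drugs(combined, n_top):
    """The n_top drugs with the smallest combined rank."""
    return set(sorted(combined, key=combined.get)[:n_top])

def split_half_overlap(rank_lists, n_top=50, n_repeats=100, seed=0,
                       combine=combine_by_mean_rank):
    """Median overlap of top-drug lists from two disjoint random halves."""
    rng = random.Random(seed)
    overlaps = []
    for _ in range(n_repeats):
        shuffled = list(rank_lists)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2  # 21 signatures give groups of 10 and 11
        first = top_drugs(combine(shuffled[:half]), n_top)
        second = top_drugs(combine(shuffled[half:]), n_top)
        overlaps.append(len(first & second))
    return median(overlaps)

# Toy input: six identical rankings of ten drugs, so the halves always agree.
toy_rank_lists = [{f"d{i}": i + 1 for i in range(10)} for _ in range(6)]
print(split_half_overlap(toy_rank_lists, n_top=3, n_repeats=5))
```

Any other combiner with the same input and output shape, such as a rank product, can be supplied through the `combine` argument.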
Figure 5-2. CMapBatch produces more stable lists of significant drugs than individual gene
signatures.
Shown are boxplots of the number of conserved drug candidates when any two lists of top 50 drug
candidates are intersected. Green: 21 gene signatures were split into two disjoint sets of 10 and 11
signatures, CMapBatch was run on both sets, and top drugs from each set were compared; this
experiment was repeated 100 times. Blue: 21 gene signatures were used to retrieve 21 lists of drugs
with the CMap online tool; top drugs from all pairs of signatures were compared. Grey: 10 gene
signatures of the same lung cancer type (adenocarcinoma) were used to retrieve 10 lists of drugs with
the CMap online tool; top drugs from all pairs of signatures were compared. CMapBatch results showed
a significantly higher median overlap (Wilcoxon test P << 0.01).
5.3.3 Characterizing and prioritizing candidate lung cancer therapeutics
For the remainder of this paper, we focus on characterizing and prioritizing the full set of
significant drugs identified by CMapBatch using all 21 gene signatures of lung cancer.
CMapBatch meta-analysis identified 247 candidate lung cancer therapeutics. At an FDR cut-off
of 0.01, we find that 247 drugs (out of 1,309 drugs in CMap Build 2) significantly reverse the
gene expression changes seen with lung cancer in the full set of 21 lung cancer signatures. This
is a large number of drugs, but in line with previous results obtained using similar data; e.g., a
recent paper examining disease-drug relationships using the 164 drugs tested in CMap Build 1
linked 72 of them to adenocarcinoma of the lung, and 67 to squamous cell carcinoma of the lung
[197].
5.3.4 Candidate therapeutics inhibit growth in nine lung cancer cell lines
As an independent validation of our results, we used growth inhibition data from the NCI-60
collection [200] to determine whether the drug candidates we identified are better at slowing
growth in lung cancer cell lines. For all our NCI-60 analyses we used the nine lung cancer cell
lines in which over 100 Connectivity Map drugs were tested (see Methods).
Significant drugs are more effective at inhibiting growth than other Connectivity Map
drugs. In all nine cell lines, drugs that CMapBatch identifies as reversing the transcriptional
changes seen with lung cancer are significantly better (Wilcoxon test P < 0.01) than other CMap
drugs at inhibiting growth (Figure 5-3).
Figure 5-3. Drug candidates inhibit growth in lung cancer cell lines more than other
Connectivity Map drugs.
We tested whether the drugs that we identified as significantly reversing the gene changes seen with
lung cancer were better at inhibiting growth using NCI-60 GI50 data in 9 lung cancer cell lines. In every
cell line, significant drugs are better than other Connectivity Map drugs at inhibiting growth (Wilcoxon test
P < 0.01).
23 significant drugs inhibit growth in a majority of lung cancer cell lines. For each of the nine
cell lines, and using data from every drug tested on that line, we define the threshold for
sensitivity to a drug to be the top 20% of the –logGI50 values; i.e., we say that the cell line is
sensitive to those drugs with –logGI50 values in the top 20%. By this definition, of all the NCI-60
drugs that have been tested in 5-9 lung cancer cell lines, 7,794 of 44,802, or 17%, inhibit growth
in 5 or more cell lines. Of the significant drugs tested, 23/41, or 56% inhibit growth in 5 or more
lung cancer cell lines (Figure 5-4, left).
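The sensitivity definition above can be made concrete with a short sketch. The –logGI50 values and names are invented, ties at the cut-off are broken arbitrarily (an assumption), and the restriction to drugs tested in 5-9 cell lines is left out for brevity.

```python
def sensitive_drugs(neg_log_gi50, top_percent=20):
    """Drugs whose -logGI50 falls in the top `top_percent` for one cell line."""
    n_keep = len(neg_log_gi50) * top_percent // 100
    ranked = sorted(neg_log_gi50, key=neg_log_gi50.get, reverse=True)
    return set(ranked[:n_keep])

def broadly_active(per_line, min_lines=5, top_percent=20):
    """Drugs to which at least `min_lines` cell lines are sensitive."""
    counts = {}
    for values in per_line.values():
        for drug in sensitive_drugs(values, top_percent):
            counts[drug] = counts.get(drug, 0) + 1
    return {drug for drug, count in counts.items() if count >= min_lines}

# Invented -logGI50 values for five drugs in three cell lines.
per_line = {
    "line1": {"drugA": 8.0, "drugB": 5.0, "drugC": 4.5, "drugD": 4.0, "drugE": 3.9},
    "line2": {"drugA": 7.5, "drugB": 6.0, "drugC": 4.2, "drugD": 4.1, "drugE": 4.0},
    "line3": {"drugA": 5.0, "drugB": 6.5, "drugC": 4.0, "drugD": 3.8, "drugE": 3.7},
}
print(broadly_active(per_line, min_lines=2))
```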
Among these 23 are several that are already in use to treat cancer. For example, daunorubicin
and the chemically related doxorubicin are topoisomerase inhibitors and commonly-used
chemotherapeutic agents; sirolimus (rapamycin) is currently in clinical trials for several cancers
and was recently shown to increase NSCLC tumour cell sensitivity to erlotinib [201]; vorinostat,
a histone deacetylase inhibitor, enhanced the response to carboplatin or paclitaxel in patients
with advanced NSCLC [202]; MS-275, also a histone deacetylase inhibitor, enhanced the
response to erlotinib in an erlotinib-resistant lung adenocarcinoma cell line [203].
Others of the 23 have not yet been investigated as cancer therapeutics (i.e., there are fewer than
20 PubMed abstracts linking the drug to any type of cancer) and should be prioritized for further
biological validation. For example, spiperone and pimozide are antipsychotics. Recently,
pimozide was shown to reduce the viability of several cancer cell lines while sparing normal
cells [22].
We call this set of 23 drugs, which transcriptionally reverse lung cancer gene changes and slow
growth in lung cancer cell lines, the TOP drugs; in subsequent sections, we prioritize significant
drugs that have not been tested in NCI-60 by linking them to TOP drugs using a variety of
metrics.
Figure 5-4. Prioritizing drug candidates with GI50 values and chemical structures.
Twenty-three of the significant drugs inhibit growth in a majority of lung cancer cell lines (left). A further
11 significant drugs not tested in NCI-60 are highly structurally similar (Tanimoto similarity >= 0.8) to
one or more of these twenty-three (right).
5.3.5 Prioritizing drugs by structural similarity: eleven significant drugs are highly
structurally similar to TOP drugs
The Tanimoto coefficient quantifies the chemical structure similarity between two molecules
[204]; here, we call two molecules structurally similar if this number exceeds 0.8. We found that
eleven drugs that reverse the transcriptional changes observed in lung cancer were structurally
similar to one or more drugs in TOP (Figure 5-4, right). These drugs were not evaluated as part
of the NCI-60 project; furthermore, 9 of 11 appear in fewer than 20 PubMed abstracts concerned
with cancer. These are novel candidate anticancer therapeutics identified by our computational screen.
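For reference, the Tanimoto coefficient of two molecules represented as sets of structural features is the size of the intersection divided by the size of the union. The sketch below uses small integer sets as stand-ins for fingerprint bits; real fingerprints come from a cheminformatics tool, and returning 0.0 for two empty fingerprints is a convention chosen here, not a standard.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient: shared features divided by total distinct features."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for two empty fingerprints
    return len(a & b) / len(union)

# Invented fingerprints given as sets of "on" bit positions.
print(tanimoto({1, 2, 3, 4, 5}, {1, 2, 3, 4}))
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```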
5.3.6 Prioritizing drugs by shared target: thirty-eight significant drugs share a
protein target with one or more TOP drugs
We used drug-target data from DrugBank [205] and ChemBank [206] (as provided in MANTRA
[154]) to construct a drug-drug interaction network on the set of CMap drugs; two drugs are
linked by an edge if they share one or more protein targets (Figure 5-5). In total, 83 of the
significant drugs were present in this network (the protein targets of many drugs are still
unknown), including 9 TOP drugs. Thirty-eight significant drugs that were not tested in the NCI-
60 collection share one or more protein targets with a TOP drug (Figure 5-5A, purple and green
nodes), indicating they may have a similar mode of action and may inhibit growth in lung cancer
cell lines.
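The network construction rule (an edge between two drugs that share at least one protein target) can be sketched as follows. The target lists below are deliberately partial: they contain only the CALM1 and PLA2G4A relationships reported in this chapter, and the function name is hypothetical.

```python
from itertools import combinations

def shared_target_edges(targets_by_drug):
    """Return {(drug_a, drug_b): shared targets} for every pair sharing a target."""
    edges = {}
    for a, b in combinations(sorted(targets_by_drug), 2):
        shared = set(targets_by_drug[a]) & set(targets_by_drug[b])
        if shared:
            edges[(a, b)] = shared
    return edges

# Partial target lists, restricted to relationships mentioned in this chapter.
targets_by_drug = {
    "pimozide": {"CALM1"},
    "fluphenazine": {"CALM1"},
    "flunisolide": {"PLA2G4A"},
    "medrysone": {"PLA2G4A"},
}
for pair, shared in shared_target_edges(targets_by_drug).items():
    print(pair, sorted(shared))
```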
Seven of these 38 drugs were also found to be structurally similar to TOP drugs (Figure 5-5A,
green nodes): prochlorperazine, promazine, trifluoperazine, fluspirilene, phenindione,
vidarabine, and chlorpromazine. As these drugs are linked to TOP drugs by two separate lines of
evidence, they are promising candidates for further biological validation.
Figure 5-5. Significant drugs share many protein targets.
A. In the drug-target network for drug candidates, two drugs are connected by an edge if they have the
same protein target. Shown in colour are the drugs that slow growth in 5 or more lung cancer cell lines
(blue), their immediate neighbours (purple), and the drugs that are structurally similar to them (green).
Green edges indicate drug pairs that, in addition to sharing a protein target, were also found to be highly
structurally similar (see Figure 5-4). B. 83 significant drugs are represented in the drug-target network,
and the largest connected component contains 72 drugs. 10,000 random draws of 83 drugs from the
drug-target network resulted in smaller connected components (median size 42 drugs; P << 0.01).
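The random-draw comparison in panel B can be sketched as an empirical test on the largest connected component. The toy network, function names, and number of draws below are illustrative assumptions; the estimate adds one to numerator and denominator so that it is never exactly zero.

```python
import random

def largest_component_size(nodes, edges):
    """Size of the largest connected component of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)
    best, seen = 0, set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                stack.append(neighbour)
        best = max(best, size)
    return best

def empirical_p(all_nodes, edges, subset, n_draws=10000, seed=0):
    """Fraction of equally sized random node sets whose largest component
    is at least as large as that of `subset`."""
    rng = random.Random(seed)
    observed = largest_component_size(subset, edges)
    pool = sorted(all_nodes)
    hits = sum(largest_component_size(rng.sample(pool, len(subset)), edges) >= observed
               for _ in range(n_draws))
    return observed, (hits + 1) / (n_draws + 1)

# Toy network of six drugs (invented).
toy_nodes = {"a", "b", "c", "d", "e", "f"}
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(empirical_p(toy_nodes, toy_edges, {"a", "b", "c"}, n_draws=200))
```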
5.3.7 Common protein targets of significant drugs
The largest connected component in the drug-target interaction network comprised 72 drugs,
which is significantly larger (P << 0.01) than what would be expected by chance; random sets of
83 drugs in the drug-drug network yield largest connected components with a median size of
only 42 drugs (Figure 5-5B). This indicates that some gene targets are overrepresented among
significant drugs; these genes may be valuable drug targets for lung cancer. We applied the
hypergeometric test to each gene target of a significant drug and identified twelve over-represented
targets (P < 0.05; Table 5-1).
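The over-representation test can be written directly from the hypergeometric distribution. In the sketch below, k is the number of significant drugs hitting a target, K the number of all drugs hitting it, n the number of significant drugs in the network, and N the total number of drugs in the network. The background size N used in the example is a placeholder, since it is not stated in this section, so the printed value is not expected to reproduce Table 5-1.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n drugs are drawn without replacement from N drugs,
    K of which hit the target of interest."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# k=8 of K=9 CALM1-targeting drugs are significant; n=83 significant drugs in the
# network. N=500 is a hypothetical background size chosen only for illustration.
print(hypergeom_upper_tail(k=8, K=9, n=83, N=500))
```

A small upper-tail probability indicates that more of a target's drugs are significant than would be expected if significant drugs were a random subset of the network.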
The top over-represented gene is calmodulin 1 (CALM1), a gene involved in the cell cycle and in
signal transduction; it is a target of 9 CMap drugs, and we found that 8 of these reverse the
transcriptional changes seen with lung cancer. Recent research suggests that CBP501, a drug
currently in Phase II clinical trials for NSCLC, may sensitize tumors to the chemotherapeutic
agents bleomycin and cisplatin by inhibiting CALM1 [25]. Thus, other significant drugs that target
CALM1 may also enhance the effect of chemotherapy. The 8 drugs we identified are bepridil,
felodipine, flunarizine, fluphenazine, loperamide, phenoxybenzamine, pimozide, and
miconazole.
The second-most overrepresented gene is PLA2G4A, whose protein product is a member of the
cytosolic phospholipase A2 family. Cytosolic phospholipase A2 (cPLA2) has been previously
implicated in cancer progression and metastasis. Furthermore, in a mouse model of lung cancer,
the inhibition of cPLA2 activity led to delayed tumour growth [207, 208]. There are 4 drugs
targeting PLA2G4A included in the CMap collection, and all 4 significantly reverse lung cancer
gene changes in our analyses: flunisolide, fluocinonide, fluorometholone, and medrysone.
Table 5-1. Common protein targets of candidate drugs.
Gene        P value     Count (significant CMap drugs)    Count (all CMap drugs)
CALM1       5.42E-07    8                                 9
PLA2G4A     0.000288    4                                 4
DRD2        0.0006      12                                34
HTR2A       0.000614    9                                 21
SERPINA6    0.001465    5                                 8
ABCC8       0.002252    3                                 3
CYP3A3      0.002252    3                                 3
SLC6A4      0.002321    7                                 16
KCNH2       0.002956    5                                 9
SLC6A2      0.005594    6                                 14
ADRA1A      0.008209    8                                 24
ABCB1       0.008724    5                                 11
5.3.8 Significant drugs are broad-acting: they affect more genes than other drugs
We used the CMap gene expression profiles from before and after drug treatment to calculate the
number of genes differentially expressed in response to a drug, for each of the 1,309 drugs in the
collection (see Materials and Methods). We found that significant drugs affect a median of 8.5
genes, while other CMap drugs affect a median of only 3 (Figure 5-6; Wilcoxon test P << 0.01).
Figure 5-6. Significant drugs affect more genes than other Connectivity Map drugs.
We used CMap data to calculate the number of genes that were significantly differentially regulated (P <
0.05) for each of 1,309 drugs. Drugs that we identified as reversing the gene changes seen with lung
cancer affected significantly more genes than other drugs (median of 8.5 vs. 3 genes; Wilcoxon test P <<
0.01).
5.3.9 Many drugs are indicated for lung cancer independently of subtype
We investigated the top drugs that revert expression changes in different lung cancer subtypes by
running CMapBatch on the two largest signature subsets in our collection, adenocarcinoma (10
signatures) and squamous cell carcinoma (6 signatures). We found a very high concordance
among top drugs; 79 drugs are common to the top 100 drugs lists for adenocarcinoma and
squamous cell carcinoma. Furthermore, all 79 drugs are significant in the full 21-signature meta-
analysis.
5.4 Conclusions
Dozens of distinct gene signatures are available for many diseases. We developed CMapBatch to
efficiently integrate these data with the Connectivity Map to automate drug repurposing and
identify stable lists of candidate therapeutics. Using the example of lung cancer, we showed that
CMapBatch improved on previous strategies for drug repurposing based on the analysis of
individual gene signatures. We have made our method publicly available as an online tool at
http://ophid.utoronto.ca/cmapbatch.
5.5 Methods
5.5.1 Code and software
Code for all analyses was written in R 2.14.0. We converted gene names to HG-U133A probeset
IDs for Connectivity Map analysis using the hgu133a.db package (Bioconductor 2.8). The drug-target and
mode of action networks were analyzed using igraph (Bioconductor 2.8) and visualized using
NAViGaTOR 2.2.1 [141], and drug structures were visualized with PyMOL [209]. We
calculated Tanimoto similarity for all pairs of the 1,148 CMap drugs for which PubChem IDs were
available using the PubChem Chemical Structure Clustering Tool [182].
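For reference, the Tanimoto coefficient between two binary fingerprints is the number of shared "on" bits divided by the number of bits set in either fingerprint. The Python sketch below illustrates the definition on toy fingerprints. The fingerprints and drug names are hypothetical, and the sketch does not reproduce the internals of the PubChem tool.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for fingerprints given as sets of 'on' bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


# Hypothetical fingerprints; real ones would come from a cheminformatics toolkit.
fingerprints = {
    "drug_A": {1, 4, 7, 9},
    "drug_B": {1, 4, 8},
    "drug_C": {2, 3},
}
pairwise = {
    (x, y): tanimoto(fingerprints[x], fingerprints[y])
    for x, y in combinations(sorted(fingerprints), 2)
}
print(pairwise)
```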
5.5.2 Data sources
Transcriptional signatures of lung cancer. We downloaded 21 gene signatures of lung cancer
from CDIP version 1.0, the Cancer Data Integration Portal (http://ophid.utoronto.ca/cdip/). We
included signatures from all lung cancer vs. normal comparisons where 10 or more genes were
found to be differentially up- and down-regulated. For Oncomine signatures, we sorted up- and
down-regulated genes by adjusted P value, using a threshold of FDR ≤ 0.05; we retained only
the top 250 up-regulated and top 250 down-regulated genes.
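The filtering steps for a single signature can be sketched in Python as follows. The record layout (gene, direction, adjusted P value) and the function name are hypothetical. The sketch also assumes that the minimum of 10 genes applies to each direction separately, which the text above does not state explicitly.

```python
def build_signature(records, fdr_cutoff=0.05, max_genes=250, min_genes=10):
    """Build (up_genes, down_genes) from (gene, direction, adjusted_p) records.

    Returns None when either direction has fewer than min_genes genes.
    Assumption: the 10-gene minimum is applied per direction.
    """
    up, down = [], []
    for gene, direction, adj_p in records:
        if adj_p <= fdr_cutoff:
            (up if direction == "up" else down).append((adj_p, gene))
    # Sort by adjusted P value (ties broken by gene name) and keep the top genes.
    up_genes = [gene for _, gene in sorted(up)[:max_genes]]
    down_genes = [gene for _, gene in sorted(down)[:max_genes]]
    if len(up_genes) < min_genes or len(down_genes) < min_genes:
        return None
    return up_genes, down_genes


# Toy input with hypothetical gene names.
toy = [("G%d" % i, "up", 0.001 * (i + 1)) for i in range(20)]
toy += [("H%d" % i, "down", 0.002 * (i + 1)) for i in range(15)]
print(build_signature(toy))
```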
Drug-response data. We downloaded rankMatrix, containing the ranks of genes in response to
6,100 drug treatments (corresponding to 1,309 unique drugs), from Connectivity Map Build 02 at
http://www.broadinstitute.org/cmap/.
Interaction networks. We downloaded the drug-target interaction network, where two drugs
share an edge if they share a physical binding partner, from MANTRA [154]. We visualized the
drug target interaction network with NAViGaTOR 2.2.1 [184].
Lists of genes differentially regulated by CMap drugs. We downloaded lists of genes
significantly up- or down-regulated by CMap drugs from [210].
5.5.3 Connectivity map analysis of lung cancer signatures
Mapping gene names to probeset IDs. We mapped human gene IDs to Affymetrix HG-U133A
IDs for connectivity map analysis following previously established protocols [3].
Calculating mean connectivity scores for each signature. For each lung cancer signature, mean
connectivity scores for 1,309 drugs were calculated as previously described [3] and converted to
ranks.
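A minimal Python sketch of this step, assuming that connectivity scores have already been computed for every instance (drug treatment): instance scores are averaged per drug and the means are converted to ranks. The identifiers are hypothetical, and the sketch assumes that rank 1 goes to the most negative mean score, that is, to the drug that most strongly reverses the signature.

```python
from collections import defaultdict


def mean_score_ranks(instance_scores, instance_to_drug):
    """Average instance-level scores per drug and rank the drugs.

    Assumption: rank 1 is the most negative mean score (strongest reversal);
    ties are broken alphabetically by drug name.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for instance, score in instance_scores.items():
        drug = instance_to_drug[instance]
        totals[drug] += score
        counts[drug] += 1
    means = {drug: totals[drug] / counts[drug] for drug in totals}
    ordered = sorted(means, key=lambda drug: (means[drug], drug))
    ranks = {drug: position + 1 for position, drug in enumerate(ordered)}
    return means, ranks


# Toy data: five instances of three hypothetical drugs.
scores = {"i1": -0.8, "i2": -0.6, "i3": 0.5, "i4": 0.1, "i5": -0.2}
mapping = {"i1": "drugX", "i2": "drugX", "i3": "drugY", "i4": "drugZ", "i5": "drugZ"}
means, ranks = mean_score_ranks(scores, mapping)
print(ranks)
```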
5.5.4 Meta-analysis of drug-response data
Combining ranked lists of drugs to construct a consensus ranked list. We adapted the Rank
Product method [17] to identify drugs that consistently reverse the transcriptional changes seen
with lung cancer across a large collection of signatures. For each drug, we calculated the product
of its ranks in all lung cancer signatures.
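The rank product itself is simple to compute. The Python sketch below multiplies each drug's ranks across signatures and sorts drugs by the result; the drug names and toy ranks are hypothetical, and every signature is assumed to rank the same set of drugs.

```python
import math


def rank_products(rank_lists):
    """Multiply each drug's rank across signatures.

    rank_lists: one dict per signature mapping drug -> rank (1 = best).
    Assumption: every signature ranks the same set of drugs.
    """
    drugs = rank_lists[0].keys()
    return {drug: math.prod(ranks[drug] for ranks in rank_lists) for drug in drugs}


# Toy ranks for three hypothetical drugs in three signatures.
per_signature = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 2, "b": 1, "c": 3},
    {"a": 1, "b": 3, "c": 2},
]
products = rank_products(per_signature)
consensus = sorted(products, key=products.get)  # smallest product first
print(products, consensus)
```

Python integers do not overflow, so the product can be used directly even for 21 signatures of 1,309 drugs; in languages with fixed-width numbers, summing log ranks gives the same ordering.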
Identifying drugs with significantly small rank products. We randomly permuted the assignment
of connectivity scores to drugs for the 6,100 instances (drug treatments), recalculated mean
scores and drug ranks for 1,309 drugs in each signature, and re-calculated randomized rank
products 10,000 times. We used this background distribution to calculate p-values and estimate
false discovery rates.
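The permutation procedure can be sketched in Python as follows. Within each signature, instance scores are shuffled across instances, per-drug means and ranks are recomputed, and the rank product is recalculated to build a null distribution. Two details are assumptions of this sketch rather than statements from the text: each drug is compared with its own null distribution, and the empirical p-value uses an add-one correction. All names and toy values are hypothetical, and false discovery rate estimation is not shown.

```python
import math
import random
from collections import defaultdict


def drug_ranks(scores, instance_drugs):
    """Rank drugs by mean instance score (rank 1 = most negative mean)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for score, drug in zip(scores, instance_drugs):
        sums[drug] += score
        counts[drug] += 1
    means = {d: sums[d] / counts[d] for d in sums}
    ordered = sorted(means, key=lambda d: (means[d], d))
    return {d: i + 1 for i, d in enumerate(ordered)}


def rank_product(signature_scores, instance_drugs):
    """Product of each drug's rank across all signatures."""
    per_sig = [drug_ranks(s, instance_drugs) for s in signature_scores]
    return {d: math.prod(r[d] for r in per_sig) for d in per_sig[0]}


def permutation_p_values(signature_scores, instance_drugs, n_perm=1000, seed=0):
    """Empirical p-values for small rank products.

    Assumptions: per-drug null distribution and add-one correction.
    """
    rng = random.Random(seed)
    observed = rank_product(signature_scores, instance_drugs)
    hits = dict.fromkeys(observed, 0)
    for _ in range(n_perm):
        # Shuffle the score-to-instance assignment within each signature.
        shuffled = [rng.sample(s, len(s)) for s in signature_scores]
        null = rank_product(shuffled, instance_drugs)
        for d in observed:
            if null[d] <= observed[d]:
                hits[d] += 1
    return {d: (hits[d] + 1) / (n_perm + 1) for d in observed}


# Toy data: eight instances of four hypothetical drugs, four signatures.
instance_drugs = ["A", "A", "B", "B", "C", "C", "D", "D"]
signature_scores = [
    [-0.9, -0.8, 0.1, 0.2, 0.3, -0.1, 0.5, 0.4],
    [-0.7, -0.9, 0.0, 0.3, 0.2, 0.1, 0.6, 0.5],
    [-0.6, -0.95, 0.25, -0.05, 0.15, 0.35, 0.7, 0.45],
    [-0.85, -0.75, 0.05, 0.3, -0.2, 0.4, 0.65, 0.55],
]
p_values = permutation_p_values(signature_scores, instance_drugs, n_perm=500)
print(p_values)
```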
5.5.5 NCI-60 analysis of significant drugs
We restricted our analyses to the NCI-60 GI50 (50% growth inhibition) data and to those lung
cancer cell lines where at least 100 Connectivity Map drugs were tested (there were nine of
these, all NSCLC: NCI-H23, NCI-H522, A549/ATCC, EKVX, NCI-H226, NCI-H322M,
NCI-H460, HOP-62, HOP-92). As different GI50 thresholds were used to denote minimal activity in
response to a drug for different concentration ranges, we filtered the data to make results
comparable across drugs. We retained only those entries with an LCONC (maximum log10
concentration) of -4 and where the drug concentration was measured in units of molarity.
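The two filtering steps can be sketched in Python as follows. The record fields (cell_line, drug, lconc, unit) and the unit code "M" are hypothetical stand-ins for the corresponding columns of the NCI-60 data files. The sketch assumes that cell lines are selected before the concentration filter is applied.

```python
from collections import defaultdict


def select_cell_lines(rows, cmap_drugs, min_drugs=100):
    """Return cell lines in which at least min_drugs CMap drugs were tested."""
    tested = defaultdict(set)
    for row in rows:
        if row["drug"] in cmap_drugs:
            tested[row["cell_line"]].add(row["drug"])
    return {line for line, drugs in tested.items() if len(drugs) >= min_drugs}


def filter_entries(rows, cell_lines, lconc=-4.0, unit="M"):
    """Keep entries for the selected cell lines with the required
    maximum log10 concentration and molar concentration units."""
    return [
        row for row in rows
        if row["cell_line"] in cell_lines
        and row["lconc"] == lconc
        and row["unit"] == unit
    ]


# Toy records; field names and unit codes are assumptions of this sketch.
rows = [
    {"cell_line": "A549/ATCC", "drug": "d1", "lconc": -4.0, "unit": "M"},
    {"cell_line": "A549/ATCC", "drug": "d2", "lconc": -5.0, "unit": "M"},
    {"cell_line": "EKVX", "drug": "d1", "lconc": -4.0, "unit": "M"},
    {"cell_line": "A549/ATCC", "drug": "d3", "lconc": -4.0, "unit": "ug/mL"},
]
lines = select_cell_lines(rows, cmap_drugs={"d1", "d2"}, min_drugs=2)
kept = filter_entries(rows, lines)
print(lines, kept)
```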
5.6 Acknowledgements
This work was supported in part by the Ontario Research Fund (GL2-01-030 and RE-03-020),
Canadian Institutes of Health Research (BIO-99745), the Canada Foundation for Innovation (CFI
#12301 and #203383), the Canada Research Chair Program, and IBM to IJ, and the Ontario
Ministry of Health and Long Term Care. The views expressed do not necessarily reflect those of
the OMOHLTC.
5.7 Supplementary Material
Table 5-S1. Twenty-one lung cancer gene signatures (tumour vs. normal comparisons).
Histology PMID Source
Adenocarcinoma 18992152 CDIP
Adenocarcinoma 16549822 CDIP
Adenocarcinoma 18927117 CDIP
Adenocarcinoma 12118244 Oncomine
Adenocarcinoma 11707567 Oncomine
Carcinoid 11707567 Oncomine
Small Cell 11707567 Oncomine
Squamous 11707567 Oncomine
Large Cell 11707590 Oncomine
Adenocarcinoma 11707590 Oncomine
Small Cell 11707590 Oncomine
Squamous 11707590 Oncomine
Large Cell 20421987 Oncomine
Adenocarcinoma 20421987 Oncomine
Squamous 20421987 Oncomine
Adenocarcinoma 18297132 Oncomine
Adenocarcinoma 16314486 Oncomine
Adenocarcinoma 17540040 Oncomine
Squamous 15833835 Oncomine
Squamous 16188928 Oncomine
Squamous 14581339 Oncomine
6 General conclusions and significance
6.1 Conclusions
Aging and disease are complex and heterogeneous biological processes. HTP technologies
invented in the past two decades are only just beginning to provide us with a comprehensive
genome-wide view of how aging and disease affect cells, tissues, and organisms, from the
perspectives of transcription, translation, methylation, and several other modalities. New analysis
methods that integrate these comprehensive and complementary data have enormous potential to
transform our understanding of the basic mechanisms of aging and disease and to suggest new
and better therapies to treat their pathological effects.
As discussed in Chapter 1, there are several major challenges to extracting the maximum
information from these new data. HTP data are noisy, and analysis techniques designed for
small-scale biological experiments often do not translate well to the setting of ‘big data’. In the
four research chapters of this thesis, I described the development and application of new analysis
methods and strategies to the problems of identifying biomarkers and therapeutics for aging and
disease.
Chapters 2, 3, and 5 made novel methodological contributions, and in Chapter 4 existing
bioinformatics methods were applied to gain insight into a particular biological problem. In
Chapter 2, I proposed a novel algorithm for identifying subnetwork biomarkers of aging. In this
algorithm, biomarkers are networks of genes selected based on a score that takes into account
age-dependent activity (Spearman correlation of subnetwork activity with age) and a locally-
defined graph-theoretic measure of modularity. Subnetworks are grown starting from seed genes
in an interaction network; at each stage of the growth procedure, the algorithm considers all
network neighbors of the current subnetwork, and greedily maximizes subnetwork score by
adding the neighbor leading to the largest score increase. Subnetworks identified with this
algorithm outperformed previous ones on key measures, yielding biomarkers that were more
conserved across studies and performed better on a machine learning task (predicting age based
on expression data using Support Vector Regression algorithms). This work was the first to use a
new subnetwork performance criterion that incorporates modularity into the expression for
subnetwork score, and the first to integrate network information with gene expression data to
identify biomarkers of aging. In Chapters 3 and 5, I developed the CMapBatch tool; CMapBatch
takes advantage of the large quantity of public gene expression data to help speed drug
discovery. In CMapBatch, Kolmogorov-Smirnov statistics are first calculated to determine
connectivity scores that link individual gene signatures of some disease (e.g. lung cancer) to
1,309 drugs from the CMap collection [75]. Connectivity scores reflect the extent to which drug
treatments reverse (or mimic) the gene expression changes in the query signature. The drugs are
next ranked by connectivity score for each signature, and finally an adapted Rank Product
method [67] is applied to combine the ranked lists and identify drugs that are consistently highly
ranked as the best therapeutics for the disease across a large set of independent signatures of
disease. CMapBatch produced more stable sets of drug candidates for lung cancer than previous
methods (Chapter 5), and in silico validation revealed that CMapBatch drug candidates
significantly inhibited growth in all nine lung cancer cell lines tested. In Chapter 4, I
applied existing methods of gene-set and drug-network analysis to study drug effects at the level
of systems, networks, and phenotypes in S. cerevisiae, and built the NetwoRx web portal to store
these data.
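The greedy subnetwork growth procedure summarized above for Chapter 2 can be illustrated with a short Python sketch. The exact scoring function is not restated in this chapter, so the sketch makes explicit assumptions: subnetwork activity is the mean expression of member genes, the age term is the absolute Spearman correlation of activity with age, the modularity term is the fraction of edges touching the subnetwork that are internal to it, and the score is the product of the two terms. The graph, expression values, and function names are hypothetical.

```python
def midranks(values):
    """1-based ranks with ties averaged."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = midranks(x), midranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def modularity(genes, graph):
    """Assumed local modularity: internal edges / (internal + boundary edges)."""
    internal = sum(1 for g in genes for n in graph[g] if n in genes) / 2
    boundary = sum(1 for g in genes for n in graph[g] if n not in genes)
    total = internal + boundary
    return internal / total if total else 0.0


def score(genes, graph, expr, ages):
    """Assumed score: |Spearman(activity, age)| times local modularity."""
    activity = [sum(expr[g][i] for g in genes) / len(genes) for i in range(len(ages))]
    return abs(spearman(activity, ages)) * modularity(genes, graph)


def grow(seed, graph, expr, ages, max_size=10):
    """Greedily add the neighbor giving the largest score increase."""
    sub = {seed}
    best = score(sub, graph, expr, ages)
    while len(sub) < max_size:
        neighbors = set().union(*(graph[g] for g in sub)) - sub
        if not neighbors:
            break
        cand_score, cand = max(
            (score(sub | {n}, graph, expr, ages), n) for n in sorted(neighbors)
        )
        if cand_score <= best:  # stop when no neighbor improves the score
            break
        sub.add(cand)
        best = cand_score
    return sub, best


# Toy undirected network and expression profiles over six ages.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c", "e"}, "e": {"d"}}
ages = [1, 2, 3, 4, 5, 6]
expr = {
    "a": [1, 2, 3, 4, 5, 6], "b": [2, 3, 4, 5, 6, 8], "c": [1, 3, 2, 5, 6, 7],
    "d": [20, 4, 18, 2, 10, 0], "e": [3, 3, 3, 3, 3, 3],
}
subnetwork, best_score = grow("a", graph, expr, ages)
print(sorted(subnetwork), best_score)
```

In this toy example, growth stops once adding the only remaining neighbor would lower the score, which mirrors the stopping behaviour of a greedy search.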
The four projects described in this thesis were tested in a range of model systems – yeast, worm,
mouse, and human cell lines – and were concerned with a variety of distinct computational tasks
using diverse HTP data sources. The common threads running through all four were (1) the shift
of focus away from single genes to the systems level, and (2) the integration of complementary
HTP data sources.
6.1.1 From genes to pathways, phenotypes, and networks
Past work has borne out the hypothesis that shifting focus away from individual genes and
towards more holistic gene modules, networks, pathways, and phenotypes can bring several
advantages – as discussed in Chapter 1, systems-level differences tend to be more reproducible
across studies, and biomarkers based on modules or gene groups can perform better on
classification tasks. Systems-level analyses played central roles in each of the four projects that
constitute this thesis. For example, in Chapter 2 I showed that high-throughput information about
the higher-level associations between genes – in the form of a functional interaction network –
can yield new insights into the transcriptional programs of aging. I identified modular
subnetworks associated with worm aging – highly interconnected groups of genes that change
activity with age – and showed that they are effective biomarkers for predicting worm age on the
basis of gene expression. And in Chapter 4, I built the NetwoRx web portal to facilitate the
systems-level interrogation of yeast chemogenomic data. I illustrated with examples how
NetwoRx can be applied to analyze mode-of-action of cancer drugs, repurpose psychoactive
drugs, and predict new drugs that modulate yeast aging.
6.1.2 Integrating complementary HTP data sources
As reviewed in Chapter 1, noise and biological heterogeneity complicate the analysis of HTP
data. Chapters 2-5 relied on HTP data integration to reduce noise and identify more robust or
accurate biomarkers and therapeutics. For example, dozens of distinct gene signatures are
available for many diseases. I developed CMapBatch to efficiently integrate these data with the
Connectivity Map to automate drug repurposing and identify stable lists of candidate
therapeutics, and applied this method to identify candidate calorie restriction mimetics (Chapter
3) and lung cancer therapeutics (Chapter 5). Using the example of lung cancer, I showed that
CMapBatch improved on previous strategies for drug repurposing based on the analysis of
individual gene signatures (Chapter 5). In most projects, I also integrated HTP data of multiple
types, e.g., gene expression data with genome-wide RNAi phenotypes (Chapter 2) and large
scale drug-induced growth-inhibition data (Chapter 5).
6.2 Open questions and future work
In the future, more and different HTP data will enable the development of improved biomarkers
and therapeutics for aging and disease.
6.2.1 Limitations in HTP data
The quality of any integrative computational analysis is necessarily limited by the data that are
available. For example, in the case of intra- and inter-tumor heterogeneity, we may sometimes
have too few samples or too much noise to be able to develop cancer signatures with the required
accuracy and reproducibility; similarly, existing microarray studies of aging that sample only
two time-points (old vs. young animals) may not contain sufficient information for the accurate
modeling of complex age-related biological processes. To take a couple of specific examples
relevant to the work in this thesis, Chapter 4 uses chemogenomic data taken from experiments in
S. cerevisiae. Importantly, these screens can be used only with compounds that are bioactive in
yeast, can capture only those drug-gene interactions that affect growth, and are relevant to
disease only for human proteins that have yeast homologs. Chapters 3 and 5 depend on
the Connectivity Map, which contains data on the transcriptional response to drugs in human cell
lines. Cell lines are an imperfect model for in vivo drug response.
6.2.2 Future work
As HTP technologies become cheaper and more widely adopted, the biological states of health,
aging and disease will be sampled to a density sufficient to support sophisticated new analyses
and models that will transform the field of translational medicine. Furthermore, newer
technologies such as whole genome sequencing will soon provide entirely different perspectives
on aging and disease. For example, roughly 100 cancer genomes have been sequenced so far –
most of these just within the past year – and several major projects are underway, which should
see that number quickly increase. The International Cancer Genome Consortium (ICGC) plans to
sequence 500 tumors from each of 50 different cancers [211], and the Cancer Genome Atlas
(TCGA) will sequence more than 20 different tumor types in the next 5 years [212]. Making
these data and related clinical information publicly available will significantly contribute to our
understanding of the molecular changes in disease, enabling new discoveries as well as more
comprehensive validation of novel prognostic and predictive signatures. The rapid pace of
technology development will need to be matched by the equally rapid development of new tools
and algorithms to handle the large volumes of HTP data and integrate them with existing
knowledge.
The four research projects described in this thesis represent preliminary steps towards a true
systems biology understanding of aging, disease, and drug response. The methods developed in
all four projects are general, in the sense that they could be applied to a number of distinct
biological problems and domains. These and similar approaches that leverage the large quantity
of public HTP data on drugs, aging, and disease can substantially accelerate the identification
and development of new biomarkers and therapeutics.
7 References
1. Wieser D, Papatheodorou I, Ziehm M, Thornton JM: Computational biology for
ageing. Philos Trans R Soc Lond B Biol Sci 2011, 366:51-63.
2. Zahn JM, Poosala S, Owen AB, Ingram DK, Lustig A, Carter A, Weeraratna AT, Taub
DD, Gorospe M, Mazan-Mamczarz K, et al: AGEMAP: a gene expression database for
aging in mice. PLoS Genetics 2007, 3:e201.
3. Zahn JM, Kim SK: Systems biology of aging in four species. Current Opinion in
Biotechnology 2007, 18:355-359.
4. Antosh M, Whitaker R, Kroll A, Hosier S, Chang C, Bauer J, Cooper L, Neretti N,
Helfand SL: Comparative transcriptional pathway bioinformatic analysis of dietary
restriction, Sir2, p53 and resveratrol life span extension in Drosophila. Cell Cycle
2011, 10:904-911.
5. Estep PW, 3rd, Warner JB, Bulyk ML: Short-term calorie restriction in male mice
feminizes gene expression and alters key regulators of conserved aging regulatory
pathways. PLoS One 2009, 4:e5242.
6. Spindler SR, Mote PL: Screening candidate longevity therapeutics using gene-
expression arrays. Gerontology 2007, 53:306-321.
7. Rhodes DR, Chinnaiyan AM: Integrative analysis of the cancer transcriptome. Nat
Genet 2005, 37 Suppl:S31-37.
8. Pegram MD, Lipton A, Hayes DF, Weber BL, Baselga JM, Tripathy D, Baly D,
Baughman SA, Twaddell T, Glaspy JA, Slamon DJ: Phase II study of receptor-
enhanced chemosensitivity using recombinant humanized anti-p185HER2/neu
monoclonal antibody plus cisplatin in patients with HER2/neu-overexpressing
metastatic breast cancer refractory to chemotherapy treatment. J Clin Oncol 1998,
16:2659-2671.
9. Slamon DJ, Press MF: Alterations in the TOP2A and HER2 genes: association with
adjuvant anthracycline sensitivity in human breast cancers. J Natl Cancer Inst 2009,
101:615-618.
10. Lowe JA, Jones P, Wilson DM: Network biology as a new approach to drug
discovery. Curr Opin Drug Discov Devel 2010, 13:524-526.
11. Buyse M, Loi S, van't Veer L, Viale G, Delorenzi M, Glas AM, d'Assignies MS, Bergh J,
Lidereau R, Ellis P, et al: Validation and clinical utility of a 70-gene prognostic
signature for women with node-negative breast cancer. J Natl Cancer Inst 2006,
98:1183-1192.
12. Spentzos D, Levine DA, Ramoni MF, Joseph M, Gu X, Boyd J, Libermann TA,
Cannistra SA: Gene expression signature with independent prognostic significance in
epithelial ovarian cancer. J Clin Oncol 2004, 22:4700-4710.
13. Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta KJ,
Rubin MA, Chinnaiyan AM: Delineation of prognostic biomarkers in prostate cancer.
Nature 2001, 412:822-826.
14. Zhu CQ, Strumpf D, Li CY, Li Q, Liu N, Der S, Shepherd FA, Tsao MS, Jurisica I:
Prognostic gene expression signature for squamous cell carcinoma of lung. Clin
Cancer Res 2010, 16:5038-5047.
15. Fraser HB, Khaitovich P, Plotkin JB, Pääbo S, Eisen MB: Aging and Gene Expression
in the Primate Brain. PLoS Biology 2005, 3:e274.
16. Flachsbart F, Caliebe A, Kleindorp R, Blanche H, von Eller-Eberstein H, Nikolaus S,
Schreiber S, Nebel A: Association of FOXO3A variation with human longevity
confirmed in German centenarians. Proc Natl Acad Sci U S A 2009, 106:2700-2705.
17. Christensen K, Johnson TE, Vaupel JW: The quest for genetic determinants of human
longevity: challenges and insights. Nat Rev Genet 2006, 7:436-448.
18. Kleindorp R, Flachsbart F, Puca AA, Malovini A, Schreiber S, Nebel A: Candidate gene
study of FOXO1, FOXO4, and FOXO6 reveals no association with human longevity
in Germans. Aging Cell 2011, 10:622-628.
19. Lau SK, Boutros PC, Pintilie M, Blackhall FH, Zhu CQ, Strumpf D, Johnston MR,
Darling G, Keshavjee S, Waddell TK, et al: Three-gene prognostic classifier for early-
stage non small-cell lung cancer. J Clin Oncol 2007, 25:5562-5569.
20. Dupuy A, Simon RM: Critical review of published microarray studies for cancer
outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst 2007,
99:147-157.
21. Diamandis EP: Cancer biomarkers: can we turn recent failures into success? J Natl
Cancer Inst 2010, 102:1462-1467.
22. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia
JG, Geoghegan J, Germino G, et al: Multiple-laboratory comparison of microarray
platforms. Nat Methods 2005, 2:345-350.
23. Bell AW, Deutsch EW, Au CE, Kearney RE, Beavis R, Sechi S, Nilsson T, Bergeron JJ:
A HUPO test sample study reveals common problems in mass spectrometry-based
proteomics. Nat Methods 2009, 6:423-430.
24. Auffray C, Chen Z, Hood L: Systems medicine: the future of medical genomics and
healthcare. Genome Med 2009, 1:2.
25. Augen J: Information technology to the rescue! Nat Biotechnol 2001, 19 Suppl:BE39-
40.
26. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF,
Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al: Towards a proteome-scale map
of the human protein-protein interaction network. Nature 2005, 437:1173-1178.
27. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M,
Zenkner M, Schoenherr A, Koeppen S, et al: A human protein-protein interaction
network: a resource for annotating the proteome. Cell 2005, 122:957-968.
28. Braun P, Tasan M, Dreze M, Barrios-Rodiles M, Lemmens I, Yu H, Sahalie JM, Murray
RR, Roncari L, de Smet AS, et al: An experimentally derived confidence score for
binary protein-protein interactions. Nat Methods 2009, 6:91-97.
29. Gstaiger M, Aebersold R: Applying mass spectrometry-based proteomics to genetics,
genomics and network biology. Nat Rev Genet 2009, 10:617-627.
30. Kislinger T, Cox B, Kannan A, Chung C, Hu P, Ignatchenko A, Scott MS, Gramolini
AO, Morris Q, Hallett MT, et al: Global survey of organ and organelle protein
expression in mouse: combined proteomic and transcriptomic profiling. Cell 2006,
125:173-186.
31. Bachtiary B, Boutros PC, Pintilie M, Shi W, Bastianutto C, Li JH, Schwock J, Zhang W,
Penn LZ, Jurisica I, et al: Gene Expression Profiling in Cervical Cancer: An
Exploration of Intratumor Heterogeneity. Clin Cancer Res 2006, 12:5632-5640.
32. Blackhall FH, Pintilie M, Wigle D, Jurisica I, Liu N, Radulovitch N, Keshavjee S,
Johnston M, Shepherd FA, Tsao M-S: Stability and heterogeneity of expression
profiles in lung cancer specimens harvested following surgical resection. Neoplasia
2004, 6:761-767.
33. Axelrod DE, Miller N, Chapman JA: Avoiding Pitfalls in the Statistical Analysis of
Heterogeneous Tumors. Biomed Inform Insights 2009, 2:11-18.
34. Jubb AM, Buffa FM, Harris AL: Assessment of tumour hypoxia for prediction of
response to therapy and cancer prognosis. J Cell Mol Med 2010, 14:18-29.
35. Cleator SJ, Powles TJ, Dexter T, Fulford L, Mackay A, Smith IE, Valgeirsson H,
Ashworth A, Dowsett M: The effect of the stromal component of breast tumours on
prediction of clinical outcome using gene expression microarray analysis. Breast
Cancer Res 2006, 8:R32.
36. Myhre S, Mohammed H, Tramm T, Alsner J, Finak G, Park M, Overgaard J, Borresen-
Dale AL, Frigessi A, Sorlie T: In silico ascription of gene expression differences to
tumor and stromal cells in a model to study impact on breast cancer outcome. PLoS
One 2010, 5:e14002.
37. Fend F, Raffeld M: Laser capture microdissection in pathology. J Clin Pathol 2000,
53:666-672.
38. Chandran UR, Dhir R, Ma C, Michalopoulos G, Becich M, Gilbertson J: Differences in
gene expression in prostate cancer, normal appearing prostate tissue adjacent to
cancer and prostate tissue from cancer free organ donors. BMC Cancer 2005, 5:45.
39. Bahar R, Hartmann CH, Rodriguez KA, Denny AD, Busuttil RA, Dolle ME, Calder RB,
Chisholm GB, Pollock BH, Klein CA, Vijg J: Increased cell-to-cell variation in gene
expression in ageing mouse heart. Nature 2006, 441:1011-1014.
40. Li Z, Wright FA, Royland J: Age-dependent variability in gene expression in male
Fischer 344 rat retina. Toxicol Sci 2009, 107:281-292.
41. Somel M, Khaitovich P, Bahn S, Paabo S, Lachmann M: Gene expression becomes
heterogeneous with age. Curr Biol 2006, 16:R359-360.
42. de Magalhaes JP, Curado J, Church GM: Meta-analysis of age-related gene expression
profiles identifies common signatures of aging. Bioinformatics 2009, 25:875-881.
43. Elias JE, Haas W, Faherty BK, Gygi SP: Comparative evaluation of mass
spectrometry platforms used in large-scale proteomics investigations. Nat Methods
2005, 2:667-675.
44. Tan PK, Downey TJ, Spitznagel EL, Jr., Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka
BM, Cam MC: Evaluation of gene expression measurements from commercial
microarray platforms. Nucleic Acids Res 2003, 31:5676-5684.
45. Curtis C, Lynch AG, Dunning MJ, Spiteri I, Marioni JC, Hadfield J, Chin SF, Brenton
JD, Tavare S, Caldas C: The pitfalls of platform comparison: DNA copy number
array technologies assessed. BMC Genomics 2009, 10:588.
46. Barnes M, Freudenberg J, Thompson S, Aronow B, Pavlidis P: Experimental
comparison and cross-validation of the Affymetrix and Illumina gene expression
analysis platforms. Nucleic Acids Res 2005, 33:5914-5923.
47. Eggle D, Debey-Pascher S, Beyer M, Schultze JL: The development of a comparison
approach for Illumina bead chips unravels unexpected challenges applying newest
generation microarrays. BMC Bioinformatics 2009, 10:186.
48. Sandberg R, Larsson O: Improved precision and accuracy for microarrays using
updated probe set definitions. BMC Bioinformatics 2007, 8:48.
49. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering
statistically significant pathways in expression profiling studies. Proc Natl Acad Sci
U S A 2005, 102:13544-13549.
50. Zhang M, Yao C, Guo Z, Zou J, Zhang L, Xiao H, Wang D, Yang D, Gong X, Zhu J, et
al: Apparently low reproducibility of true differential expression discoveries in
microarray studies. Bioinformatics 2008, 24:2057-2063.
51. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich
A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a
knowledge-based approach for interpreting genome-wide expression profiles. Proc
Natl Acad Sci U S A 2005, 102:15545-15550.
52. Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast
cancer metastasis. Mol Syst Biol 2007, 3:140.
53. Fujita A, Gomes LR, Sato JR, Yamaguchi R, Thomaz CE, Sogayar MC, Miyano S:
Multivariate gene expression analysis reveals functional connectivity changes
between normal/tumoral prostates. BMC Syst Biol 2008, 2:106.
54. Zhu CQ, da Cunha Santos G, Ding K, Sakurada A, Cutz JC, Liu N, Zhang T, Marrano P,
Whitehead M, Squire JA, et al: Role of KRAS and EGFR as biomarkers of response
to erlotinib in National Cancer Institute of Canada Clinical Trials Group Study
BR.21. J Clin Oncol 2008, 26:4268-4275.
55. Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast
cancer: is there a unique set? Bioinformatics 2005, 21:171-178.
56. Ein-Dor L, Zuk O, Domany E: Thousands of samples are needed to generate a robust
gene list for predicting outcome in cancer. Proc Natl Acad Sci U S A 2006, 103:5923-
5928.
57. Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S,
Jurisica I, Giordano TJ, Misek DE, et al: Gene expression-based survival prediction in
lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 2008, 14:822-
827.
58. Boutros PC, Lau SK, Pintilie M, Liu N, Shepherd FA, Der SD, Tsao MS, Penn LZ,
Jurisica I: Prognostic gene signatures for non-small-cell lung cancer. Proc Natl Acad
Sci U S A 2009, 106:2824-2828.
59. van Vliet MH, Reyal F, Horlings HM, van de Vijver MJ, Reinders MJ, Wessels LF:
Pooling breast cancer datasets has a synergetic effect on classification performance
and improves signature stability. BMC Genomics 2008, 9:375.
60. Warnat P, Eils R, Brors B: Cross-platform analysis of cancer microarray data
improves gene expression based classification of phenotypes. BMC Bioinformatics
2005, 6:265.
61. Fierro AC, Vandenbussche F, Engelen K, Van de Peer Y, Marchal K: Meta Analysis of
Gene Expression Data within and Across Species. Curr Genomics 2008, 9:525-534.
62. Plank M, Wuttke D, van Dam S, Clarke SA, de Magalhaes JP: A meta-analysis of
caloric restriction gene expression profiles to infer common signatures and
regulatory mechanisms. Mol Biosyst 2012, 8:1339-1349.
63. Pan F, Chiu CH, Pulapura S, Mehan MR, Nunez-Iglesias J, Zhang K, Kamath K,
Waterman MS, Finch CE, Zhou XJ: Gene Aging Nexus: a web database and data
mining platform for microarray data on aging. Nucleic Acids Res 2007, 35:D756-759.
64. de Magalhaes JP, Budovsky A, Lehmann G, Costa J, Li Y, Fraifeld V, Church GM: The
Human Ageing Genomic Resources: online databases and tools for
biogerontologists. Aging Cell 2009, 8:65-72.
65. Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, Briggs BB,
Barrette TR, Anstet MJ, Kincead-Beal C, Kulkarni P, et al: Oncomine 3.0: genes,
pathways, and networks in a collection of 18,000 cancer gene expression profiles.
Neoplasia 2007, 9:166-180.
66. Culhane AC, Schwarzl T, Sultana R, Picard KC, Picard SC, Lu TH, Franklin KR, French
SJ, Papenhausen G, Correll M, Quackenbush J: GeneSigDB--a curated database of
gene expression signatures. Nucleic Acids Res 2010, 38:D716-725.
67. Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet
powerful, new method to detect differentially regulated genes in replicated
microarray experiments. FEBS Lett 2004, 573:83-92.
68. Hong F, Breitling R: A comparison of meta-analysis methods for detecting
differentially expressed genes in microarray experiments. Bioinformatics 2008,
24:374-382.
69. Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J: RankProd: a
bioconductor package for detecting differentially expressed genes in meta-analysis.
Bioinformatics 2006, 22:2825-2827.
70. Varambally S, Yu J, Laxman B, Rhodes DR, Mehra R, Tomlins SA, Shah RB, Chandran
U, Monzon FA, Becich MJ, et al: Integrative genomic and proteomic analysis of
prostate cancer reveals signatures of metastatic progression. Cancer Cell 2005,
8:393-406.
71. Gortzak-Uzan L, Ignatchenko A, Evangelou AI, Agochiya M, Brown KA, St Onge P,
Kireeva I, Schmitt-Ulms G, Brown TJ, Murphy J, et al: A proteome resource of ovarian
cancer ascites: integrated proteomic and bioinformatic analyses to identify putative
biomarkers. J Proteome Res 2008, 7:339-351.
72. Brown KR, Jurisica I: Unequal evolutionary conservation of human protein
interactions in interologous networks. Genome Biol 2007, 8:R95.
73. Ackermann M, Strimmer K: A general modular framework for gene set enrichment
analysis. BMC Bioinformatics 2009, 10:47.
74. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski
K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The
Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
75. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP,
Subramanian A, Ross KN, et al: The Connectivity Map: using gene-expression
signatures to connect small molecules, genes, and disease. Science 2006, 313:1929-
1935.
76. De Preter K, De Brouwer S, Van Maerken T, Pattyn F, Schramm A, Eggert A,
Vandesompele J, Speleman F: Meta-mining of neuroblastoma and neuroblast gene
expression profiles reveals candidate therapeutic compounds. Clin Cancer Res 2009,
15:3690-3696.
77. Vilar E, Mukherjee B, Kuick R, Raskin L, Misek DE, Taylor JM, Giordano TJ, Hanash
SM, Fearon ER, Rennert G, Gruber SB: Gene expression patterns in mismatch repair-
deficient colorectal cancers highlight the potential therapeutic role of inhibitors of
the phosphatidylinositol 3-kinase-AKT-mammalian target of rapamycin pathway.
Clin Cancer Res 2009, 15:2829-2839.
78. Hu P, Bader G, Wigle DA, Emili A: Computational prediction of cancer-gene
function. Nat Rev Cancer 2007, 7:23-34.
79. Pena-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M,
Pagnani A, Kim WK, et al: A critical assessment of Mus musculus gene function
prediction using integrated genomic evidence. Genome Biol 2008, 9 Suppl 1:S2.
80. Lee I, Lehner B, Vavouri T, Shin J, Fraser AG, Marcotte EM: Predicting genetic
modifier loci using functional gene networks. Genome Res 2010, 20:1143-1153.
81. Mills GB, Jurisica I, Yarden Y, Norman JC: Genomic amplicons target vesicle
recycling in breast cancer. J Clin Invest 2009, 119:2123-2127.
82. Agarwal R, Gonzalez-Angulo AM, Myhre S, Carey M, Lee JS, Overgaard J, Alsner J,
Stemke-Hale K, Lluch A, Neve RM, et al: Integrative analysis of cyclin protein levels
identifies cyclin b1 as a classifier and predictor of outcomes in breast cancer. Clin
Cancer Res 2009, 15:3654-3662.
83. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein
networks. Nature 2001, 411:41-42.
84. Hahn MW, Kern AD: Comparative genomics of centrality and essentiality in three
eukaryotic protein-interaction networks. Mol Biol Evol 2005, 22:803-806.
85. Maslov S, Sneppen K: Specificity and stability in topology of protein networks.
Science 2002, 296:910-913.
86. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM,
Michon AM, Cruciat CM, et al: Functional organization of the yeast proteome by
systematic analysis of protein complexes. Nature 2002, 415:141-147.
87. Wuchty S: Topology and weights in a protein domain interaction network--a novel
way to predict protein interactions. BMC Genomics 2006, 7:122.
88. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical
organization of modularity in metabolic networks. Science 2002, 297:1551-1555.
89. Yu H, Paccanaro A, Trifonov V, Gerstein M: Predicting interactions in protein
networks by completing defective cliques. Bioinformatics 2006.
90. Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJ,
Cusick ME, Roth FP, Vidal M: Evidence for dynamically organized modularity in the
yeast protein-protein interaction network. Nature 2004, 430:88-93.
91. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U: Network motifs:
simple building blocks of complex networks. Science 2002, 298:824-827.
92. Rice JJ, Kershenbaum A, Stolovitzky G: Lasting impressions: motifs in protein-
protein maps may provide footprints of evolutionary events. Proc Natl Acad Sci U S
A 2005, 102:3173-3174.
93. Przulj N, Corneil DG, Jurisica I: Efficient estimation of graphlet frequency
distributions in protein-protein interaction networks. Bioinformatics 2006, 22:974-
980.
94. Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and signalling
circuits in molecular interaction networks. Bioinformatics 2002, 18 Suppl 1:S233-
240.
95. Nacu S, Critchley-Thorne R, Lee P, Holmes S: Gene expression network analysis and
applications to immunology. Bioinformatics 2007, 23:850-858.
96. Hwang YC, Lin CC, Chang JY, Mori H, Juan HF, Huang HC: Predicting essential
genes based on network and sequence analysis. Mol Biosyst 2009, 5:1672-1678.
97. Fortney K, Kotlyar M, Jurisica I: Inferring the functions of longevity genes with
modular subnetwork biomarkers of Caenorhabditis elegans aging. Genome Biol
2010, 11:R13.
98. Cover TM, Thomas JA: Elements of Information Theory. Wiley-Interscience; 1991.
99. Nibbe RK, Markowitz S, Myeroff L, Ewing R, Chance MR: Discovery and scoring of
protein interaction subnetworks discriminative of late stage human colon cancer.
Mol Cell Proteomics 2009, 8:827-845.
100. Nibbe RK, Koyuturk M, Chance MR: An integrative -omics approach to identify
functional sub-networks in human colorectal cancer. PLoS Comput Biol 2010,
6:e1000639.
101. Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S,
Ghosh D, Pandey A, Chinnaiyan AM: Probabilistic model of the human protein-
protein interaction network. Nat Biotechnol 2005, 23:951-959.
102. Radulovich N, Pham NA, Strumpf D, Leung L, Xie W, Jurisica I, Tsao MS: Differential
roles of cyclin D1 and D3 in pancreatic ductal adenocarcinoma. Mol Cancer 2010,
9:24.
103. Tomasini R, Tsuchihara K, Wilhelm M, Fujitani M, Rufini A, Cheung CC, Khan F, Itie-
Youten A, Wakeham A, Tsao MS, et al: TAp73 knockout shows genomic instability
with infertility and tumor suppressor functions. Genes Dev 2008, 22:2677-2691.
104. Sodek KL, Evangelou AI, Ignatchenko A, Agochiya M, Brown TJ, Ringuette MJ,
Jurisica I, Kislinger T: Identification of pathways associated with invasive behavior
by ovarian cancer cells using multidimensional protein identification technology
(MudPIT). Mol Biosyst 2008, 4:762-773.
105. Przulj N: Biological network comparison using graphlet degree distribution.
Bioinformatics 2007, 23:e177-183.
106. Zhu CQ, Ding K, Strumpf D, Weir BA, Meyerson M, Pennell N, Thomas RK, Naoki K,
Ladd-Acosta C, Liu N, et al: Prognostic and predictive gene signature for adjuvant
chemotherapy in resected non-small-cell lung cancer. J Clin Oncol 2010, 28:4417-
4424.
107. Zhu CQ, Pintilie M, John T, Strumpf D, Shepherd FA, Der SD, Jurisica I, Tsao M-S:
Understanding prognostic gene expression signatures in lung cancer. Clin Lung
Cancer 2009, 10.
108. Jonsson PF, Bates PA: Global topological features of cancer proteins in the human
interactome. Bioinformatics 2006, 22:2291-2297.
109. Syed AS, D'Antonio M, Ciccarelli FD: Network of Cancer Genes: a web resource to
analyze duplicability, orthology and network properties of cancer genes. Nucleic
Acids Res 2010, 38:D670-675.
110. Rambaldi D, Giorgi FM, Capuani F, Ciliberto A, Ciccarelli FD: Low duplicability and
network fragility of cancer genes. Trends Genet 2008, 24:427-430.
111. Li L, Zhang K, Lee J, Cordes S, Davis DP, Tang Z: Discovering cancer genes by
integrating network and functional properties. BMC Med Genomics 2009, 2:61.
112. Aragues R, Sander C, Oliva B: Predicting cancer involvement of genes from
heterogeneous data. BMC Bioinformatics 2008, 9:172.
113. Savas S, Geraci J, Jurisica I, Liu G: A comprehensive catalogue of functional genetic
variations in the EGFR pathway: protein-protein interaction analysis reveals novel
genes and polymorphisms important for cancer research. Int J Cancer 2009,
125:1257-1265.
114. King AD, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering.
Bioinformatics 2004, 20:3013-3020.
115. Spirin V, Mirny LA: Protein complexes and functional modules in molecular
networks. Proc Natl Acad Sci U S A 2003, 100:12123-12128.
116. Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community
structure of complex networks in nature and society. Nature 2005, 435:814-818.
117. Newman ME: Modularity and community structure in networks. Proc Natl Acad Sci
U S A 2006, 103:8577-8582.
118. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human disease
network. Proc Natl Acad Sci U S A 2007, 104:8685-8690.
119. Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst
Biol 2007, 3:88.
120. Lancichinetti A, Fortunato S, Kertész J: Detecting the overlapping and hierarchical
community structure in complex networks. New Journal of Physics 2009, 11.
121. Fortney K, Morgen E, Kotlyar M, Jurisica I: In silico drug screen in mouse liver
identifies candidate calorie restriction mimetics. Rejuvenation Res 2012, 15.
122. Fortney K, Xie W, Kotlyar M, Griesman J, Kotseruba Y, Jurisica I: NetwoRx:
Connecting drugs to networks and phenotypes in S. cerevisiae. (Submitted) 2012.
123. Fortney K, Griesman J, Kotlyar M, Jurisica I: Computationally repurposing drugs for
lung cancer with CMapBatch: candidate therapeutics from an integrative meta-
analysis of cancer gene signatures and chemogenomic data. In Preparation 2012.
124. Kim SK: Common aging pathways in worms, flies, mice and humans. J Exp Biol
2007, 210:1607-1612.
125. Golden TR, Hubbard A, Dando C, Herren MA, Melov S: Age-related behaviors have
distinct transcriptional profiles in Caenorhabditis elegans. Aging Cell 2008, 7:850-
865.
126. Budovsky A, Abramovich A, Cohen R, Chalifa-Caspi V, Fraifeld V: Longevity
network: construction and implications. Mech Ageing Dev 2007, 128:117-124.
127. Promislow DE: Protein networks, pleiotropy and the evolution of senescence. Proc
Biol Sci 2004, 271:1225-1234.
128. Hwang T, Park T: Identification of differentially expressed subnetworks based on
multivariate ANOVA. BMC Bioinformatics 2009, 10:128.
129. Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S: Network-based
analysis of affected biological processes in type 2 diabetes models. PLoS Genet 2007,
3:e96.
130. Xue H, Xian B, Dong D, Xia K, Zhu S, Zhang Z, Hou L, Zhang Q, Zhang Y, Han JD: A
modular network model of aging. Mol Syst Biol 2007, 3:147.
131. Wang X, Dalkic E, Wu M, Chan C: Gene module level analysis: identification to
networks and dynamics. Curr Opin Biotechnol 2008, 19:482-491.
132. Ulitsky I, Shamir R: Identification of functional modules using network topology and
high-throughput data. BMC Syst Biol 2007, 1:8.
133. Budovskaya YV, Wu K, Southworth LK, Jiang M, Tedesco P, Johnson TE, Kim SK: An
elt-3/elt-5/elt-6 GATA transcription circuit guides aging in C. elegans. Cell 2008,
134:291-303.
134. Ulitsky I, Karp R, Shamir R: Detecting Disease-Specific Dysregulated Pathways Via
Analysis of Clinical Expression Profiles. In Proceedings of 12th Int'l Conf Research in
Comp Molecular Biology (RECOMB'08). 2008
135. Dittrich M, Klau G, Rosenwald A, Dandekar T, Müller T: Identifying functional
modules in protein-protein interaction networks: an integrated exact approach.
Bioinformatics 2008, 24:i223-231.
136. Marbach D, Schaffter T, Mattiussi C, Floreano D: Generating realistic in silico gene
networks for performance assessment of reverse engineering methods. J Comput Biol
2009, 16:229-239.
137. Clauset A: Finding local community structure in networks. Phys Rev E Stat Nonlin
Soft Matter Phys 2005, 72:026132.
138. Bair E, Tibshirani R: Semi-supervised methods to predict patient survival from gene
expression data. PLoS Biol 2004, 2:E108.
139. Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA
microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003,
95:14-18.
140. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society 1995,
57:289-300.
141. Brown KR, Otasek D, Ali M, McGuffin MJ, Xie W, Devani B, Toch IL, Jurisica I:
NAViGaTOR: Network Analysis, Visualization and Graphing Toronto.
Bioinformatics 2009, 25:3327-3329.
142. Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S,
Orchard S, Sarkans U, von Mering C, et al: The HUPO PSI's molecular interaction
format--a community standard for the representation of protein interaction data.
Nat Biotechnol 2004, 22:177-183.
143. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima
S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the
environment. Nucleic Acids Res 2008, 36:D480-484.
144. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS,
Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new
developments in phylogenetic classification of proteins from complete genomes.
Nucleic Acids Res 2001, 29:22-28.
145. Chang CC, Lin CJ: LIBSVM: a library for support vector machines. 2001.
146. Alexa A, Rahnenfuhrer J, Lengauer T: Improved scoring of functional groups from
gene expression data by decorrelating GO graph structure. Bioinformatics 2006,
22:1600-1607.
147. Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression
and hybridization array data repository. Nucleic Acids Res 2002, 30:207-210.
148. Lee I, Lehner B, Crombie C, Wong W, Fraser AG, Marcotte EM: A single gene network
accurately predicts phenotypic effects of gene perturbation in Caenorhabditis
elegans. Nat Genet 2008, 40:181-188.
149. Smith ED, Tsuchiya M, Fox LA, Dang N, Hu D, Kerr EO, Johnston ED, Tchao BN, Pak
DN, Welton KL, et al: Quantitative evidence for conserved longevity pathways
between divergent eukaryotic species. Genome Res 2008, 18:564-570.
150. Smola A, Scholkopf B: A tutorial on support vector regression. Statistics and
Computing 2004, 14:199-222.
151. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy--analysis of Affymetrix GeneChip
data at the probe level. Bioinformatics 2004, 20:307-315.
152. Smyth GK: Limma: Linear Models for Microarray Data. In Bioinformatics and
Computational Biology Solutions Using R and Bioconductor. Edited by Gentleman R,
Carey V, Huber W, Irizarry R, Dudoit S. Heidelberg: Springer; 2005: 397-420
153. Kuhn A, Luthi-Carter R, Delorenzi M: Cross-species and cross-platform gene
expression studies with the Bioconductor-compliant R package 'annotationTools'.
BMC Bioinformatics 2008, 9:26.
154. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, Murino L,
Tagliaferri R, Brunetti-Pierri N, Isacchi A, di Bernardo D: Discovery of drug mode of
action and drug repositioning from transcriptional responses. Proc Natl Acad Sci U S
A 2010, 107:14621-14626.
155. Selman C, Kerrison ND, Cooray A, Piper MD, Lingard SJ, Barton RH, Schuster EF,
Blanc E, Gems D, Nicholson JK, et al: Coordinated multitissue transcriptional and
plasma metabonomic profiles following acute caloric restriction in mice. Physiol
Genomics 2006, 27:187-200.
156. Tsuchiya T, Dhahbi JM, Cui X, Mote PL, Bartke A, Spindler SR: Additive regulation of
hepatic gene expression by dwarfism and caloric restriction. Physiol Genomics 2004,
17:307-315.
157. Dhahbi JM, Mote PL, Fahy GM, Spindler SR: Identification of potential caloric
restriction mimetics by microarray profiling. Physiol Genomics 2005, 23:343-350.
158. Corton JC, Apte U, Anderson SP, Limaye P, Yoon L, Latendresse J, Dunn C, Everitt JI,
Voss KA, Swanson C, et al: Mimetics of caloric restriction include agonists of lipid-
activated nuclear receptors. J Biol Chem 2004, 279:46204-46212.
159. Fu C, Hickey M, Morrison M, McCarter R, Han ES: Tissue specific and non-specific
changes in gene expression by aging and by early stage CR. Mech Ageing Dev 2006,
127:905-916.
160. Miller RA, Chang Y, Galecki AT, Al-Regaiey K, Kopchick JJ, Bartke A: Gene
expression patterns in calorically restricted mice: partial overlap with long-lived
mutant mice. Mol Endocrinol 2002, 16:2657-2666.
161. Dhahbi JM, Kim HJ, Mote PL, Beaver RJ, Spindler SR: Temporal linkage between the
phenotypic and genomic responses to caloric restriction. Proc Natl Acad Sci U S A
2004, 101:5524-5529.
162. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of
Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31:e15.
163. Govindan J, Evans M: Pioglitazone in clinical practice: where are we now? Diabetes
Ther 2012, 3:1-8.
164. Terkeltaub RA, Furst DE, Bennett K, Kook KA, Crockett RS, Davis MW: High versus
low dosing of oral colchicine for early acute gout flare: Twenty-four-hour outcome
of the first multicenter, randomized, double-blind, placebo-controlled, parallel-
group, dose-comparison colchicine study. Arthritis Rheum 2010, 62:1060-1068.
165. Ludwig A, Fechner M, Wilck N, Meiners S, Grimbo N, Baumann G, Stangl V, Stangl K:
Potent anti-inflammatory effects of low-dose proteasome inhibition in the vascular
system. J Mol Med (Berl) 2009, 87:793-802.
166. Jagtap P, Soriano FG, Virág L, Liaudet L, Mabley J, Szabó É, Haskó G, Marton A,
Lorigados CB, Gallyas FJ, et al: Novel phenanthridinone inhibitors of poly(adenosine
5'-dipho... Critical Care Medicine 2002, 30:1071-1082.
167. Moskalev AA, Shaposhnikov MV: Pharmacological inhibition of phosphoinositide 3
and TOR kinases improves survival of Drosophila melanogaster. Rejuvenation Res
2010, 13:246-247.
168. Jafari M, Khodayari B, Felgner J, Bussel II, Rose MR, Mueller LD: Pioglitazone: an
anti-diabetic compound with anti-aging properties. Biogerontology 2007, 8:639-651.
169. Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S, Lucau-Danila A,
Anderson K, Andre B, et al: Functional profiling of the Saccharomyces cerevisiae
genome. Nature 2002, 418:387-391.
170. Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, et al: Functional
characterization of the S. cerevisiae genome by gene deletion and parallel analysis.
Science 1999, 285:901-906.
171. Ericson E, Gebbia M, Heisler LE, Wildenhain J, Tyers M, Giaever G, Nislow C: Off-
target effects of psychoactive drugs revealed by genome-wide assays in yeast. PLoS
Genet 2008, 4:e1000151.
172. Blackman RK, Cheung-Ong K, Gebbia M, Proia DA, He S, Kepros J, Jonneaux A,
Marchetti P, Kluza J, Rao PE, et al: Mitochondrial electron transport is the cellular
target of the oncology drug elesclomol. PLoS One 2012, 7:e29798.
173. Hillenmeyer ME, Fung E, Wildenhain J, Pierce SE, Hoon S, Lee W, Proctor M, St Onge
RP, Tyers M, Koller D, et al: The chemical genomic portrait of yeast: uncovering a
phenotype for all genes. Science 2008, 320:362-365.
174. Parsons AB, Lopez A, Givoni IE, Williams DE, Gray CA, Porter J, Chua G, Sopko R,
Brost RL, Ho CH, et al: Exploring the mode-of-action of bioactive compounds by
chemical-genetic profiling in yeast. Cell 2006, 126:611-625.
175. Hillenmeyer ME, Ericson E, Davis RW, Nislow C, Koller D, Giaever G: Systematic
analysis of genome-wide fitness data in yeast reveals novel gene function and drug
action. Genome Biol 2010, 11:R30.
176. Engel SR, Balakrishnan R, Binkley G, Christie KR, Costanzo MC, Dwight SS, Fisk DG,
Hirschman JE, Hitz BC, Hong EL, et al: Saccharomyces Genome Database provides
mutant phenotype data. Nucleic Acids Res 2010, 38:D433-436.
177. Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, Mira NP, Alenquer M,
Freitas AT, Oliveira AL, Sa-Correia I: The YEASTRACT database: a tool for the
analysis of transcription regulatory associations in Saccharomyces cerevisiae.
Nucleic Acids Res 2006, 34:D446-451.
178. Fabrizio P, Hoon S, Shamalnasab M, Galbani A, Wei M, Giaever G, Nislow C, Longo
VD: Genome-wide screen in Saccharomyces cerevisiae identifies vacuolar protein
sorting, autophagy, biosynthetic, and tRNA methylation genes involved in life span
regulation. PLoS Genet 2010, 6:e1001024.
179. Matecic M, Smith DL, Pan X, Maqani N, Bekiranov S, Boeke JD, Smith JS: A
microarray-based genetic screen for yeast chronological aging factors. PLoS Genet
2010, 6:e1000921.
180. Powers RW 3rd, Kaeberlein M, Caldwell SD, Kennedy BK, Fields S: Extension of
chronological life span in yeast by decreased TOR pathway signaling. Genes Dev
2006, 20:174-184.
181. Langfelder P, Horvath S: WGCNA: an R package for weighted correlation network
analysis. BMC Bioinformatics 2008, 9:559.
182. Bolton EE, Wang Y, Thiessen PA, Bryant SH: Chapter 12 PubChem: Integrated
Platform of Small Molecules and Biological Activities. In Annual Reports in
Computational Chemistry. Volume 4. Edited by Wheeler RA, Spellmeyer DC: Elsevier;
2008: 217-241
183. Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR,
Costanzo MC, Dwight SS, Engel SR, et al: Saccharomyces Genome Database: the
genomics resource of budding yeast. Nucleic Acids Res 2012, 40:D700-705.
184. Brown KR, Otasek D, Ali M, McGuffin M, Xie W, Devani B, van Toch IL, Jurisica I:
NAViGaTOR: Network Analysis, Visualization and Graphing Toronto.
Bioinformatics 2009, 25:3327-3329.
185. Fontana L, Partridge L, Longo VD: Extending healthy life span--from yeast to
humans. Science 2010, 328:321-326.
186. Lee KS, Lee BS, Semnani S, Avanesian A, Um CY, Jeon HJ, Seong KM, Yu K, Min KJ,
Jafari M: Curcumin extends life span, improves health span, and modulates the
expression of age-associated aging genes in Drosophila melanogaster. Rejuvenation
Res 2010, 13:561-570.
187. Roemer T, Davies J, Giaever G, Nislow C: Bugs, drugs and chemical genomics. Nat
Chem Biol 2012, 8:46-56.
188. Smith AM, Ammar R, Nislow C, Giaever G: A survey of yeast genomic assays for
drug and target discovery. Pharmacol Ther 2010, 127:156-164.
189. O'Connell BC, Adamson B, Lydeard JR, Sowa ME, Ciccia A, Bredemeyer AL,
Schlabach M, Gygi SP, Elledge SJ, Harper JW: A genome-wide camptothecin
sensitivity screen identifies a mammalian MMS22L-NFKBIL2 complex required for
genomic stability. Mol Cell 2010, 40:645-657.
190. Schlabach MR, Luo J, Solimini NL, Hu G, Xu Q, Li MZ, Zhao Z, Smogorzewska A,
Sowa ME, Ang XL, et al: Cancer proliferation gene discovery through functional
genomics. Science 2008, 319:620-624.
191. Hayat MJ, Howlader N, Reichman ME, Edwards BK: Cancer statistics, trends, and
multiple primary cancer analyses from the Surveillance, Epidemiology, and End
Results (SEER) Program. Oncologist 2007, 12:20-37.
192. Fortney K, Jurisica I: Integrative computational biology for cancer research. Hum
Genet 2011, 130:465-481.
193. Chang M, Smith S, Thorpe A, Barratt MJ, Karim F: Evaluation of phenoxybenzamine
in the CFA model of pain following gene expression studies and connectivity
mapping. Mol Pain 2010, 6:56.
194. Kunkel SD, Suneja M, Ebert SM, Bongers KS, Fox DK, Malmberg SE, Alipour F,
Shields RK, Adams CM: mRNA expression signatures of human skeletal muscle
atrophy identify a natural compound that increases muscle mass. Cell Metab 2011,
13:627-638.
195. Wang G, Ye Y, Yang X, Liao H, Zhao C, Liang S: Expression-based in silico screening
of candidate therapeutic compounds for lung adenocarcinoma. PLoS One 2011,
6:e14573.
196. Ebi H, Tomida S, Takeuchi T, Arima C, Sato T, Mitsudomi T, Yatabe Y, Osada H,
Takahashi T: Relationship of deregulated signaling converging onto mTOR with
prognosis and classification of lung adenocarcinoma shown by two independent in
silico analyses. Cancer Res 2009, 69:4027-4035.
197. Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, Sage J, Butte
AJ: Discovery and preclinical validation of drug indications using compendia of
public gene expression data. Sci Transl Med 2011, 3:96ra77.
198. Zhang S-D, Gant TW: sscMap: An extensible Java application for connecting
small-molecule drugs using gene-expression signatures. BMC Bioinformatics 2009, 10:236.
199. McArt DG, Zhang SD: Identification of candidate small-molecule therapeutics to
cancer by gene-signature perturbation in connectivity mapping. PLoS One 2011,
6:e16382.
200. DTP: Developmental Therapeutics Program NCI/NIH. [http://dtp.nci.nih.gov/]
201. Gorzalczany Y, Gilad Y, Amihai D, Hammel I, Sagi-Eisenberg R, Merimsky O:
Combining an EGFR directed tyrosine kinase inhibitor with autophagy-inducing
drugs: a beneficial strategy to combat non-small cell lung cancer. Cancer Lett 2011,
310:207-215.
202. Ramalingam SS, Maitland ML, Frankel P, Argiris AE, Koczywas M, Gitlitz B, Thomas
S, Espinoza-Delgado I, Vokes EE, Gandara DR, Belani CP: Carboplatin and Paclitaxel
in combination with either vorinostat or placebo for first-line therapy of advanced
non-small-cell lung cancer. J Clin Oncol 2010, 28:56-62.
203. Suda K, Tomizawa K, Fujii M, Murakami H, Osada H, Maehara Y, Yatabe Y, Sekido Y,
Mitsudomi T: Epithelial to mesenchymal transition in an epidermal growth factor
receptor-mutant lung cancer cell line with acquired resistance to erlotinib. J Thorac
Oncol 2011, 6:1152-1161.
204. Willett P: Similarity searching using 2D structural fingerprints. Methods Mol Biol
2011, 672:133-158.
205. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, et
al: DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic
Acids Res 2011, 39:D1035-1041.
206. Seiler KP, George GA, Happ MP, Bodycombe NE, Carrinski HA, Norton S, Brudz S,
Sullivan JP, Muhlich J, Serrano M, et al: ChemBank: a small-molecule screening and
cheminformatics resource database. Nucleic Acids Res 2008, 36:D351-359.
207. Meyer AM, Dwyer-Nield LD, Hurteau GJ, Keith RL, O'Leary E, You M, Bonventre JV,
Nemenoff RA, Malkinson AM: Decreased lung tumorigenesis in mice genetically
deficient in cytosolic phospholipase A2. Carcinogenesis 2004, 25:1517-1524.
208. Weiser-Evans MC, Wang XQ, Amin J, Van Putten V, Choudhary R, Winn RA,
Scheinman R, Simpson P, Geraci MW, Nemenoff RA: Depletion of cytosolic
phospholipase A2 in bone marrow-derived macrophages protects against lung
cancer progression and metastasis. Cancer Res 2009, 69:1733-1738.
209. The PyMOL Molecular Graphics System, Version 1.3, Schrödinger, LLC.
210. Kotlyar M, Fortney K, Jurisica I: Network-based characterization of drug-regulated
genes, drug targets, and toxicity. (Submitted) 2012.
211. Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, Bernabe RR, Bhan MK, Calvo F,
Eerola I, Gerhard DS, et al: International network of cancer genome projects. Nature
2010, 464:993-998.
212. Ledford H: Big science: The cancer genome challenge. Nature 2010, 464:972-974.