
A Modular Approach for Integrative Analysis of Large-scale Gene-expression and Drug-response Data

Supplementary Information

Zoltan Kutalik1,2, Jacques S. Beckmann1,3 and Sven Bergmann1,2∗

1Department of Medical Genetics, University of Lausanne, Switzerland

2Swiss Institute of Bioinformatics, Switzerland

3Service of Medical Genetics, Centre Hospitalier Universitaire Vaudois, Lausanne, Switzerland

*Corresponding author: [email protected], tel.: +41-21 692 5452

1 Data sets

Gene expression data (mRNA: Affy-U133B, GCRMA-normalized) were downloaded from the NCI

website1. For the drug sensitivity data we selected those drugs which appeared both in the NIH

cancer database2 and at least in one of the following: (i) the study by Staunton et al. [8]3; (ii) the

DrugBank database4; (iii) the Connectivity Map database5. The created data files (gene data.txt,

drug data.txt) can be found at http://serverdgm.unil.ch/bergmann/PingPong.html.

Data quality was checked by correlating (i) gene expression profiles that probe the same gene

(Fig. 1) and (ii) replicated drug sensitivity experiments (Fig. 2). This quality control showed that

replicated gene profiles have a median correlation of 0.96. The corresponding median correlation for

replicated drug profiles was 0.75.

1 http://discover.nci.nih.gov/cellminer/loadDownload.do

2 http://dtp.nci.nih.gov/dtpstandard/cancerscreeningdata/index.jsp

3 http://www.broad.mit.edu/mpr/NCI60/

4 http://redpoll.pharmacy.ualberta.ca/drugbank/cgi-bin/drugcard field expl.cgi

5 http://www.broad.mit.edu/cmap/instanceServlet?servletAction=viewAll

2 Model for in-silico data

Pairs of data sets (E, R) were generated such that they contain both independent modules and

co-modules (Fig. 3). The data consist of a gene-expression data matrix E (whose elements $E_{gc}$

represent the expression change of gene g in cell line c) and a drug-response matrix R (whose

elements $R_{dc}$ represent the response of cell line c when treated with drug d).

We assume that the expression- and response-patterns in these data are governed by three types

of “factors”: (1) drug-related “transcription factors” (F1 and F2 in Fig. 3b) induce the expression of

certain genes in cell lines where they are activated, and – at the same time – affect the sensitivity of

these cell-lines to certain drugs. (2) pure “transcription factors” (F3 and F4 in Fig. 3a) induce the

expression of certain genes in cell lines, but do not alter their sensitivity with respect to any drug.

(3) pure “drug-response factors” (F5 and F6 in Fig. 3c) alter the sensitivity of cell-lines to certain

drugs, but do not affect their expression profiles. We note that within our model these “factors”

simply correspond to modules and they should be viewed as an abstract means to describe cell-line

dependent co-expression of gene sets or co-sensitivity to drug sets (or both) and not necessarily as

realistic entities.

We use three matrices to characterize the properties of all these factors: (1) The elements of

the “activity matrix” A indicate for each cell line the factors that are “activated”: i.e. $A_{fc} = 1$ if

factor f is active in cell line c and zero otherwise. Note that for simplicity we only describe the

activity of a factor in a binary fashion. Inactivity includes the possibility that a factor is simply

absent. (2) The elements of the “promoter matrix” P specify for each gene g the factors that

alter its expression if activated (i.e. $P_{fg} = 1$ if so and zero otherwise). (3) The elements of the

“susceptibility matrix” S denote for each drug its “target” factors (i.e. $S_{fd} = 1$ if drug d targets

factor f and zero otherwise). For pure transcription factors $S_{fd} = 0$ for all drugs, while for pure

drug-response factors $P_{fg} = 0$ for all genes.

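The in-silico gene-expression and drug-response matrices are then computed as follows:

\[
E^{(0)}_{gc} = \bigvee_{f} P_{fg} A_{fc} =
\begin{cases}
1 & \text{if } \exists \text{ factor } f \text{ in cell line } c \text{ that induces gene } g \\
0 & \text{otherwise}
\end{cases}
\]

\[
R^{(0)}_{dc} = \bigvee_{f} S_{fd} A_{fc} =
\begin{cases}
1 & \text{if } \exists \text{ factor } f \text{ in cell line } c \text{ that is sensitive to drug } d \\
0 & \text{otherwise}
\end{cases}
\]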

Finally Gaussian noise is added to the matrices:

\[
E = E^{(0)} + \varepsilon, \qquad R = R^{(0)} + \eta,
\]

where

\[
\varepsilon, \eta \sim \mathcal{N}(0, \sigma).
\]

We varied two parameters in this model: (1) the level σ (“noise”), and (2) the number of factors

per gene or drug (“complexity”).

We used these models to simulate data sets in order to investigate and assess different approaches

(see Section 3 for details) proposed to handle multiple data sets. In our simulations we allowed

for a total of NF = 18 factors (cf. Fig. 3d): Six drug-related “transcription factors”, six pure

“transcription factors” and six pure “drug-response factors” governing the expression of NG = 400

genes and the response with respect to ND = 300 drugs within NC = 140 cell lines. About one

third of the drugs and the genes were (in sets of different sizes) attributed to at least one

“drug-related transcription factor”. Another third was attributed to at least one “pure factor”, and the

last third was not attributed to any factor. We created different data sets for testing how well the

different algorithms can deal with noisy and complex data: We used σ = 0, 0.2, 0.4, 0.6, 0.8, 1 with

minimal complexity (one factor per gene or drug), and between one and five factors per gene or

drug with an intermediate level of noise (σ = 0.5).
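The generative model of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate` and the toy matrices are hypothetical, σ is interpreted as the standard deviation of the Gaussian noise, and the combination over factors is taken to be a logical OR (an entry is 1 as soon as one active factor is attributed to the gene or drug).

```python
import random

def simulate(A, P, S, sigma, seed=0):
    """A: factors x cell lines, P: factors x genes, S: factors x drugs (all 0/1).

    Returns the noise-free matrices E0, R0 and their noisy versions E, R.
    """
    rng = random.Random(seed)
    n_f, n_c = len(A), len(A[0])
    n_g, n_d = len(P[0]), len(S[0])

    def clean(M, n_rows):
        # Entry (r, c) is 1 if some factor f attributed to row r is active in cell line c.
        return [[int(any(M[f][r] and A[f][c] for f in range(n_f)))
                 for c in range(n_c)] for r in range(n_rows)]

    def noisy(X):
        # Assumption: sigma is the standard deviation of the additive noise.
        return [[x + rng.gauss(0.0, sigma) for x in row] for row in X]

    E0, R0 = clean(P, n_g), clean(S, n_d)
    return E0, R0, noisy(E0), noisy(R0)

# Toy example: factor 0 is drug-related, factor 1 a pure transcription factor,
# factor 2 a pure drug-response factor; gene 2 and drug 2 belong to no factor.
A = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0]]
P = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
S = [[1, 0, 0], [0, 0, 0], [0, 1, 0]]
E0, R0, E, R = simulate(A, P, S, sigma=0.5)
```

In this toy setting gene 0 and drug 0 share the cell lines in which factor 0 is active, which is the kind of co-module the algorithms of Section 3 are meant to recover.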

3 The algorithms

3.1 ISA(E)

The Iterative Signature Algorithm (ISA) has previously been described in great detail [4]. Briefly,

the ISA starts with an initial set of genes, and it identifies from a set of expression data the subset

of samples for which these genes have a coherent expression pattern. Next, it identifies the set

of genes that are co-expressed in those samples. These two steps are iterated until convergence

to stable sets of genes and samples. Such combined sets are referred to as transcription modules.

By definition they consist of co-expressed genes as well as the experimental conditions for which

this coherent expression is the most pronounced. Using many different initial starting points, the

ISA generates a compendium of modules that reduces the complexity of gene expression data and

represents the transcription program in terms of its coherent units [4].

3.2 ISA(E ·R⊤)

The first approach to identify gene-drug associations is to apply the ISA on the correlation matrix

$\mathrm{corr}(E,R) \in \mathbb{R}^{N_G \times N_D}$. (Note that for column normalized matrices E and R: corr(E,R) = E·R⊤.)

From E·R⊤ the ISA generates gene-drug modules, which however have no reference to the involved

cell-lines, because they have been summed over when calculating the correlation matrix.

ISA(E·R⊤) has the advantage that the modularization is directly applied to a processed dataset

containing only possible signatures of drug-gene associations. Yet, a limitation is that the pair-wise

correlations E ·R⊤ are computed across all cell-lines. As a consequence, correlations across only a

few cell-lines may be too weak to stand out against incoherent contributions from the remaining

cell-lines (due to complexity in regulation or noise).
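As a small illustration of the identity corr(E,R) = E·R⊤, the following Python sketch computes the gene-drug correlation matrix as a matrix product. It assumes that each gene and drug profile is centered and scaled to unit length across cell lines before multiplication; the helper names `standardize` and `corr_matrix` and the toy values are hypothetical.

```python
import math

def standardize(row):
    """Center a profile and scale it to unit Euclidean length.

    Assumes the profile is not constant (otherwise the norm would be zero).
    """
    m = sum(row) / len(row)
    centered = [x - m for x in row]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]

def corr_matrix(E, R):
    """Genes x drugs matrix of Pearson correlations, computed as En * Rn^T."""
    En = [standardize(r) for r in E]
    Rn = [standardize(r) for r in R]
    return [[sum(a * b for a, b in zip(e, r)) for r in Rn] for e in En]

# Two genes and two drugs measured over four cell lines (toy values).
E = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0]]
R = [[2.0, 4.0, 6.0, 8.0], [8.0, 6.0, 4.0, 2.0]]
C = corr_matrix(E, R)
```

Because every entry of C sums over all cell lines, the cell-line dimension is no longer available in the resulting modules, as noted above.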

3.3 ISA(E) & ISA(R)

An alternative approach is to first generate gene-condition modules $\{(g_i, c_i)\}$ by applying the ISA

to E and drug-condition modules $\{(d_j, c_j)\}$ by applying the ISA to R. (Here the vectors

denote the associated gene-scores $g_i$, sample-scores $c_i$ ($c_j$) and drug-scores $d_j$, respectively, cf.

notations in Section 3.6.) These modules are then matched using their condition vectors. This was

done by computing all the pairwise correlations:

\[ H_{ij} = \mathrm{corr}(c_i, c_j) \]

Then for all i, j with $|H_{ij}| > h_{\min}$ the gene-condition module i was matched with drug-condition

module j to form a gene-condition-drug co-module. Various thresholds ($h_{\min}$ = 0.2, 0.4, 0.6, 0.8)

were tested and 0.8 was selected since it gave the best Receiver Operating Characteristic (ROC) [9]

performance, cf. Fig. 2 of the main paper.

ISA(E)&ISA(R) overcomes the above-mentioned limitation of ISA(E ·R⊤) by first

detecting independently coherent patterns of gene-expression and drug-response which may occur

over any subset of cell-lines. Subsequently, patterns which are consistent across the same subset

of cell-lines in both datasets are matched up. The drawback of this method is that some of these

subsets exhibiting coherent patterns in both datasets may not be identified, either because they

converge into disparate sample sets when iterated independently in ISA(E) and ISA(R), or because

our heuristic matching procedure (described above) filters out potential signals from

modules that have several or only weak links across the two datasets.
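The matching step of this section can be sketched as follows. The code is a minimal illustration, not the original implementation: modules are represented as hypothetical (scores, condition-scores) pairs, and `pearson` and `match_modules` are illustrative names.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cx, cy = [a - mx for a in x], [b - my for b in y]
    num = sum(a * b for a, b in zip(cx, cy))
    den = math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))
    return num / den

def match_modules(gene_mods, drug_mods, h_min=0.8):
    """Pair gene- and drug-modules whose condition vectors satisfy |H_ij| > h_min."""
    pairs = []
    for i, (_, ci) in enumerate(gene_mods):
        for j, (_, cj) in enumerate(drug_mods):
            h = pearson(ci, cj)
            if abs(h) > h_min:
                pairs.append((i, j, h))
    return pairs

# Toy modules over five cell lines: (gene- or drug-scores, condition-scores).
gene_mods = [([1.0, 0.5], [1.0, 1.0, 0.0, 0.0, 0.0]),
             ([0.3, 0.9], [0.0, 0.0, 0.0, 1.0, 1.0])]
drug_mods = [([0.8], [0.9, 1.1, 0.0, 0.0, 0.1]),
             ([0.6], [0.0, 1.0, 0.0, 1.0, 0.0])]
pairs = match_modules(gene_mods, drug_mods, h_min=0.8)
```

Only the first gene-module and the first drug-module share their cell lines closely enough to pass the threshold; module pairs with several weaker links are discarded, which is the limitation discussed above.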

3.4 Cluster(E ·R⊤)

Genes (rows of the E ·R⊤ matrix) were subjected to hierarchical clustering, using the Matlab

implementation with default parameters. The cutoff was set such that k clusters were identified.


For the in-silico data we used k = 7 corresponding to the six co-modules plus one remaining cluster

of unchanged genes. For the NCI-60 data k was set to 100. (k = 50 and 200 were also examined,

but yielded worse ROC performance than for k = 100.)

We then performed the same clustering procedure for drugs (corresponding to the columns of

the E ·R⊤ matrix). Next, gene- and drug-clusters were matched up as follows: Let $C^{(G)}_i$ denote the i-th

gene-cluster, $C^{(D)}_j$ the j-th drug-cluster, and

\[
V_{ij} = \frac{\sum_{g \in C^{(G)}_i,\, d \in C^{(D)}_j} \left| E \cdot R^\top \right|_{g,d}}{|C^{(G)}_i| \cdot |C^{(D)}_j|} .
\]

Here |C| refers to the number of elements of a set C and $\left| E \cdot R^\top \right|_{g,d}$ to the absolute values of the

correlation matrix.

We selected $i^*$ and $j^*$ such that $V_{i^* j^*}$ was maximal and matched up gene-cluster $C^{(G)}_{i^*}$ and

drug-cluster $C^{(D)}_{j^*}$. Then, row $i^*$ and column $j^*$ were excluded from matrix V. This procedure was

repeated until all gene- and drug-clusters were matched up.
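The greedy matching described above can be sketched in Python as follows. The function name `match_clusters`, the cluster representation (lists of row and column indices) and the toy matrix are assumptions made for illustration only.

```python
def match_clusters(M, gene_clusters, drug_clusters):
    """Greedily pair gene- and drug-clusters by mean absolute correlation.

    M[g][d] holds the entries of the gene-drug correlation matrix; each
    cluster is a list of row (gene) or column (drug) indices.
    """
    # V[(i, j)]: average of |M[g][d]| over gene-cluster i and drug-cluster j.
    V = {(i, j): sum(abs(M[g][d]) for g in Ci for d in Dj) / (len(Ci) * len(Dj))
         for i, Ci in enumerate(gene_clusters)
         for j, Dj in enumerate(drug_clusters)}
    matches = []
    while V:
        i_star, j_star = max(V, key=V.get)
        matches.append((i_star, j_star))
        # Exclude row i_star and column j_star before the next round.
        V = {(i, j): v for (i, j), v in V.items()
             if i != i_star and j != j_star}
    return matches

# Toy correlation matrix for three genes and three drugs.
M = [[0.9, 0.8, 0.1],
     [-0.7, -0.9, 0.0],
     [0.1, 0.2, 0.6]]
gene_clusters = [[0, 1], [2]]
drug_clusters = [[2], [0, 1]]
matches = match_clusters(M, gene_clusters, drug_clusters)
```

Here gene-cluster 0 is paired with drug-cluster 1 first (V = 0.825), and the remaining clusters are paired afterwards.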

3.5 Regress(E,R)

Along the same lines as the authors described for transcription factor - condition “associations” in

[6], we computed “gene-drug association scores” as follows. For drug d and gene g the “prediction

score” was defined as the regression coefficient when the expression of gene g was regressed on drug

d (allowing for an additional constant term). Normalizing both data sets such that the mean gene

expression (and drug response) value over all cell lines is zero, the regression coefficient can be

obtained as

\[
\beta_{gd} = \frac{E_{g,\cdot}\, R_{d,\cdot}^{\top}}{E_{g,\cdot}\, E_{g,\cdot}^{\top}}
\]

where $M_{j,\cdot}$ refers to the j-th row of matrix M.
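A minimal Python sketch of this score is given below. It evaluates the formula exactly as written, i.e. the centered cross-product of the two profiles divided by the centered sum of squares of the expression profile; the names `center` and `beta` and the toy profiles are hypothetical.

```python
def center(row):
    """Subtract the mean over all cell lines."""
    m = sum(row) / len(row)
    return [x - m for x in row]

def beta(E_g, R_d):
    """Association score of gene g and drug d from the formula above.

    Assumes the expression profile E_g is not constant.
    """
    e, r = center(E_g), center(R_d)
    return sum(a * b for a, b in zip(e, r)) / sum(a * a for a in e)

# Toy profiles over four cell lines: the response is 2 * expression + 1.
E_g = [1.0, 2.0, 3.0, 4.0]
R_d = [3.0, 5.0, 7.0, 9.0]
b = beta(E_g, R_d)
```

Centering both profiles first is what allows the constant term to be dropped from the computation.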

3.6 Ping-pong algorithm

For a given threshold combination (tC : condition threshold, tG: gene threshold, tD: drug threshold)

the Ping-pong algorithm (PPA) is summarized in the following pseudo-code:

• $n = 0$; $g^{(0)} = \mathrm{random}(N_G) \in [0,1]^{N_G}$ (initial random seed)

• while $\left( |\hat{g}^{(n)} - \hat{g}^{(n-1)}| + |\hat{d}^{(n)} - \hat{d}^{(n-1)}| + |\hat{c}^{(n)} - \hat{c}^{(n-1)}| + |\hat{c}^{(n)} - \hat{\tilde{c}}^{(n)}| > \epsilon \right)$

1. $c = E_G^{\top} \cdot \hat{g}^{(n)}$; $c^{(n+1)}_j = \begin{cases} c_j & \text{if } |c_j - \mu(c)| > t_C\, \sigma(c) \\ 0 & \text{otherwise} \end{cases}$ $(j = 1, \dots, N_C)$

2. $d = R_C \cdot \hat{c}^{(n)}$; $d^{(n+1)}_k = \begin{cases} d_k & \text{if } |d_k - \mu(d)| > t_D\, \sigma(d) \\ 0 & \text{otherwise} \end{cases}$ $(k = 1, \dots, N_D)$

3. $\tilde{c} = R_D^{\top} \cdot \hat{d}^{(n)}$; $\tilde{c}^{(n+1)}_l = \begin{cases} \tilde{c}_l & \text{if } |\tilde{c}_l - \mu(\tilde{c})| > t_{\tilde{C}}\, \sigma(\tilde{c}) \\ 0 & \text{otherwise} \end{cases}$ $(l = 1, \dots, N_C)$

4. $g = E_C \cdot \hat{\tilde{c}}^{(n)}$; $g^{(n+1)}_m = \begin{cases} g_m & \text{if } |g_m - \mu(g)| > t_G\, \sigma(g) \\ 0 & \text{otherwise} \end{cases}$ $(m = 1, \dots, N_G)$

5. $n = n + 1$

• $g^* = g^{(n)}$; $c^* = c^{(n)}$; $d^* = d^{(n)}$

Here |x|, µ(x) and σ(x) denote the norm, the mean value and the standard deviation of the

components $x_i$ of the vector x. $\hat{x} = x/|x|$ is a normalized vector. $E_G$ and $E_C$ refer to the

expression matrix normalized across genes and conditions, respectively (cf. [4]). Similarly $R_D$ and

$R_C$ denote the response matrix normalized across drugs and conditions, respectively.
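The four steps of the pseudo-code can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions rather than the authors' implementation: normalization is taken to be z-scoring along the stated dimension, each step uses the vector produced by the preceding step of the same pass, a single condition threshold is used for both condition vectors, and an iteration cap (`max_iter`) is added as a safeguard. All function names and the noise-free toy data are hypothetical.

```python
import math
import random

def zscore_rows(M):
    """Scale each row to zero mean and unit standard deviation (constant rows -> 0)."""
    out = []
    for row in M:
        m = sum(row) / len(row)
        s = math.sqrt(sum((x - m) ** 2 for x in row) / len(row)) or 1.0
        out.append([(x - m) / s for x in row])
    return out

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def threshold(v, t):
    """Keep components more than t standard deviations away from the mean."""
    m = sum(v) / len(v)
    s = math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    return [x if abs(x - m) > t * s else 0.0 for x in v]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def ping_pong(E, R, t_g, t_c, t_d, eps=0.02, max_iter=100, seed=0):
    """E: genes x conditions, R: drugs x conditions. Returns normalized (g, c, d)."""
    E_C = zscore_rows(E)               # each gene normalized across conditions
    E_G_T = zscore_rows(transpose(E))  # each condition normalized across genes
    R_C = zscore_rows(R)
    R_D_T = zscore_rows(transpose(R))
    rng = random.Random(seed)
    g = unit([rng.random() for _ in E])  # random gene seed
    c = d = None
    for _ in range(max_iter):
        c_new = unit(threshold(matvec(E_G_T, g), t_c))      # step 1: genes -> conditions
        d_new = unit(threshold(matvec(R_C, c_new), t_d))    # step 2: conditions -> drugs
        c_tld = unit(threshold(matvec(R_D_T, d_new), t_c))  # step 3: drugs -> conditions
        g_new = unit(threshold(matvec(E_C, c_tld), t_g))    # step 4: conditions -> genes
        converged = c is not None and (
            dist(g_new, g) + dist(d_new, d) + dist(c_new, c)
            + dist(c_new, c_tld)) <= eps
        g, c, d = g_new, c_new, d_new
        if converged:
            break
    return g, c, d

# Noise-free toy data: genes 0-2 and drugs 0-1 respond in conditions 0-1 only.
E = [[1.0 if gi < 3 and ci < 2 else 0.0 for ci in range(6)] for gi in range(8)]
R = [[1.0 if di < 2 and ci < 2 else 0.0 for ci in range(6)] for di in range(6)]
g, c, d = ping_pong(E, R, t_g=1.0, t_c=1.0, t_d=1.0)
```

On this toy input the non-zero entries of the returned vectors mark the planted co-module; on real data many seeds and threshold combinations would be needed, as discussed in the comments that follow.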

The following comments are in order:

• In the first step the algorithm is initialized with an arbitrary gene-vector $g^{(0)}$ containing

random scores $g^{(0)}_i \in [0, 1]$ for each gene. We note that this “seeding” could also commence

with a drug- or condition-vector. The iterative scheme of the PPA has to be applied to

many different seed vectors in order to capture most co-modules. For the PPA (similarly

to the ISA), sampling a relatively small fraction (typically a few hundred) of all possible

seeds is sufficient to reveal all “major” co-modules (those containing many elements that

exhibit coherent patterns). Significant speed-up for constructing a comprehensive co-module

compendium is achieved by aborting iterations that converge towards co-modules that have

already been identified. Further search heuristics (like the one suggested in PISA [7]) can be

used also for the PPA.

• The first four steps in the while-loop (corresponding to the numbered arrows in Fig. 1c of the

main paper) are the algorithmic core of the Ping-pong algorithm. Each step consists of a linear

mapping induced by either the expression- or response-data, followed by thresholding, just as

in the ISA. Here we use both the positive and the negative tails of the score distributions, but

a single-tail cutoff is possible as well. We stress that this thresholding is instrumental for the

identification of weak signals that couple the two datasets, and distinguishes the PPA from

related linear methods like Singular Value Decomposition (SVD) [2], Generalized Singular

Value Decomposition [3] or Partial Least Square Regression (PLS) [1].


• The condition in the while-loop requires that all (normalized) co-module scores do not change

anymore, and is only checked after two iterations. We use ε = 0.02 as threshold. In fact

one may also observe “limit cycles”, in which case we use the union of all cycles to define the

components of the co-modules.

• The common dimension (cell-lines in the case of the NCI-60 data) plays a special role, because

the corresponding score-vectors can be computed both from the genes ($c$) and the drugs ($\tilde{c}$).

In principle one can use two different thresholds for selecting the non-zero elements of these

vectors, but we set $t_C = t_{\tilde{C}}$. Note that we require that $\tilde{c}$ converges to $c$ in our definition of a

co-module, in order to guarantee that the co-expression and co-response occur over the same

conditions.

• Many characteristics of the PPA are akin to the ISA: Each refinement relies on scores that

are computed for genes, drugs or cell-lines as well as thresholds that are applied to these

scores in order to select the subsets that are sufficiently coherent. This thresholding is very

efficient in separating potential signals from noise [4]. Varying the three threshold parameters

independently allows for the identification of co-modules at different “resolutions”: Using

more stringent thresholds will usually give rise to co-modules with fewer elements that exhibit

more coherent expression- and response-patterns than those of co-modules identified at lower

thresholds. In our implementation $t_G$, $t_D$ and $t_C$ were varied between 1 and 3 with step size

0.2. The number of seeds per threshold combination was set to 600.

• Let $N_C$, $N_G$ and $N_D$ denote the number of cell lines, genes and drugs, respectively. If the PPA

uses N seed vectors its run time is $O(N \cdot N_C (N_G + N_D))$. Thus PPA is significantly faster

than any gene-drug correlation based iterative method (such as SVD) whose computational

complexity is $O(N_C \cdot N_G \cdot N_D) + O(N \cdot N_G \cdot N_D)$. Here the first term represents the computational

effort to compute the drug-gene correlation matrix, while the second term stands for

the number of elementary calculations to perform one iteration to obtain N “clusters”. For

gene-expression and drug-response data generally $N, N_C \ll N_D, N_G$ holds, which guarantees

the PPA’s computational advantage. For our data set configuration the theoretical complexity

of the PPA is almost two orders of magnitude smaller than that of a correlation-based

iterative method.
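The two complexity expressions can be compared numerically with a short Python sketch. The sizes used here are assumptions chosen for illustration: the probe, drug and cell-line counts are those reported for the PPA modules in Supplementary Table 1, and N is set to the 600 seeds used per threshold combination; the exact configuration behind the statement above is not given in this section.

```python
def ppa_cost(N, NC, NG, ND):
    """Leading-order operation count of the PPA: N * NC * (NG + ND)."""
    return N * NC * (NG + ND)

def corr_cost(N, NC, NG, ND):
    """Correlation-based iterative method: NC*NG*ND + N*NG*ND."""
    return NC * NG * ND + N * NG * ND

# Assumed sizes (see lead-in): seeds, cell lines, gene probes, drugs.
N, NC, NG, ND = 600, 59, 22282, 5143
ratio = corr_cost(N, NC, NG, ND) / ppa_cost(N, NC, NG, ND)
print(f"correlation-based / PPA operation count: {ratio:.1f}")
```

With these assumed sizes the ratio is a little below 100, in line with the "almost two orders of magnitude" quoted above.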


4 Enrichment score calculation

Let $G_m$ denote the set of genes in module m and G denote the set of genes in a GO category (e.g.

“ion transport”). Define $n = |G_m|$, $k = |G|$, $x = |G_m \cap G|$ and l = total number of genes in the

whole data set. Then, assuming that x follows a hypergeometric distribution, the probability of

observing an intersection of size at least x between G and $G_m$ is

\[
p(G, G_m) = \sum_{i=x}^{\min(k,n)} \frac{\binom{k}{i} \binom{l-k}{n-i}}{\binom{l}{n}} .
\]

We used Bonferroni-corrected significance scores $-\log_{10}(l \cdot p(G,G_m))$ and module m was declared

to be enriched for a GO category including genes G if $p(G,G_m) < 0.05/l$.
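The enrichment calculation can be sketched with the Python standard library. The function names and the toy numbers are hypothetical; the formulas follow the definitions above.

```python
from math import comb, log10

def enrichment_p(x, n, k, l):
    """Hypergeometric tail: probability of an overlap of at least x genes.

    n: module size, k: category size, l: total number of genes.
    """
    total = sum(comb(k, i) * comb(l - k, n - i)
                for i in range(x, min(k, n) + 1))
    return total / comb(l, n)

def enrichment_score(x, n, k, l):
    """Bonferroni-corrected significance score -log10(l * p)."""
    return -log10(l * enrichment_p(x, n, k, l))

def is_enriched(x, n, k, l, alpha=0.05):
    return enrichment_p(x, n, k, l) < alpha / l

# Toy example: 20 genes in total, a category of 5 genes, a module of 4 genes,
# 3 of which fall into the category.
p = enrichment_p(x=3, n=4, k=5, l=20)
score = enrichment_score(x=3, n=4, k=5, l=20)
enriched = is_enriched(x=3, n=4, k=5, l=20)
```

In this toy case p is about 0.032, which does not pass the corrected cut-off 0.05/l = 0.0025.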

5 More on the Analysis of the NCI-60 data

Our in-silico analysis demonstrated the superiority of our modular methods, and of the Ping-pong

algorithm in particular, with respect to more standard tools when analyzing artificial data with

a modular structure and significant levels of noise. While we believe that real gene-expression

and drug-response data that were gathered by high-throughput methods in general exhibit these

features, we still needed to demonstrate that our methods indeed perform better also for real data.

We chose the NCI-60 data for this end for the following reasons:

• We wanted to address the challenge of integrating phenotypic data from very different levels.

Indeed cellular phenotypes like drug-response are very much downstream from molecular

phenotypes like mRNA levels, adding to the challenge that possible links between them may

often be difficult, if not impossible, to identify.

• The NCI-60 cell lines have been studied extremely well. Several generations of microarrays

have been employed for transcriptional profiling, and the data from the most recent Affymetrix

U133 chip became publicly available when we started our study. Analysis of replicates gave

very good consistency for genes (Fig. 1) and good agreement for drugs (Fig. 2). A further

advantage of this study is the fact that the cell-lines are cultured in uniform conditions and

are exposed to one drug at a time (at the price of studying immortalized cells which might

induce artifacts). Moreover, follow-up studies to validate possible leads resulting from our

analysis are obviously more feasible for cultured samples.

• A relatively large amount of information is available (e.g. from DrugBank6) on FDA-approved

drugs (target-genes or pathways, affected biological processes, etc.) which can be combined

with the annotations of genes (GO, KEGG) and tissue specificity of the cell-lines in order to

shed light on possible links to mechanisms of action.

• The previous point allowed us to study the bias among the functional categorizations

associated only with co-modules and those obtained for all transcription modules (obtained

from modular analysis of the expression data alone). An interesting observation we made is

that while 40% of the transcription modules obtained purely from gene expression data are

highly enriched for a particular type of tissue, only 6% of the co-modules are enriched for a

specific tissue. This indicates that many of the possible underlying mechanisms connecting

gene expression with drug sensitivity may not be tissue specific.

Table 1 highlights some features of the algorithms we applied to the NCI-60 data. The different

approaches capture different aspects of the data. Note that ISA(E·R⊤) and PPA modules had on

average 0.55 correlation (for both gene and drug part), PPA and ISA(E) had 0.47 correlation on

average. These average correlations were computed as follows: For each PPA module one obtained

the best correlating (gene and drug score-wise) ISA(E·R⊤) module and their correlation represents

the best match correlation for that particular PPA module. We then computed these best match

correlations for every PPA module and averaged them out.

As highlighted in Fig. 3 of the main paper, the co-modules of the Ping-pong algorithm appear

to be most sensitive to drug-related processes. Yet, for other processes our methods are indeed

often complementary. This is shown in the Venn diagram in Fig. 4, which highlights how all biological

processes that were enriched in modules are distributed across those identified by the

PPA, ISA(E) and ISA(E ·R⊤), respectively.

ISA(E ·R⊤) identified modules enriched with cancer related processes with higher accuracy

than the PPA in 60% of the cases (18/30), while for drug-related processes the PPA achieved

higher accuracy than ISA(E ·R⊤) in 61% of the cases (23/38). For 19 out of 30 (63%) cancer-

related processes the co-modules found by the PPA gave a better match than those obtained by

the ISA(E), while for the drug-related processes 26 of 38 (68%) were covered more accurately.

ISA(E)&ISA(R) was able to identify only one drug-related process and none of the cancer-related

processes with higher significance than the PPA. For most of the cancer-related (25/30) and the

drug-related processes (32/38) the co-modules obtained with the PPA matched better than those

obtained by clustering E ·R⊤. We also investigated what fraction of the modules is at all enriched

for at least one cancer-development or drug-target related process. While nearly half (49.2%) of

the PPA modules were enriched for at least one related category, only 29.5% of the ISA(E ·R⊤)

modules had the same property.

6 We cannot exclude that some of drug-gene associations in the DrugBank database are influenced by the NCI-60 data (e.g. from testing drug-gene interaction hypotheses generated by these or related data). Nevertheless, since DrugBank consists of a collection of data from a large number of different studies we believe that such a bias is probably small and that there is little if any circularity in testing our predictions using DrugBank. Moreover, even in the presence of such a bias it is unlikely to favor our modular approaches over simpler algorithms, so our findings pertaining to their (superior) performance are valid in any case.

                                      ISA(R)    ISA(E)    ISA(E·R⊤)    PPA

CPU time (min)                           4.8      26.4       769.2     96.7
# gene probes in modules                   –     22282       22197    22282
# drugs in modules                      5061         –        4424     5143
# cell-lines in modules                   58        58           –       59
total modules (corr<.99)                 895       616        2251      859
total modules (corr<.8)                  160       182         570      177
average module self-similarity          0.34      0.17        0.03     0.04
total GO/BP categories                     –       379         771      539
total GO/MF categories                     –       168         327      224
total GO/CC categories                     –       161         205      157
total pathways                             –        72         127      108
% enriched for BP                          –        82          59       79
% enriched for MF                          –        78          57      729
% enriched for CC                          –        82          43       69
% enriched for tissues                    33        50           –       69
% enriched for chromosome                  –        95          68       96
% enriched for chromosome segment          –        84          79       92
% enriched for drug-BP                    18         –           3        1
% enriched for drug-MF                    21         –           5        2
% enriched for drug-CC                     3         –           1        5
% enriched for drug-pathway                8         –           3      < 1
% enriched for drug-gene                  19         –           3        1
% enriched for drug category               2         –           2      < 1

Supplementary Table 1: Summary of the features for the different algorithms.

Fig. 5, in a similar fashion to Fig. 3 of the main paper, compares the best enrichment p-values

for KEGG pathways obtained by the PPA with the p-values resulting from the other

tested algorithms. This analysis confirms that the PPA identifies links between drugs and gene sets

pertaining to cancer or drugs (according to KEGG) with better specificity than clustering E ·R⊤,

ISA(E) and ISA(E)&ISA(R). Only ISA(E ·R⊤) provides similar links with similar sensitivity, but

at the price of lower significance (2.5 times as many modules are needed), no reference to the

relevant cell-lines and higher computational cost.

6 Additional information on ROC analysis

The evaluation of drug-gene association predictions (i.e. $s_{gd}$) was performed by computing the

Receiver Operating Characteristic (ROC) curve [9] given the corresponding “truth” matrix T of

the underlying model ($T_{gd} = 1$ if gene g and drug d are associated, and 0 otherwise). That is, in

order to transform our association scores $s_{gd}$ into binary values we continuously varied a cut-off t

across the range of $s_{gd}$, successively switching on putative drug-gene associations ($P^{(t)}_{gd} = 1$ if $s_{gd} > t$

and zero otherwise). For each cut-off t our prediction matrix $P^{(t)}$ was compared to the “truth”

matrix T to obtain the true- and the false-positive rate. These rates were then plotted against each

other constituting the ROC curve.
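The construction of the ROC curve can be sketched as follows. The sketch assumes that the score and truth matrices have been flattened into two lists over all gene-drug pairs and that both classes occur; lowering the cut-off just below each distinct score is implemented as a "score at least t" test. The function names and toy values are hypothetical.

```python
def roc_curve(scores, truth):
    """Return the ROC curve as a list of (FPR, TPR) points.

    scores, truth: flat lists over all gene-drug pairs; truth entries are 0/1.
    """
    pos = sum(truth)
    neg = len(truth) - pos
    points = [(0.0, 0.0)]
    # Lower the cut-off step by step, switching on more predicted associations.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, truth) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, truth) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def roc_integral(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example with six gene-drug pairs, three of which are true associations.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
truth = [1, 1, 0, 1, 0, 0]
curve = roc_curve(scores, truth)
area = roc_integral(curve)
```

The area computed here is the quantity referred to as the ROC integral elsewhere in this document.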

6.1 Connectivity Map data generation and “truth” matrix computation

Replicates in the Connectivity Map data were averaged out and subsequently the log2 ratio was

computed. For the gene-drug association “truth matrix” genes were assigned to be associated with

a particular drug if their expression changed at least two-fold upon treatment (cf. Fig. 2b of the

paper).
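A possible reading of this procedure is sketched below. The input layout (per-gene lists of replicate intensities for treated and control samples) and the function name `truth_column` are assumptions for illustration; "at least two-fold" is interpreted as an absolute log2 ratio of at least 1, covering both up- and down-regulation.

```python
import math

def truth_column(treated, control, fold=2.0):
    """Return one 0/1 entry per gene for a single drug.

    treated[g], control[g]: replicate intensities of gene g (hypothetical layout).
    """
    column = []
    for t_reps, c_reps in zip(treated, control):
        t_mean = sum(t_reps) / len(t_reps)  # average the replicates first
        c_mean = sum(c_reps) / len(c_reps)
        log_ratio = math.log2(t_mean / c_mean)
        column.append(1 if abs(log_ratio) >= math.log2(fold) else 0)
    return column

# Toy example: gene 0 changes four-fold upon treatment, gene 1 barely changes.
treated = [[400.0, 420.0], [100.0, 110.0]]
control = [[100.0, 105.0], [100.0, 105.0]]
column = truth_column(treated, control)
```

Stacking such columns for all drugs would give a gene-drug matrix of the kind used as “truth” in the ROC analysis.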

6.2 ROC curves for FPR range of 0 to 0.2

All ROC curves presented in Figure 2 of the main paper are shown in Fig. 6 for an extended range

of the FPR.

6.3 Additional methods tested for gene-drug association prediction

We applied linear regression to generate predictions for gene-drug associations in a similar fashion

as in [6]. ROC analysis was performed as described in the main paper using DrugBank data as

“truth”. The resulting ROC integral (i.e. the area under the ROC curve) was better only

than that of hierarchical clustering of the E ·R⊤ matrix (cf. Fig. 7). We also performed Singular Value

Decomposition (SVD) of the E·R⊤ matrix and applied k-means clustering on the eigenvalues to determine

which eigenvectors to include for prediction. The ROC integral for SVD(E ·R⊤) came second

highest after that of the PPA, but in the FDR < 10% regime only the PPA produces a significant

bias between TPR and FPR.


All modular and non-modular approaches were tested also on the drug-gene association matrix

provided by the Connectivity Map. Resulting ROC curves are presented in Fig. 8.

A major drawback of SVD based methods like Co-inertia Analysis [5] is that they do not produce

clusters, which would be essential for understanding and interpreting large-scale data sets. Also,

in our experience very few eigenvalues are significant, hence only a very limited number (typically

less than 5) of eigenvectors explain the majority of the variance in the data.

6.4 Testing Connectivity Map against DrugBank

The Connectivity Map data were also tested against the DrugBank data to determine whether drug-

gene interactions can be derived by such a method. We only considered drugs and genes which

occurred both in the Connectivity Map and in Drugbank. On this restricted set the ROC integral

of the Connectivity Map data was much better than the one obtained by the NCI-60 co-modules.

This confirms that gene-drug links can be more efficiently predicted through such direct data. Yet,

until expression profiling becomes affordable for many combinations of drugs and cell-lines, the

methodologies developed here to predict drug-response indirectly from the expression of untreated

cell-lines remain highly valuable.


References

[1] H Abdi. Partial Least Square regression (PLS regression). In NJ Salkind, editor, Encyclopedia of Measurement and Statistics, pages 740–744. Thousand Oaks (CA): Sage, 2007.

[2] O Alter, PO Brown, and D Botstein. Singular value decomposition for genome-wide expression data processing and modeling. PNAS, 97:10101–10106, 2000.

[3] O Alter, PO Brown, and D Botstein. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. PNAS, 100:3351–3356, 2003.

[4] S Bergmann, J Ihmels, and N Barkai. Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E, 67:031902, 2003.

[5] AC Culhane, G Perrière, and DG Higgins. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics, 4:59, 2003.

[6] F Gao, BC Foat, and HJ Bussemaker. Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics, 5:31, 2004.

[7] M Kloster, C Tang, and NS Wingreen. Finding regulatory modules through large-scale gene-expression data analysis. Bioinformatics, 21:1172–1179, 2005.

[8] JE Staunton, DK Slonim, HA Coller, P Tamayo, MJ Angelo, J Park, U Scherf, JK Lee, WO Reinhold, JN Weinstein, JP Mesirov, ES Lander, and TR Golub. Chemosensitivity prediction by transcriptional profiling. PNAS, 98(19):10787–10792, 2001.

[9] MH Zweig and G Campbell. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem, 39:561–577, 1993.


[Figure: histogram. x-axis: correlation of probes for same gene (0 to 1); y-axis: number of genes (0 to 2500).]

Supplementary Figure 1: Histogram of the correlation between expression profiles of probes representing the same gene. The median of the correlation values is 0.96.

[Figure: histogram. x-axis: correlation of experiments for the same drug (0 to 1); y-axis: number of drugs (0 to 40).]

Supplementary Figure 2: Histogram of the correlation between growth inhibition profiles of experiments (NSC numbers) for the same drug. The median of the correlation values is 0.75.
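The replicate correlations summarised in Supplementary Figures 1 and 2 can be computed along the following lines. This is only a sketch under stated assumptions: profiles are rows of a matrix without missing values, replicates are identified by a shared group identifier (a gene symbol or a drug name), and Pearson correlation is used. The function name and the toy data are hypothetical.

```python
import numpy as np

def replicate_correlations(profiles, group_ids):
    """Pearson correlation of every pair of profiles that share a group identifier."""
    profiles = np.asarray(profiles, dtype=float)
    ids = np.asarray(group_ids)
    corr = np.corrcoef(profiles)  # all row-by-row correlations
    values = []
    for g in np.unique(ids):
        idx = np.flatnonzero(ids == g)
        # Collect each unordered pair of replicates once.
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                values.append(corr[idx[a], idx[b]])
    return np.array(values)

# Hypothetical toy data: 5 genes, each measured by 2 probes across 60 cell-lines.
rng = np.random.default_rng(1)
base = rng.normal(size=(5, 60))
profiles = np.repeat(base, 2, axis=0) + 0.1 * rng.normal(size=(10, 60))
ids = np.repeat(np.arange(5), 2)

pair_corr = replicate_correlations(profiles, ids)
median_corr = float(np.median(pair_corr))
```

Real drug-response profiles often contain missing entries, in which case each pairwise correlation would have to be restricted to the cell-lines measured in both experiments.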


[Figure: schematic with panels (a), (b), (c) and (d). Node labels: cell-line groups C1 to C6, drug groups D1 to D4, factors F1 to F6 and gene groups G1 to G4. Matrix labels: $P_{fg}$, $A_{fc}$ and $S_{fd}$.]

Supplementary Figure 3: (a-c) Underlying model for generating a combined set of in-silico expression and response data: certain factors (F) are present in subsets of cell-lines (C); some of these factors may alter their susceptibility with respect to subsets of drugs (D) or change the expression of subsets of genes (G). We distinguish between factors that are associated with groups of both drugs and genes [F1 & F2 in (b)] and those that are associated exclusively with either genes [F3 & F4 in (a)] or drugs [F5 & F6 in (c)]. These associations are described by the matrices $P_{fg}$, $A_{fc}$ and $S_{fd}$. In general, factors may be associated with overlapping groups of cell-lines, genes and drugs. (d) In-silico gene expression and drug response data were generated based on the model in (a). Specifically, the gene expression matrix is given by $E = P^\top \cdot A + \varepsilon$ and describes which gene is expressed in which cell-line. Similarly, the drug response matrix $R = S^\top \cdot A + \eta$ indicates which drugs affect the proliferation of which cell-lines. In both cases we added "white noise" ($\varepsilon$ and $\eta$). The goal of an integrative analysis of these data is to identify the co-modules whose cell-lines exhibit both coherent expression and response profiles (as in b) and to distinguish them from the decoupled modules (as in a and c).
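The generative model in this caption, with E equal to the transpose of P times A plus noise and R equal to the transpose of S times A plus noise, can be sketched in a few lines of NumPy. The matrix dimensions, the binary membership values, the module sizes and the noise level below are illustrative assumptions and not the parameters used for the in-silico data of this study; the function name is hypothetical.

```python
import numpy as np

def make_in_silico(n_genes=100, n_drugs=40, n_cells=60, noise=0.5, seed=0):
    """Generate (E, R) from six factors: two co-modules, two gene-only, two drug-only."""
    rng = np.random.default_rng(seed)
    n_factors = 6
    P = np.zeros((n_factors, n_genes))  # factor-to-gene associations (P_fg)
    S = np.zeros((n_factors, n_drugs))  # factor-to-drug associations (S_fd)
    A = np.zeros((n_factors, n_cells))  # factor-to-cell-line associations (A_fc)

    for f in range(n_factors):
        # Every factor is present in a random subset of cell-lines (overlaps allowed).
        A[f, rng.choice(n_cells, size=10, replace=False)] = 1.0
        if f < 4:
            # Factors 0-3 (F1-F4 in the figure) change the expression of some genes.
            P[f, rng.choice(n_genes, size=15, replace=False)] = 1.0
        if f < 2 or f >= 4:
            # Factors 0-1 and 4-5 (F1, F2, F5, F6) alter the response to some drugs.
            S[f, rng.choice(n_drugs, size=8, replace=False)] = 1.0

    # E = P^T A + epsilon and R = S^T A + eta, with Gaussian white noise.
    E = P.T @ A + noise * rng.normal(size=(n_genes, n_cells))
    R = S.T @ A + noise * rng.normal(size=(n_drugs, n_cells))
    return E, R, P, S, A

E, R, P, S, A = make_in_silico()
```

Only factors 0 and 1 appear in both P and S, so they are the co-modules an integrative method should recover, while the other four factors produce modules visible in just one of the two data sets.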


[Figure: Venn diagram of three sets, ISA(E⋅R^T), PPA and ISA(E). Region labels in extraction order: 328 (6932), 133 (7659), 106 (5466), 253 (16138), 57 (5900), 47 (4056), 28 (397).]

Supplementary Figure 4: Venn diagram showing how many biological processes are covered by the different methods. The total number of genes in the processes is indicated in brackets.


[Figure: four scatter-plot panels titled ISA(E⋅R^T), ISA(E), ISA(E) & ISA(R) and clustering(E⋅R^T). y-axis: significance of pathways by Ping-pong [−log10 p-values]; x-axis: significance of pathways by alternative algorithms [−log10 p-values]; both axes run from low to high. Legend: other pathways, cancer related pathways, drug related pathways.]

Supplementary Figure 5: In a similar fashion to Figure 3 of the main paper, the best enrichment p-values for KEGG pathways obtained by the PPA are plotted against the p-values resulting from the other tested algorithms.


Supplementary Figure 6: Same as Figure 3 of the main paper with the FPR extended to 0.2.


Supplementary Figure 7: Same as Figure 3a of the main paper with two additional methods, SVD and Regress. The grey region indicates the 90% confidence interval for the random predictor.

Supplementary Figure 8: Same as Figure 3b of the main paper with two additional methods, SVD and Regress. The grey region indicates the 90% confidence interval for the random predictor.


| Module # | Ref. | Extract/summary from article abstracts | Drug | Gene | Cell line |
|---|---|---|---|---|---|
| 2, 244, 321, 483, 621, 687, 761, 845 | [1] | Focal adhesion complexes have been reported to be disassembled in etoposide-treated Rat-1 cells. | Drug Name | KEGG | - |
| 552, 554, 622, 623, 689, 762 | [2] | The circadian rhythm machinery was shown to play a role in the tolerance of mice for etoposide. | Drug Name | KEGG | - |
| 244 | [3] | Chromosome segregation process was demonstrated to be a target of mitoxantrone and low activity of this process is associated with mitoxantrone resistance. | Drug Name | GO | - |
| | [4] | It was demonstrated that etoposide particularly binds to single-stranded DNA. | Drug Name | GO | - |
| 4 | [5] | Combined evaluation of O6-methylguanine methyltransferase (MGMT) and hMLH1 status determines sensitivity to monofunctional alkylating agents against gallbladder and colon carcinoma cells. Alkylating agents suppressed cell proliferation of MGMT-/hMLH1+ carcinoma cells by arresting the cell cycle at the G2-M phase. | Category | KEGG | Tissue |
| 9 | [6] | Treatment of breast cancer cells with siRNAs targeting Plk1 improved the sensitivity toward paclitaxel in a synergistic manner. | Drug Name | Gene Symbol | - |
| 28 | [6] | Antineoplastic agents [Mitotane] (when combined with siRNAs targeting Plk1) were shown to arrest cell cycle in breast cancer cells (MCF-7, SK-BR-3, MDA-MB-435 and BT-474). | Category | KEGG | Tissue |
| 76 | [7] | Measurement of enzyme activities that might be involved in DNA repair, such as DNA polymerases, revealed significant elevations in alkylating agent [Busulfan] resistant TE-671 MR human rhabdomyosarcoma xenograft. | Category | KEGG | - |
| 313 | [8] | These results strongly suggest that the expression of the c-myc gene plays a role in the acquisition of drug resistance against alkylating agents. Remark: C-myc gene was found to be up-regulated in melanoma cell lines which are highly sensitive to an alkylating agent, Busulfan. | Category | Symbol | - |
| | [9] | A low dose of cisplatin (alkylating agent, Busulfan) repressed the transcription of the MGMT promoter. | | | |
| 76 | [10] | Patients with glioblastoma containing a methylated MGMT promoter benefited from alkylating agent [Busulfan]. High levels of MGMT activity in cancer cells create a resistant phenotype by blunting the therapeutic effect of alkylating agents [Busulfan]. Remark: MGMT gene was found to be up-regulated in the module, and the cell-lines were resistant to Busulfan. | Category | Symbol | - |
| 322 | [11] | Breast cancer resistance protein (BCRP) is a half-molecule ATP-binding cassette transporter that forms a functional homodimer and pumps out various anticancer agents, such as 7-ethyl-10-hydroxycamptothecin, topotecan, mitoxantrone and flavopiridol, from cells. Estrogens [Ethinyl Estradiol], such as estrone and 17beta-estradiol, have been found to restore drug sensitivity levels in BCRP-transduced cells by increasing the cellular accumulation of such agents. | Category | GO | - |
| 4 | [12] | Cell cycle and mitotic arrest is in agreement with down-regulation of TOP2A [Teniposide, Etoposide, Mitoxantrone], TYMS [Floxuridine]. | Target Gene | KEGG | - |
| 7 | [13, 14] | Immune system defects observed in VDR-KO mice are an indirect consequence of VDR [Delsterol] disruption. | Target Gene | KEGG | - |
| 8 | [15] | The antiapoptotic gene BCL2 [Paclitaxel] is associated with control of the cell cycle. | Target Gene | KEGG | - |
| 57 | [16] | A common pathway might exist between Transthyretin (TTR [Resveratrol]) and lipoprotein metabolism. | Target Gene | KEGG | - |
| 88 | [17] | The promoter activity of this TOP2B fragment was constant throughout the cell cycle, in contrast to the activity of the proximal promoter of TOP2A [Teniposide, Etoposide, Mitoxantrone] which was low in resting cells and enhanced during proliferation. | Target Gene | KEGG | - |
| 112 | [18] | The CpG island methylator phenotype (CIMP) with extensive promoter methylation is a distinct phenotype in colorectal cancer. CRABP1 [Tretinoin] is a CIMP-specific marker. | Target Gene | KEGG | - |
| 162 | [19] | Methylthioadenosine phosphorylase (MTAP), an enzyme involved in purine and methionine metabolism. They proved that MTAP [Adenine] deficiency contributes directly to the sensitivity of cancer cells to purine or methionine depletion. | Target Gene | KEGG | - |
| 171 | [20] | Alternate DHFR inhibitors [Methotrexate] combined with drug-resistant DHFR or other chemotherapeutic agent/drug-resistance gene combinations may be required for the application of drug-resistance gene expression to the treatment of chronic myeloid leukemia. | Target Gene | KEGG | - |
| 190 | [21] | A review of the top 12 candidates revealed a relatively novel thyroid cancer markers gene such as CRABP1 [Tretinoin]. This candidate, among others, should help to develop a panel of markers with sufficient sensitivity and specificity for the diagnosis of thyroid tumors in a clinical setting. | Target Gene | KEGG | - |
| 848 | [22] | Impairment of MTX-polyglutamate formation, with membrane transport alteration, in the resistant cells was demonstrated in the previous studies. Further analysis of sensitivity of the cells to trimetrexate (TMQ), which is a potent inhibitor of human DHFR [Methotrexate], revealed a modest decrease in sensitivity to TMQ (2.4- to 15-fold). | Target Gene | KEGG | - |
| 853 | [23] | ADA [Pentostatin]-deficiency has effects on the developing of immune system. | Target Gene | KEGG | - |
| 7 | [24] | To explore the genes involved in chemosensitivity and resistance, 10 human tumour cell lines, including parental cells and resistant subtypes selected for resistance against doxorubicin, melphalan, teniposide and vincristine, were profiled for mRNA expression of 7400 genes using cDNA microarray technology. A substantial number of both proapoptotic and antiapoptotic genes such as focal adhesion kinase were found to be associated to drug resistance. | Drug name | KEGG | - |

Table 1. Results from a semi-automated search of paper abstracts for experimental evidence supporting the gene-drug-sample associations provided by co-modules. Abstracts available on PubMed were searched for keywords pertaining to drugs, genes (mandatory) and cell-lines (optional) generated from each co-module as follows: For drugs we used annotated therapeutic or general categories, as well as gene symbols of known targets, as required keywords. We also submitted the symbols of the 50 top-scoring genes, as well as the names of all functionally enriched GO categories and KEGG pathways. We only considered papers whose abstract had a hit (shown in boldface) for at least one drug-related and one gene-related keyword. Abstracts were also searched for the prevalent tissue type of the co-module (if any). Note that this literature search was only applicable to 741 (86%) of the co-modules which have annotated drugs. The first column refers to the module number, the second references the paper, the third features an extract from its abstract (with the actual drug name in square brackets after its target gene or category), and the remaining columns indicate the type of the hit.

References

1. Kook, S., et al., Caspase-mediated cleavage of p130cas in etoposide-induced apoptotic Rat-1 cells. Mol Biol Cell, 2000. 11(3): p. 929-39.

2. Levi, F., et al., Circadian rhythm in tolerance of mice for etoposide. Cancer Treat Rep, 1985. 69(12): p. 1443-5.

3. Harris, A.L. and D. Hochhauser, Mechanisms of multidrug resistance in cancer treatment. Acta Oncol, 1992. 31(2): p. 205-13.

4. Chow, K.C., T.L. Macdonald, and W.E. Ross, DNA binding by epipodophyllotoxins and N-acyl anthracyclines: implications for mechanism of topoisomerase II inhibition. Mol Pharmacol, 1988. 34(4): p. 467-73.

5. Sato, K., et al., Deficient MGMT and proficient hMLH1 expression renders gallbladder carcinoma cells sensitive to alkylating agents through G2-M cell cycle arrest. Int J Oncol, 2005. 26(6): p. 1653-61.

6. Spankuch, B., et al., Rational combinations of siRNAs targeting Plk1 with breast cancer drugs. Oncogene, 2007.

7. Friedman, H.S., et al., Elevated DNA polymerase alpha, DNA polymerase beta, and DNA topoisomerase II in a melphalan-resistant rhabdomyosarcoma xenograft that is cross-resistant to nitrosoureas and topotecan. Cancer Res, 1994. 54(13): p. 3487-93.

8. Niimi, S., et al., Resistance to anticancer drugs in NIH3T3 cells transfected with c-myc and/or c-H-ras genes. Br J Cancer, 1991. 63(2): p. 237-41.

9. Sato, K., et al., Cisplatin represses transcriptional activity from the minimal promoter of the O6-methylguanine methyltransferase gene and increases sensitivity of human gallbladder cancer cells to 1-(4-amino-2-methyl-5-pyrimidinyl)methyl-3-(2-chloroethyl)-3-nitrosourea. Oncol Rep, 2005. 13(5): p. 899-906.

10. Hegi, M.E., et al., MGMT gene silencing and benefit from temozolomide in glioblastoma. N Engl J Med, 2005. 352(10): p. 997-1003.

11. Sugimoto, Y., et al., Breast cancer resistance protein: molecular target for anticancer drug resistance and pharmacokinetics/pharmacodynamics. Cancer Sci, 2005. 96(8): p. 457-65.

12. Kokkinakis, D.M., X. Liu, and R.D. Neuner, Modulation of cell cycle and gene expression in pancreatic tumor cell lines by methionine deprivation (methionine stress): implications to the therapy of pancreatic adenocarcinoma. Mol Cancer Ther, 2005. 4(9): p. 1338-48.

13. Campbell, M.J., et al., The anti-proliferative effects of 1alpha,25(OH)2D3 on breast and prostate cancer cells are associated with induction of BRCA1 gene expression. Oncogene, 2000. 19(44): p. 5091-7.

14. Mathieu, C., et al., In vitro and in vivo analysis of the immune system of vitamin D receptor knockout mice. J Bone Miner Res, 2001. 16(11): p. 2057-65.

15. Gomez, B.P., et al., Human X-Box binding protein-1 confers both estrogen independence and antiestrogen resistance in breast cancer cell lines. Faseb J, 2007.

16. Sousa, M.M. and M.J. Saraiva, Internalization of transthyretin. Evidence of a novel yet unidentified receptor-associated protein (RAP)-sensitive receptor. J Biol Chem, 2001. 276(17): p. 14420-5.

17. Lok, C.N., et al., Characterization of the human topoisomerase IIbeta (TOP2B) promoter activity: essential roles of the nuclear factor-Y (NF-Y)- and specificity protein-1 (Sp1)-binding sites. Biochem J, 2002. 368(Pt 3): p. 741-51.

18. Ogino, S., et al., Evaluation of Markers for CpG Island Methylator Phenotype (CIMP) in Colorectal Cancer by a Large Population-Based Sample. J Mol Diagn, 2007.

19. Hori, H., et al., Methylthioadenosine phosphorylase cDNA transfection alters sensitivity to depletion of purine and methionine in A549 lung cancer cells. Cancer Res, 1996. 56(24): p. 5653-8.

20. Sweeney, C.L., et al., Methotrexate exacerbates tumor progression in a murine model of chronic myeloid leukemia. J Pharmacol Exp Ther, 2002. 300(3): p. 1075-84.

21. Griffith, O.L., et al., Meta-analysis and meta-review of thyroid cancer gene expression profiling studies identifies important diagnostic biomarkers. J Clin Oncol, 2006. 24(31): p. 5043-51.

22. Koizumi, S. and C.J. Allegra, Enzyme studies of methotrexate-resistant human leukemic cell (K562) subclones. Leuk Res, 1992. 16(6-7): p. 565-9.

23. Blackburn, M.R. and R.E. Kellems, Adenosine deaminase deficiency: metabolic basis of immune deficiency and pulmonary inflammation. Adv Immunol, 2005. 86: p. 1-41.

24. Rickardson, L., et al., Identification of molecular mechanisms for cellular drug resistance by combining drug activity and gene expression profiles. Br J Cancer, 2005. 93(4): p. 483-92.