Biology-Driven Clustering of Microarray Data: Applications to the NCI60 Data Set

K.R. Coombes, K.A. Baggerly, D.N. Stivers, J. Wang, D. Gold, H.G. Sung, and S.J. Lee


Introduction

Most analyses of microarray data proceed as though it were simply a large, unstructured matrix. Such analyses ignore substantial amounts of existing biological information. In the study of cancer, we already know many important genes through their involvement in specific biological processes, and we know that reproducible chromosomal abnormalities play an important role. We see a need for developing analytic strategies that exploit this biological information.

We analyzed the NCI60 data set by first determining the chromosomal location and biological function of the genes on the microarray. We performed separate analyses using genes on individual chromosomes and genes involved in different biological processes. The fundamental advantage of this approach is that it provides results that are immediately and directly interpretable without resorting to ex post facto rationalizations.

Methods


How many genes on the microarray have good annotations?

Number of Spots   Accession Numbers   Current UniGene Status
            294   None                None (control spots)
            128   Only 3’             Unknown to UniGene
           1379   Only 3’             Known to UniGene
              1   Only 5’             Unknown to UniGene
              6   Only 5’             Known to UniGene
            399   Both                Unknown to UniGene
            763   Both                3’ known, 5’ unknown
            291   Both                3’ unknown, 5’ known
            646   Both                Both known, but disagree
           6093   Both                Both known, and agree

Table 1: There are only 7478 spots (out of 10,000) on the array with valid, matching UniGene cluster IDs. Genes with unknown or conflicting annotations were eliminated before performing any further analysis.
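The filtering rule implied by Table 1 can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout, and the cluster IDs are hypothetical, and the rule (every accession number present must map to a known cluster, and the clusters must agree) is inferred from the row counts rather than stated by the authors.

```python
def is_valid_spot(clusters):
    """Decide whether a spot has a valid, matching UniGene annotation.

    `clusters` (hypothetical layout) maps each end that has a GenBank
    accession number ("3'" and/or "5'") to its UniGene cluster ID, or to
    None when UniGene does not know that accession number.
    """
    ids = list(clusters.values())
    if not ids:                        # control spot: no accession numbers
        return False
    if any(c is None for c in ids):    # at least one end unknown to UniGene
        return False
    return len(set(ids)) == 1          # all known ends must agree


# Made-up example spots; the cluster IDs are placeholders.
spots = [
    {},                                # control spot
    {"3'": "Hs.1"},                    # only 3', known
    {"3'": "Hs.1", "5'": None},        # both, 5' unknown
    {"3'": "Hs.1", "5'": "Hs.2"},      # both known, but disagree
    {"3'": "Hs.1", "5'": "Hs.1"},      # both known, and agree
]
kept = [s for s in spots if is_valid_spot(s)]

# The three rows of Table 1 that pass this rule add up to the reported total.
assert 1379 + 6 + 6093 == 7478
```

Under this reading, spots with one known and one unknown end are dropped along with the conflicting ones, which is the only way the counts in Table 1 sum to 7478.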

• Problem:
  – I.M.A.G.E. clone IDs and GenBank accession numbers are archival.
  – UniGene clusters, gene names, descriptions, etc., are changeable.

• Solution:
  – Download the latest version of UniGene (build 137) and LocusLink (July 2001) to update annotations, using the GenBank accession numbers describing both 3’ and 5’ ends of the genes spotted on the microarrays.


Where are the genes located?

[Figure 1 plot: x-axis, Chromosome (1 to 22, X, Y); y-axis, (Observed - Expected) / SD, ranging from -6 to 6. Annotation: chi^2 = 148.8, p < 10^(-10).]

Figure 1: Distribution of the genes on the array by chromosome. Chromosomes 19 and Y are substantially underrepresented when compared to the numbers known to LocusLink; chromosomes 6 and 13 are overrepresented.

We compared the number of genes on the microarray that mapped to each chromosome with the number known to be on the chromosome, based on current figures from the NCBI. A chi-squared test was used to test whether the genes on the array were distributed across chromosomes in proportion to these known counts.
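A minimal Python sketch of this kind of comparison follows. The counts are made up for illustration, the function name is hypothetical, and the standard deviation is approximated as the square root of the expected count (a Poisson-style assumption; the slides do not say how SD was computed).

```python
import math


def chi_squared_by_chromosome(observed, known):
    """Goodness-of-fit statistic and per-chromosome standardized residuals.

    observed[c]: number of array genes mapped to chromosome c.
    known[c]:    number of genes known on chromosome c (reference counts).
    Expected counts are proportional to the known counts.
    """
    n = sum(observed.values())
    total_known = sum(known.values())
    stat = 0.0
    residuals = {}
    for chrom, obs in observed.items():
        expected = n * known[chrom] / total_known
        # Assumption: SD is approximated by sqrt(expected).
        residuals[chrom] = (obs - expected) / math.sqrt(expected)
        stat += (obs - expected) ** 2 / expected
    return stat, residuals


# Toy counts, not the NCI60 numbers.
observed = {"1": 60, "19": 10, "X": 30}
known = {"1": 500, "19": 300, "X": 200}
stat, residuals = chi_squared_by_chromosome(observed, known)
```

The statistic would then be compared with a chi-squared distribution whose degrees of freedom equal the number of chromosomes minus one; negative residuals mark underrepresented chromosomes and positive residuals mark overrepresented ones.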


How do we determine gene functions?

• Using our updated UniGene clusters, we followed the links from UniGene to LocusLink to GeneOntology.

• GeneOntology is a structured, hierarchical vocabulary to describe gene functions in three broad areas:
  – biological process (why)
  – molecular function (what)
  – cellular component (where)

• The 7478 good spots on the array corresponded to 6614 distinct genes, of which 5074 were known to LocusLink, and 2989 had at least one annotation in GeneOntology.

We focused on the biological process annotations in the GeneOntology vocabulary, since these had the most natural interpretation for application to the study of cancer. We counted the number of genes having annotations of functions at or below each level in the hierarchy, and selected a set of categories that each contained roughly one to a few hundred genes, with the categories as a whole accounting for more than 95% of all annotations (Table 2).
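Counting genes "at or below" a level of the hierarchy can be sketched as follows. The tiny hierarchy, the gene names, and the function names are invented for illustration and do not reproduce the actual GeneOntology structure or the authors' code.

```python
def terms_at_or_below(children, term):
    """Return `term` together with every term beneath it in the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def genes_at_or_below(children, annotations, term):
    """Genes with at least one annotation at or below `term`."""
    terms = terms_at_or_below(children, term)
    return {gene for gene, gene_terms in annotations.items()
            if terms & set(gene_terms)}


# Toy hierarchy (parent -> children); hypothetical, not the real ontology.
children = {
    "biological process": ["cell communication", "apoptosis"],
    "cell communication": ["signal transduction", "cell adhesion"],
}
# Toy annotations (gene -> biological process terms).
annotations = {
    "geneA": ["signal transduction"],
    "geneB": ["cell adhesion"],
    "geneC": ["cell communication"],
    "geneD": ["apoptosis"],
}

counts = {term: len(genes_at_or_below(children, annotations, term))
          for term in ["biological process", "cell communication", "apoptosis"]}
```

With real annotations, categories would then be kept when their counts fall in the desired range (roughly one to a few hundred genes in the text above).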


What functional categories are represented on the array?

Function                # Ann.  # Spots    Function              # Ann.  # Spots
Oncogenesis                140      180    Cell shape and size       78      101
Apoptosis                  128      138    Protein traffic          157      188
Physiological proc.        180      210    Transport                146      136
Perc. of ext. stimuli      238      150    Cell proliferation       197      249
Ectoderm devel.            129      152    Stress response          599      372
Mesoderm devel.             92      102    Radiation response       147      136
Cell adhesion              111      140    Cell cycle               494      283
Cell-cell signaling        137      166    Nucleic acid met.        695      595
Signal transduction        222      228    Protein metabolism       471      567
Intracell sig cascade      110      110    Lipid metabolism         146      156
Cell motility              120      153    Carbohydrate met.        103       97
Cell organization           98      118    Energy pathways           88       98

Table 2: The number of annotations (Ann.) and the number of spots on the array in each functional category chosen from the biological process annotations linking LocusLink to GeneOntology. Individual spots may have multiple annotations into the same category; individual genes may be represented by multiple spots.


How good is a dendrogram?

We introduced a quality grade, based on the dendrograms, to describe how well each set of genes used to produce a dendrogram classifies each kind of cancer:

• A = there is a cluster containing all and only one kind of cancer
• B = all, with one or two extras
• C = all except one
• D = all except one, with extras
• E = all except two
• F = all except two, with extras

[Figure 2 plot: dendrogram of the NCI60 cell lines, labeled by tissue of origin (breast, cns, colon, leukemia, melanoma, nsclc, ovarian, prostate, renal); height axis from 0.0 to 0.6.]

Figure 2: Dendrogram using all genes with valid annotations and with expression levels above those of the blank spots.

Grades for the dendrogram of Figure 2 are displayed in the following table.

Cancer B C L M N O P R S

Grade A A D F D C B
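The A through F grading scheme can be sketched in Python as below. This is an interpretation, not the authors' procedure: the function names are hypothetical, clusters are passed in as plain sets of sample names (for example, the leaf sets under each internal node of a dendrogram), and "with extras" for grades D and F is assumed to mean one or two extras, as stated explicitly only for grade B.

```python
def grade_cluster(cluster, members):
    """Grade one cluster against one cancer type; None if it earns no grade."""
    missing = len(members - cluster)   # samples of this type left out
    extras = len(cluster - members)    # samples of other types included
    # Assumption: at most two extras are tolerated for B, D, and F alike.
    if missing > 2 or extras > 2 or not (cluster & members):
        return None
    return "ACE"[missing] if extras == 0 else "BDF"[missing]


def grade_cancer(clusters, members):
    """Best grade (A is best) that any cluster achieves for this cancer type."""
    members = set(members)
    grades = [g for g in (grade_cluster(set(c), members) for c in clusters) if g]
    return min(grades) if grades else None


# Toy example: three colon lines and two candidate clusters.
colon = {"colon.ht29", "colon.hct15", "colon.km12"}
clusters = [
    {"colon.ht29", "colon.hct15"},                              # all except one
    {"colon.ht29", "colon.hct15", "colon.km12", "renal.a498"},  # all, one extra
]
best = grade_cancer(clusters, colon)
```

A return value of None corresponds to a blank cell in the grade tables. Very small groups (such as the two prostate lines) can earn a grade from a trivially small cluster under this rule, so in practice one would probably restrict the candidate clusters to those with at least two samples.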


Heterogeneity of different types of cancer

ch B C L M N O P R S ch B C L M N O P R S

1 B A D F D B 13 D E

2 E C D D E D E 14 A A F

3 C E D E F 15 C B C F C

4 E E E E 16

5 A A D F E 17 A A D F E E

6 C A D E E D 18 E D

7 E A D E C E 19 D D

8 E C D 20 E C

9 B C C E E E 21

10 D E 22 A E E

11 E C C D X B A D E D

12 B C C E E E

• Some cancers (colon, leukemia) are fairly homogeneous and easy to distinguish from others.

• Some (breast, lung) are so heterogeneous as to be nearly impossible to distinguish.

• Some chromosomes (1, 2, 6, 7, 9, 12, 17) can distinguish many types of cancer.

• Some (16, 21) cannot accurately distinguish any kind of cancer. The dendrograms using genes from these chromosomes are equivalent to a random scrambling of the cancer cell lines.

Table 3: Grades given to dendrograms that cluster samples by genes on specific chromosomes. Grades range from A to F, with blanks indicating no clustering for that type of sample. Abbreviations: B=breast, C=colon, L=leukemia, M=melanoma, N=non small cell lung, O=ovarian, P=prostate, R=renal, S=central nervous system.
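Clustering samples by the genes on a single chromosome starts from a sample-by-sample distance computed on that gene subset. The sketch below shows one way to do this in plain Python. The distance (one minus the Pearson correlation), the data layout, and all names are assumptions made for illustration; the slides do not state which distance or linkage was used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_distances(expr, gene_chrom, chrom):
    """Distance matrix between samples using only the genes on `chrom`.

    expr[gene] is a list of expression values, one per sample.
    gene_chrom[gene] is the chromosome the gene maps to.
    Assumed distance: 1 - Pearson correlation between sample profiles.
    """
    genes = [g for g in expr if gene_chrom.get(g) == chrom]
    n_samples = len(next(iter(expr.values())))
    profiles = [[expr[g][s] for g in genes] for s in range(n_samples)]
    return [[1.0 - pearson(p, q) for q in profiles] for p in profiles]


# Toy data: four genes, three samples; values are made up.
expr = {
    "g1": [1.0, 2.0, 5.0],
    "g2": [2.0, 4.0, 1.0],
    "g3": [3.0, 6.0, 0.0],
    "g4": [9.0, 9.0, 9.0],
}
gene_chrom = {"g1": "2", "g2": "2", "g3": "2", "g4": "16"}
dist = sample_distances(expr, gene_chrom, "2")
```

The resulting matrix can be handed to any hierarchical clustering routine to produce one dendrogram per chromosome; the same subsetting step applies to functional categories by replacing the chromosome lookup with a category lookup.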


[Figure 3 plot, panel title "Chromosome 2": clustering of the NCI60 cell lines, labeled by tissue of origin, using genes on chromosome 2; dendrogram height axes from 0.0 to 0.8 and from 0.0 to 0.6.]

Figure 3: The genes on chromosome 2 do an excellent job of distinguishing cancer types. We can also locate specific clusters of genes on the chromosome with strong signatures identifying leukemia, melanoma, and colon cancer.


[Figure 4 plot, panel title "Chromosome 16": clustering of the NCI60 cell lines, labeled by tissue of origin, using genes on chromosome 16; dendrogram height axes from 0.0 to 0.8 and from 0.0 to 0.6.]

Figure 4: Genes on chromosome 16 cannot reliably distinguish any single kind of cancer in this study. There are, nevertheless, strong gene signatures driving the clustering, which does not appear to match anything we know about the biology of the samples.


[Figure 5 plot, panel title "Protein Metabolism" (protein metabolism and modification): clustering of the NCI60 cell lines, labeled by tissue of origin, using genes in this category; dendrogram height axes from 0.0 to 0.6 and from 0.0 to 0.8.]

Figure 5: The genes involved in protein metabolism do an excellent job of distinguishing cancer types. We can also locate specific clusters of genes within this category with strong signatures identifying leukemia, colon cancer, lung cancer, and central nervous system cancer.


[Figure 6 plot, panel title "Apoptosis" (death (apoptosis)): clustering of the NCI60 cell lines, labeled by tissue of origin, using genes in this category; dendrogram height axes from 0.0 to 0.8 and from 0.0 to 0.6.]

Figure 6: The genes involved in apoptosis do a poor job of distinguishing cancer types. This suggests that the mechanisms by which cancers overcome cell death cut across the normal biological lines drawn by histology.


Conclusions

• Multiple views into the data provide substantial insight into differences in cancer types and gene sets.

• Cancer types differ greatly in their degree of heterogeneity, ranging from homogeneous (colon, leukemia) through moderately heterogeneous (renal, melanoma) to extremely heterogeneous (breast and lung).

• Homogeneous cancers exhibit strong identifying signals across most views of the data, regardless of function or chromosome.

• There are large differences in the ability of genes on different chromosomes to distinguish cancer types. There are similar differences for genes involved in different biological processes (data not shown).

• Functional categories that are good at distinguishing cancers include signal transduction, cell cycle, cell proliferation, and protein metabolism. Some differences result from the histology of the underlying tissue. Others reflect differences in the way particular kinds of cancers overcome limits on cell growth.

• Categories that are poor at distinguishing cancers include energy pathways and apoptosis. The latter observation has potential implications for cancer therapies designed to trigger apoptosis, since it suggests that the mechanisms by which cancer cells avoid cell death are not linked to the general type of cancer but are either common across cancers or idiosyncratic.