34
1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

Embed Size (px)

Citation preview

Page 1: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

1

Identifying differentially expressed sets of genes in microarray experiments

Lecture 23, Statistics 246,

April 15, 2004

Page 2: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

2

A cartoon version of microarraysA cartoon version of microarrays

t ID Name-19.83 AA495790 ras homolog gene family-16.83 AA598794 connective tissue growth factor-15.22 AA488676 membrane attached signal protein 1

-14.2AI014487 insulin-like growth factor binding protein 102-13.62 R77252 microtubule-associated protein 7-13.6 AA598601 insulin-like growth factor binding protein 31

-13.57 R09561 decay accelerating factor for complement (CD55)-13.38 AA875933 EGF-containing fibulin-like extracellular matrix protein 1-12.79 AA777187 cysteine-rich, angiogenic inducer-12.63 AA598601 insulin-like growth factor binding protein 3-12.01 AA055835 caveolin 1, caveolae protein, 22kD"-11.88 AA012944 insulin-like growth factor binding protein 102-10.86 AA936757 heparin-binding growth factor binding protein-10.86 AA995282 four and a half LIM domains 2-10.35 AA677403 glycoprotein hormones, alpha polypeptide-9.88 AA430032 pituitary tumor-transforming 1-9.32 AI935290 cysteine and glycine-rich protein 1-9.18 AA936757 heparin-binding growth factor binding protein2-9.06 AA424833 bone morphogenetic protein 6-9.02 AI985398 natriuretic peptide receptor C-8.51 AA630794 solute carrier family 3-8.38 H29897 phospholipase C, beta 42-8.22 W72207 cystatin A (stefin A)-7.99 H45668 Kruppel-like factor 4 (gut)-7.95 AA600217 activating transcription factor 4-7.8 AA149095 dual specificity phosphatase 1

-7.68 W73874 cathepsin L-7.61 R09561 decay accelerating factor for complement-7.53 AW028846 trefoil factor 2 (spasmolytic protein 1)-7.16 N66177 microphthalmia-associated transcription factor-7.14 H03346 protease, serine, 22-7.06 AA169469 pyruvate dehydrogenase kinase, isoenzyme 4-6.96 AI989348 protein disulfide isomerase-related protein-6.94 H63077 annexin A1-6.92 AA610004 Homo sapiens putative oncogene protein-6.84 AA599145 ZW10 (Drosophila) homolog-6.78 AA521434 B-cell CLL/lymphoma 6-6.77 AA400128 general transcription factor II-6.68 T53298 insulin-like growth factor binding protein 7-6.67 T86983 complement component 1

-6.6 AA027240 eukaryotic translation initiation factor 2-6.57 AA482117 Ras homolog enriched in brain 2-6.55 AA464849 thioredoxin reductase 1-6.55 AA400893 phosphodiesterase 1A, calmodulin-dependent-6.5 R91550 arginine-rich, mutated in early stage tumors

-6.45 AA620433 dihydropyrimidinase-like 3-6.45 AA625628 accessory proteins BAP31/BAP29-6.41 T77733 tubulin, gamma 1

List of differentially expressed genesLong list of

d.e. genes

Page 3: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

3

Long lists of d.e. genes biological understanding

What happens next? • Select some genes for validation?• Do follow-up experiments on some genes?• Publish a huge table with the results?• Try to learn about all the genes on the list (read

100s of papers)?• ….Usually, some or all of the above will be done, and

more. Can we help further at this

Page 4: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

4

Sets of genesSets of genes

There are usually many sets of genes that might be of interest in a given microarray experiment. Examples include genes in biological (e.g. biochemical, metabolic, and signalling) pathways, genes associated with a particular location in the cell, or genes having a particular function or being involved in a particular process. We could even include sets of genes for which all of the preceding are unknown, but we have reason believe could be of interest, typically from previous experiments. In thinking like this, it is important to remember that many genes (that is, their protein products) can have multiple functions, or be involved in many processes, etc. There are many databases (EcoCyc, KEGG,..) of pathways, and it is not my intention to review them here. We will focus on the most important related concept: the GO.

Page 5: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

5

The Gene Ontology Consortium

Ashburner et al. Nature Genetics 25: 25-29. http://www.geneontology.org

The goal of the Gene Ontology TM (GO) Consortium is to produce acontrolled vocabulary that can be applied to all organisms even as

knowledgeof gene and protein roles in cells is accumulating and changing. GO

provides three structured networks of defined terms to describe gene product

attributes

Molecular Function Ontology (7304 terms as of April 5, 2004) : the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity

Biological Process Ontology (8517 terms) broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions

Cellular Component Ontology (1394 terms) subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex

Page 6: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

6

From the GO web site. The path back to each ontologyfrom a gene.

We will call each term in apath a split.

Page 7: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

7

Structure of a GO annotationStructure of a GO annotation

Each gene can have several annotated GOs and each GO can have several

splits. E.g. DNA topoisomerase II alpha has 8 GO annotations and 11 splits

Page 8: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

8

Annotation of genes to a node in theontology

Each node is also connectedto many other related nodes.

Page 9: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

9

Are sets of genes Are sets of genes differentially expressed?differentially expressed?

The sets we refer to here are all the outcomes of analyses. Later we discuss sets specified a priori.

Examples of sets. They could be the list of all genes whose differential expression (e.g. average M-value) exceeds a given threshold, typically a liberal one, which would not correspond to any real “significance”, e.g. 1.5-fold. They might be clusters.

What do we mean by a set being differentially expressed. Here it is a convenient shorthand for being unusual in relation to all the genes represented on the array, for example, by being functionally enriched, in the sense of having more genes of a given category than one would expect, by chance.

Page 10: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

10

Hypothesis: Functionally related, differentially expressed genes should accumulate in the corresponding GO-group.

Problem: to find a method which scores accumulation of differential gene expression in a node of the GO.

We describe the calculation from the program Gostat. For all the genes analysed, it determines the annotated GO terms and all splits. It then counts the # of appearances of each GO term for the genes in the set, as well as the # in the reference set, which is typically all genes on the array. Then a 22 table is formed, see over page, and a p-value calculated.

GO and microarray gene sets

Page 11: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

11

Is a GO term is specific for a set?

51 416

125 8588

173 9004

467

8713

9177

count geneswith GO term in set

count geneswithout GOterm in set

count in set(e.g. differentiallyexpressed genes)

Count in reference set (e.g. all genes on array)

Contingency Table P-value

8x10-52

Fisher's exact testor chi-square test

Page 12: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

12

The multiple testing problemThe multiple testing problem

Naturally one doesn’t test a single GO term or split, but many, perhaps 1000s. As with testing of single genes, we need to deal with the multiple testing problem. Many of the solutions from there carry over: Bonferroni, Holm, step-down minP, FDR, and so on. But there are also special problems here, deriving from the nesting relationships between splits. In my view, these are not easily dealt with, and require more research.

Related questions. How can we compare the results of different lists being compared? And, rather than select a set of genes using a cut-off, can we make use the gene abundances or p-values for differential expression?

Page 13: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

13

GOstat: Tool for finding significant GO terms in a list of genes

http://gostat.wehi.edu.au

Page 14: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

14

There are many similar toolsThere are many similar tools

Here are a few.

GenMAPP, and MAPPFinder

EASE (DAVID)

FunSpec

FatiGO

…..

Page 15: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

15

Outline of MAPPfinder:MAPP = MicroArray Pathway Profiler

Page 16: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

16

Analyzing microarray data by functional gene sets defined a priori

Analysis at the level of single gene:• Identifying differentially expressed genes becomes a

challenge when the magnitude of differential expression is small.

• For some differences, many genes are involved.

Analysis at the level of functional group: why? By incorporating biological knowledge, we can hope to detect

modest but coordinate expression changes of sets of functionally related genes.

Page 17: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

17

PGC-1-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human

diabetes

Mootha et al, Nature Genetics July 2003

Data: Affymetrix microarray data on 22,000 genes in skeletal muscle biopsy samples from 43 males, 17 with normal glucose tolerance (NGT), 8 with impaired glucose tolerance and 18 with Type 2 diabetes (DM2).

In their single gene analysis, a t-statistic was calculated for each gene. No significant difference found between NTG and DM2 after adjusting for multiple testing.

Their idea: test 149 a priori defined gene sets for association with disease phenotypes.

Page 18: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

18

149 gene sets149 gene sets

Sets of metabolic pathways:• manually curated pathways (standard textbook

literature reviews, and LocusLink)• Netaffx annotations using GenMAPP metabolic

pathways

Sets of coregulated genes:• SOM clustering of the mouse expression atlas

Page 19: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

19

Two sample Kolmogorov-Smirnov test

To compare two empirical cdfs, SM(x) and SN(x) based on samples of size M and N, resp, the Kolmogorov-Smirnov (K-S)test uses the K-S distance DMN = maxx|SM(x) - SN(x)|. This is normalized by multiplying by (M-1 + N-1). It has a complicated null distribution, which can be approximated by permuting.

Page 20: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

20

From Mootha et al

ES=enrichment score for each gene = scaled K-S dist

A set called OXPHOS got the largest ES score,with p=0.029 on 1,000permutations.

Page 21: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

21

OXPHOSOther

All genesOXPHOS

(A small difference for many genes)

Page 22: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

22

SimplificationSimplification

Mootha et al did a two sample K-S test to compare genes in a specific gene set with genes not in that set.

Instead of doing this, why don’t we simply do a one sample test, comparing each gene set to the whole (population) directly?

Each gene set is small w.r.t. the entire set of genes, so all other genes ≈ all genes.

If we have approximate normality, a z-test should work for shift alternatives. A chi-squared test for scale changes also works.

Page 23: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

23

Mootha’s ts are approx normal

Page 24: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

24

One sample z-testOne sample z-test

Assumption: the (population of) t-statistics of all genes follow normal distribution. Denote the mean by and the SD by .

If this is the case, the best test of the null hypothesis that a sample t1 , t2, …..,tn is from this distribution, with alternative a shift of the original distribution is based on t . Specifically, it uses

z = ( t - )/ / n .

In general, we’d expect =0 and =1, and this is the case for Mootha’s ts.

Thus we test the null hypothesis that our sample comes from the same population using

z = n t . Let’s do a normal qq-plot of the 149 z-statistics of this form.

Page 25: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

25

Mootha’s dataMootha’s dataNormal qq-plot of n x t

OXPHOS

Page 26: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

26

Result from one sample z-test

• OXPHOS is easily identified as -10.

• The next three sets on the top ranking list are all related to oxidative phosphorylation.

z n # overlapping

w/ OXPHOS

OXPHOS_HG-U133A_Probes -10.4 114 114

Human_mitoDB_6_2002_HGU133A_ probes

-5.6 594 106

Mitochondr_HG-U133A_probes -5.2 615 103

MAP00190_Oxidative_phosphorylation -5.0 75 29

Page 27: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

27

Simulation 1Simulation 1

• 1500 29 gene expression values are generated from N(0,1), representing 1500 genes for 9 cases and 20 controls.

• The 1500 genes are divided into 50 gene sets, each with 30 genes. The genes are correlated within each gene set.

• We manipulate the gene expression level of the cases of the first gene set so that the magnitude of difference is known.

Page 28: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

28

Page 29: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

29

Simulated dataSimulated data

Page 30: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

30

Page 31: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

31

Page 32: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

32

P-values (simulation)

o--ES+--Rank sum*--one sample z test

Page 33: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

33

Conclusion

• When the population follows a normal distribution, the one-sample z-test is most powerful for shift alternatives (no surprise: theory says it has to be).

• From the simulation study, the one sample z-test is seen to be more powerful than the two sample K-S test for shift alternatives (even less of a surprise).

• The new method is not as compute intensive as the K-S test.• Similar results can be given for the following test statistic, for scale change

alternatives: for a set of n genes

z’ = i=1,..n [(ti - t)2 - (n-1)] / (2(n-1)).

(A test of no scale change might locate a set of genes that was split, with some having larger and others having smaller ts than average.)

Page 34: 1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004

34

AcknowledgementsAcknowledgements

Tim Beissbarth

Yun Zhou

Karen Vranizan