Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Page 1: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Detecting Differentially Expressed Genes

Pengyu Hong
09/13/2005

Page 2: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Background (Microarray)

Cells → Extract RNA

Page 6: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Background

Cells → Extract RNA

10^4+ genes

Page 9: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Background

Biological sample → RNA extraction (total RNA or mRNA) → Amplification (in vitro transcription) → Label samples → Hybridization → Washing and staining → Scanning

Sources of variability: biological variability and technical variability

• Microarrays are highly noisy
• Use replicated experiments to make inferences about differential expression for the population from which the biological samples originate

Page 10: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Background

Normalization

Calculate Gene Expression Index

Page 11: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

An Example

probe set | gene | Normal m412a | Normal m414a | Normal m416a | Normal m426a | Normal m430a | MM m282 | MM m331a | MM m332a | MM m333a | MM m334a | MM m353a | MM m408a | MM m423a
31307_at | pre-T/NK cell associated protein | 28.53 | 32.61 | 29.56 | 36.55 | 33.19 | 25.1 | 32.79 | 34.3 | 35.44 | 28.48 | 29.55 | 22.28 | 28.77
31308_at | pre-T/NK cell associated protein | 69.14 | 53.69 | 52.78 | 62.07 | 58.74 | 67.88 | 85.82 | 83.54 | 85.91 | 60.93 | 62.82 | 47.17 | 77.07
31309_r_at | Human breast cancer suppressor element Ishmael Upper CP1 mRNA, partial cds | 16.9 | 67.7 | 27.61 | 46.16 | 51.46 | 45.62 | 35.57 | 32.62 | 35.14 | 96.18 | 45.94 | 63.2 | 38.27
31310_at | glycine receptor, alpha 1 (startle disease/hyperekplexia, stiff man syndrome) | 67.42 | 49.55 | 55.51 | 59.57 | 68.42 | 91.06 | 91.23 | 83.66 | 76.37 | 71.23 | 74.95 | 74.04 | 100.77
31311_at | Homo sapiens cDNA FLJ40594 fis, clone THYMU2010671, highly similar to Homo sapiens T-cell receptor | 78.73 | 62.91 | 60.84 | 72.98 | 72.9 | 79.39 | 85.52 | 82.57 | 69.69 | 63.72 | 64.29 | 62.85 | 67.58
31312_at | potassium voltage-gated channel, Shab-related subfamily, member 2 | 66.65 | 59.46 | 55.47 | 61.75 | 69.92 | 75.28 | 85.53 | 97.91 | 69.92 | 74.77 | 71.83 | 58.17 | 72.15
31313_at | mannosyl (alpha-1,6-)-glycoprotein beta-1,6-N-acetyl-glucosaminyltransferase | 115.33 | 95.51 | 84.48 | 94.99 | 109.04 | 105.05 | 118.68 | 106.76 | 142.88 | 103.72 | 106.19 | 98.58 | 104.13
31314_at | bone morphogenetic protein 3 (osteogenic) | 71.89 | 36.24 | 41.86 | 46.99 | 45.94 | 46.67 | 67.56 | 66.14 | 53.95 | 40.97 | 47.96 | 43.63 | 53.54
31315_at | immunoglobulin lambda locus | 103.99 | 88.27 | 83.81 | 81.81 | 254.63 | 87.12 | 99.11 | 109.56 | 86.37 | 75.03 | 74.97 | 69.02 | 97.22
31316_at | Human vacuolar ATPase (isoform HO68) mRNA, complete cds | 16.79 | 10.08 | 9.53 | 16.48 | 11.98 | 12.8 | 16.7 | 18.76 | 11.25 | 12.09 | 18.89 | 10.81 | 19.49
31317_r_at | Human unproductively rearranged Ig mu-chain mRNA V-region (VD), 5' end, clone mu-3A1A | 316.75 | 269.61 | 254.92 | 352.61 | 342.4 | 327.12 | 366.39 | 346 | 308.43 | 279.81 | 312.4 | 318.06 | 334.27
31318_at | Stem cell factor {alternatively spliced} [human, preimplantation embryos, blastocysts, mRNA Partial, 180 nt] | 32.68 | 19.79 | 27.45 | 29.56 | 28.34 | 26.55 | 38.04 | 41.05 | 31.91 | 22.76 | 23.58 | 28.29 | 22.61
31319_at | Cluster Incl. M20707:Human kappa-immunoglobulin germline pseudogene (Chr22.4) variable region (subgroup V kappa II) /cds=(0,320) /gb=M20707 /gi=185954 /ug=Hs.123030 /len=363 | 252.78 | 441.07 | 143.32 | 400.01 | 373.4 | 105.06 | 105.72 | 87.02 | 110.75 | 161.69 | 84.88 | 240.91 | 210.54
31320_at | Cluster Incl. U18548:Human GPR12 G protein coupled-receptor gene, complete cds /cds=(15,1019) /gb=U18548 /gi=604499 /ug=Hs.123034 /len=1101 | 101.42 | 89.07 | 79.51 | 100.69 | 120.06 | 116.74 | 121.41 | 134.74 | 131.36 | 137.4 | 114.15 | 119.89 | 126.74
31321_at | Cluster Incl. U41737:Human pancreatic beta cell growth factor (INGAP) mRNA, complete cds /cds=(5,520) /gb=U41737 /gi=1514681 /ug=Hs.123060 /len=586 | 112.27 | 62.17 | 62.44 | 80.17 | 110.97 | 53.89 | 55.04 | 55.16 | 63.37 | 54.35 | 57.79 | 48.07 | 47.89
31322_at | Cluster Incl. X61079:Human mRNA for T cell receptor, clone IGRA24 /cds=(0,142) /gb=X61079 /gi=33521 /ug=Hs.123062 /len=235 | 44.15 | 52.5 | 44.8 | 46.25 | 55.96 | 50.01 | 53.2 | 52.24 | 62.16 | 49.94 | 47.24 | 40.64 | 50.1
31323_r_at | Glutamate transporter II variant B/HBGT IIB {5' region} [human, brain and spinal cord, mRNA Partial Mutant, 129 nt] | 141.44 | 177.7 | 138.58 | 142.61 | 167.28 | 169.49 | 199.64 | 185.22 | 218.79 | 196.56 | 150.14 | 185.24 | 226.37
31324_at | Cluster Incl. U82303:Homo sapiens unknown protein mRNA, partial cds /cds=(0,257) /gb=U82303 /gi=1938329 /ug=Hs.123080 /len=344 | 70.87 | 57.8 | 61.61 | 65.93 | 84.05 | 106.41 | 106.73 | 87.01 | 112.12 | 78.47 | 111 | 89.08 | 100.53
31325_at | Cluster Incl. U82306:Homo sapiens unknown protein mRNA, partial cds /cds=(0,221) /gb=U82306 /gi=1938333 /ug=Hs.123081 /len=253 | 68.63 | 167.66 | 69.04 | 112.84 | 120.46 | 126.72 | 107.04 | 100 | 116.83 | 207.5 | 125.65 | 155.19 | 102.55
31326_at | Cluster Incl. AF005081:Homo sapiens skin-specific protein (xp32) mRNA, partial cds /cds=(0,340) /gb=AF005081 /gi=2589189 /ug=Hs.123091 /len=416 | 157.67 | 127.49 | 123.37 | 146.18 | 150.95 | 159.46 | 184.08 | 206.02 | 182.95 | 139.01 | 154.57 | 143.09 | 175.27
31327_at | Cluster Incl. AF015124:Homo sapiens IgG heavy chain variable region (Vh26) mRNA, partial cds /cds=(0,305) /gb=AF015124 /gi=2599349 /ug=Hs.123093 /len=340 | 35.57 | 28.17 | 33.64 | 32.36 | 38.76 | 43.13 | 40.16 | 46.8 | 34.47 | 33.71 | 25.74 | 29.45 | 37.37
31328_at | solute carrier family 34 (sodium phosphate), member 1 | 61.61 | 48.23 | 50.76 | 58 | 57.58 | 57.91 | 69.91 | 72.34 | 70.29 | 54.98 | 59.74 | 45.55 | 63.02
31329_at | Human putative opioid receptor mRNA, complete cds | 12.23 | 18.91 | 15.36 | 19.99 | 21.15 | 15.9 | 20.76 | 22.26 | 16.15 | 28.86 | 13.59 | 16.06 | 23.85
31330_at | ribosomal protein S19 | 108.87 | 133.3 | 89.84 | 113.02 | 147.61 | 169.87 | 156.81 | 136.47 | 153.07 | 220.54 | 220.96 | 332.11 | 183
31331_at | surfactant protein A binding protein | 28.21 | 17.99 | 23.56 | 26.37 | 30.35 | 28.84 | 31.54 | 35.06 | 22.53 | 24.45 | 23 | 21.37 | 30.86
31332_at | RIG-like 14-1 | 20.77 | 18.58 | 19.03 | 18.29 | 20.86 | 23.56 | 25.11 | 24.43 | 19.3 | 28.72 | 17.27 | 23.18 | 25.35
31333_at | tolloid-like 1 | 22.97 | 52.9 | 26.95 | 41.22 | 48.38 | 48.85 | 42.09 | 40.13 | 40.73 | 89.86 | 46.96 | 59.74 | 40.87
31334_at | G protein-coupled receptor 45 | 98.57 | 100.16 | 78.76 | 119.09 | 118.58 | 97.42 | 110.67 | 104.95 | 143.47 | 111.28 | 102.88 | 115.9 | 133.17
31335_at | clone 1900 unknown protein | 65.79 | 54.3 | 54.79 | 57.23 | 60.75 | 66.98 | 72.89 | 86.97 | 76.34 | 57.65 | 59.83 | 49.21 | 70.83
31336_at | Cluster Incl. AC004076:Homo sapiens chromosome 19, cosmid R30217 /cds=(0,2075) /gb=AC004076 /gi=2822142 /ug=Hs.129709 /len=2076 | 40.97 | 26.15 | 32.55 | 26.26 | 33.73 | 36.15 | 36.03 | 34.93 | 24.12 | 26.26 | 22.55 | 23.64 | 28.11

5 normal samples and 9 myeloma (MM) samples; 12,558 genes (rows)

Page 12: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Genes of Interest

• Statistical significance: that the observed differential expression is unlikely to be due to chance.

• Scientific significance: that the observed level of differential expression is of sufficient magnitude to be of biological relevance.

Page 13: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Statistical significance in the two group problem

Parametric Test: t-test

Group 1 (N samples): X1, X2, …, XN

Group 2 (M samples): Y1, Y2, …, YM

Assume

Xi ~ Normal(μ1, σ²)

Yj ~ Normal(μ2, σ²)

Null hypothesis: Group 1 is the "same" as Group 2 (i.e., μ1 = μ2)

Page 14: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

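Statistical significance in the two group problem

Parametric Test: t-test

Xi ~ Normal(μ1, σ²), Yj ~ Normal(μ2, σ²)

Null hypothesis: μ1 = μ2

Test the null hypothesis with the test statistic:

$$t^* = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s^2}{N} + \dfrac{s^2}{M}}}, \qquad t^* \sim t(N + M - 2)$$

where

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad \bar{Y} = \frac{1}{M}\sum_{j=1}^{M} Y_j$$

$$s_1^2 = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^2, \qquad s_2^2 = \frac{1}{M-1}\sum_{j=1}^{M} (Y_j - \bar{Y})^2$$

$$s^2 = \frac{(N-1)\,s_1^2 + (M-1)\,s_2^2}{N + M - 2}$$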
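The pooled-variance statistic on this slide can be computed in a few lines. The following is a minimal sketch, not the author's code: the function name `pooled_t` is made up for illustration, and the input values are the 5 normal and 8 MM values listed for probe set 31310_at in the example table.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Return (t*, degrees of freedom) for the equal-variance two-sample t-test."""
    n, m = len(x), len(y)
    s1_sq = variance(x)  # sample variance of group 1 (divisor N - 1)
    s2_sq = variance(y)  # sample variance of group 2 (divisor M - 1)
    # Pooled variance estimate s^2
    s_sq = ((n - 1) * s1_sq + (m - 1) * s2_sq) / (n + m - 2)
    t_star = (mean(x) - mean(y)) / sqrt(s_sq / n + s_sq / m)
    return t_star, n + m - 2


# Probe set 31310_at as listed in the example table
normal = [67.42, 49.55, 55.51, 59.57, 68.42]
mm = [91.06, 91.23, 83.66, 76.37, 71.23, 74.95, 74.04, 100.77]

t_star, df = pooled_t(normal, mm)
print(f"t* = {t_star:.3f} with {df} degrees of freedom")
```

A negative t* here means the normal group has the lower mean expression.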

Page 15: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

If variances are unequal

Xi ~ Normal(μ1, σ1²), Yj ~ Normal(μ2, σ2²), σ1 ≠ σ2

$$t' = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s_1^2}{N} + \dfrac{s_2^2}{M}}}$$

(1) When N + M > 30, this is approximately normal

(2) When σ1 >> σ2, this is approximately t(df = N − 1)

(3) In general, Welch approximation: t' ~ t(df'), where

$$df' = \frac{\left(s_1^2/N + s_2^2/M\right)^2}{\dfrac{(s_1^2/N)^2}{N-1} + \dfrac{(s_2^2/M)^2}{M-1}}$$
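As a rough sketch of the unequal-variance case, the code below computes t' and the Welch degrees of freedom df'. The function name `welch_t` and the toy inputs are illustrative assumptions, not part of the slides.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t', df') for the unequal-variance (Welch) two-sample test."""
    n, m = len(x), len(y)
    a = variance(x) / n  # s1^2 / N
    b = variance(y) / m  # s2^2 / M
    t_prime = (mean(x) - mean(y)) / sqrt(a + b)
    # Welch approximation to the degrees of freedom
    df_prime = (a + b) ** 2 / (a ** 2 / (n - 1) + b ** 2 / (m - 1))
    return t_prime, df_prime


# Toy data (hypothetical): group 1 is far more variable than group 2
wide = [0, 100, 200, 300]
narrow = [1.000, 1.001, 1.002, 1.003]
t_prime, df_prime = welch_t(wide, narrow)
print(f"t' = {t_prime:.3f}, df' = {df_prime:.3f}")
```

With the first group's variance dominating, df' comes out close to N − 1 = 3, which illustrates case (2).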

Page 16: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Wilcoxon rank sum test

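Consider row 7 of the MM study (the first 5 values are the normal samples, the remaining 9 are the MM samples; rank 1 is the largest value):

Value: 16 253 633 1008 708 | 36 72 28 14 33 19 49 58 23

Rank: 13 4 3 1 2 | 8 5 10 14 9 12 7 6 11

Rank sum of the first group = 13 + 4 + 3 + 1 + 2 = 23

This test is more appropriate than the t-tests when the underlying distribution is far from normal. (But it requires large group sizes.)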
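The ranks and the rank sum for this row can be reproduced with a short script. This is only a sketch: it ranks in descending order (largest value gets rank 1), as the slide does, and it does not handle ties, which this row does not contain.

```python
# Row 7 of the MM study: 5 normal samples followed by 9 MM samples
values = [16, 253, 633, 1008, 708, 36, 72, 28, 14, 33, 19, 49, 58, 23]
n_normal = 5

# Indices sorted by value, largest first; position in this order is the rank
order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
ranks = [0] * len(values)
for rank, idx in enumerate(order, start=1):
    ranks[idx] = rank

rank_sum = sum(ranks[:n_normal])
print(ranks)     # [13, 4, 3, 1, 2, 8, 5, 10, 14, 9, 12, 7, 6, 11]
print(rank_sum)  # 23
```

Many textbooks rank in ascending order instead; the test is equivalent either way as long as the reference distribution uses the same convention.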

Page 17: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

P-value

• p-value = P(|T| > |t|) is calculated based on the distribution of T under the null hypothesis.

• The p-value is a function of the test statistic and can be viewed as a random variable.
  – e.g. p-value = 2(1 − F(|t*|)), where F is the cdf of t(N + M − 2).

• A small p-value represents evidence against the null hypothesis (i.e., the gene is differentially expressed in our case).

Page 18: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Permutation test

• A non-parametric way of computing the p-value for any test statistic.
  – In the MM study, each gene has (14 choose 5) = 2002 different test values obtainable from permuting the group labels.

• Under the null hypothesis that the distributions for the two groups are identical, all these test values are equally probable. What is the probability of getting a test value at least as extreme as the observed one? This is the permutation p-value.

Page 19: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Permutation technique

Original labeling: compute TS0
  Condition 0: Patient 1, Patient 2, Patient 3
  Condition 1: Patient 4, Patient 5, Patient 6

Permuted labelings: compute TS1, TS2, TS3
  Condition 0: Patient 4, Patient 2, Patient 3 | Condition 1: Patient 1, Patient 5, Patient 6
  Condition 0: Patient 1, Patient 2, Patient 5 | Condition 1: Patient 4, Patient 3, Patient 6
  Condition 0: Patient 1, Patient 6, Patient 3 | Condition 1: Patient 4, Patient 5, Patient 2

The set of TSi forms the empirical distribution of the test statistic TS
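The permutation technique can be sketched as an exhaustive enumeration of group relabelings. The code below is an illustration under stated assumptions: the test statistic is the difference of group means, "at least as extreme" is taken two-sided (absolute value), and the function name `permutation_p_value` is made up. The data are the 5 normal and 9 MM values from the Wilcoxon slide.

```python
from itertools import combinations
from math import comb
from statistics import mean


def permutation_p_value(x, y, stat=lambda a, b: mean(a) - mean(b)):
    """Exact two-sided permutation p-value over all relabelings of the pooled data."""
    pooled = list(x) + list(y)
    n = len(x)
    observed = abs(stat(x, y))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so the observed labeling always counts as extreme
        if abs(stat(g1, g2)) >= observed - 1e-12:
            extreme += 1
    return extreme / total, total


normal = [16, 253, 633, 1008, 708]
mm = [36, 72, 28, 14, 33, 19, 49, 58, 23]
p_value, n_relabelings = permutation_p_value(normal, mm)
print(n_relabelings == comb(14, 5))  # True: 2002 relabelings
print(f"permutation p-value = {p_value:.4f}")
```

For larger groups the number of relabelings grows quickly, so in practice a random subset of permutations is usually sampled instead of enumerating all of them.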

Page 20: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Scientific Significance

• Fold change FC = $\bar{X} / \bar{Y}$

• May not be high when statistical significance is high.

• Not an appropriate measure if the dispersion is not taken into consideration.

Page 21: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Conservative fold change

Conservative fold change (CFC) =

Max (25th percentile of sample 1 / 75th percentile of sample 2,

25th percentile of sample 2 / 75th percentile of sample 1)

Page 22: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

[Figure: density curves of the two samples; x-axis from 98 to 106, density from 0.0 to 0.4]

Sample 1: Normal (100, 1)

Sample 2: Normal (103, 1)

CFC = 1.0164
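The quoted CFC value can be checked from the population quartiles of the two normal distributions. The sketch below also includes a sample-based version. The function name `conservative_fold_change` is hypothetical, and the quartile method ("inclusive" interpolation) is an assumption, since the slides do not say how percentiles are estimated.

```python
from statistics import NormalDist, quantiles


def conservative_fold_change(s1, s2):
    """CFC from two samples: max of 25th/75th percentile ratios in both directions."""
    q1 = quantiles(s1, n=4, method="inclusive")  # [25th, 50th, 75th] percentiles
    q2 = quantiles(s2, n=4, method="inclusive")
    return max(q1[0] / q2[2], q2[0] / q1[2])


# Population version for Sample 1 ~ Normal(100, 1) and Sample 2 ~ Normal(103, 1)
d1, d2 = NormalDist(100, 1), NormalDist(103, 1)
cfc_population = max(
    d1.inv_cdf(0.25) / d2.inv_cdf(0.75),
    d2.inv_cdf(0.25) / d1.inv_cdf(0.75),
)
print(round(cfc_population, 4))  # 1.0164
```

Although the two means are three standard deviations apart, the CFC stays close to 1 because the ratio is taken between values near 100.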

Page 23: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

[Figure: example expression plots annotated with CFC = 3.53, CFC = 1.07, CFC = 2.89, and CFC = 1.45]

Page 24: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

P-values and FC contain different information

[Figure: plot of fold.quantile[idx] (y-axis, about 2 to 14) against -log10(p.value.unequal.log)[idx] (x-axis, about 3 to 7)]

Page 25: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Gene Selection and Ranking

• Set a high threshold of statistical significance: select genes with p-values smaller than a threshold

• The selected genes are ordered according to their scientific significance (i.e. ranked by fold-changes)
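The two-step procedure (filter by p-value, then rank by fold change) can be sketched as follows. The gene names, p-values, and fold changes are made-up numbers for illustration only, and `select_and_rank` is a hypothetical helper.

```python
def select_and_rank(genes, p_threshold=0.01):
    """genes: list of (name, p_value, fold_change) tuples.

    Keep genes with p-value below the threshold, then order them
    by fold change, largest first.
    """
    selected = [g for g in genes if g[1] < p_threshold]
    return sorted(selected, key=lambda g: g[2], reverse=True)


# Hypothetical values, not taken from the MM study
genes = [
    ("gene_a", 0.002, 1.8),
    ("gene_b", 0.20, 3.5),   # large fold change but not significant
    ("gene_c", 0.0005, 2.9),
    ("gene_d", 0.008, 1.2),
]
ranked = select_and_rank(genes)
print([name for name, _, _ in ranked])  # ['gene_c', 'gene_a', 'gene_d']
```

Note that gene_b is dropped despite having the largest fold change, because it fails the significance filter first.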

Page 26: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

The False Positive Rate (FPR)

• If we select genes with p-value < 0.01, then the probability of making a positive call when the gene is in fact not differential is less than 0.01. Thus selection by p-value controls the FPR.

• However, if we have 12,000 genes in a microarray, then an FPR of 0.01 still allows about 120 false positives on average (12,000 × 0.01). To make sensible decisions, we must take multiple comparisons into consideration.

Page 27: Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Dealing with Multiple Comparison

• Bonferroni inequality: To control the family-wise error rate for testing m hypotheses at level α, we need to control the FPR for each individual test at α/m

• Then P(at least one false rejection) ≤ α, or equivalently P(no false rejection) ≥ 1 − α

• This is appropriate for some applications (e.g. testing a new drug versus several existing ones), but is too conservative for our task of gene selection.
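To see how strict the Bonferroni threshold becomes at microarray scale, the sketch below computes the expected number of false positives at an uncorrected cutoff and the per-test cutoff α/m. The helper `bonferroni_select` and the choice α = 0.05 are illustrative assumptions.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return (indices of rejected hypotheses, per-test threshold alpha / m)."""
    m = len(p_values)
    threshold = alpha / m
    rejected = [i for i, p in enumerate(p_values) if p < threshold]
    return rejected, threshold


m_genes = 12000
expected_false_positives = m_genes * 0.01  # uncorrected cutoff p < 0.01
per_test_cutoff = 0.05 / m_genes           # Bonferroni cutoff for alpha = 0.05

print(f"expected false positives at p < 0.01: {expected_false_positives:.0f}")
print(f"Bonferroni per-test cutoff: {per_test_cutoff:.2e}")

# Hypothetical p-values: only the first survives the correction
rejected, threshold = bonferroni_select([0.001, 0.02, 0.04])
print(rejected, threshold)
```

With 12,000 genes the per-test cutoff is on the order of 4 × 10⁻⁶, which few genes reach with only 14 samples; this is the sense in which the correction is too conservative for gene selection.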