Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

Detecting Differentially Expressed Genes

Pengyu Hong09/13/2005

Background (Microarray)

CellsExtract RNA

Background

CellsExtract RNA

Background

CellsExtract RNA

Background

CellsExtract RNA

Background

CellsExtract RNA

104+ genes

Background

CellsExtract RNA

104+ genes

Background

CellsExtract RNA

104+ genes

Background

Biological sample

RNA extraction (total RNA or mRNA) Amplification (in vitro transcription) Label samples Hybridization Washing and staining

Scanning

• Microarrays are highly noisy • Use replicated experiments to

make inferences about differential expression for the population from which the biological samples originate

biological variability

technical variability

Background

Normalization

Calculate Gene Expression Index

An Example

probe set gene Normal m412aNormal m414aNormal m416aNormal m426aNormal m430aMM m282 MM m331aMM m332aMM m333aMM m334aMM m353aMM m408aMM m423a31307_at pre-T/NK cell associated protein28.53 32.61 29.56 36.55 33.19 25.1 32.79 34.3 35.44 28.48 29.55 22.28 28.7731308_at pre-T/NK cell associated protein69.14 53.69 52.78 62.07 58.74 67.88 85.82 83.54 85.91 60.93 62.82 47.17 77.0731309_r_atHuman breast cancer suppressor element Ishmael Upper CP1 mRNA, partial cds16.9 67.7 27.61 46.16 51.46 45.62 35.57 32.62 35.14 96.18 45.94 63.2 38.2731310_at glycine receptor, alpha 1 (startle disease/hyperekplexia, stiff man syndrome)67.42 49.55 55.51 59.57 68.42 91.06 91.23 83.66 76.37 71.23 74.95 74.04 100.7731311_at Homo sapiens cDNA FLJ40594 fis, clone THYMU2010671, highly similar to Homo sapiens T-cell receptor78.73 62.91 60.84 72.98 72.9 79.39 85.52 82.57 69.69 63.72 64.29 62.85 67.5831312_at potassium voltage-gated channel, Shab-related subfamily, member 266.65 59.46 55.47 61.75 69.92 75.28 85.53 97.91 69.92 74.77 71.83 58.17 72.1531313_at mannosyl (alpha-1,6-)-glycoprotein beta-1,6-N-acetyl-glucosaminyltransferase115.33 95.51 84.48 94.99 109.04 105.05 118.68 106.76 142.88 103.72 106.19 98.58 104.1331314_at bone morphogenetic protein 3 (osteogenic)71.89 36.24 41.86 46.99 45.94 46.67 67.56 66.14 53.95 40.97 47.96 43.63 53.5431315_at immunoglobulin lambda locus103.99 88.27 83.81 81.81 254.63 87.12 99.11 109.56 86.37 75.03 74.97 69.02 97.2231316_at Human vacuolar ATPase (isoform HO68) mRNA, complete cds16.79 10.08 9.53 16.48 11.98 12.8 16.7 18.76 11.25 12.09 18.89 10.81 19.4931317_r_atHuman unproductively rearranged Ig mu-chain mRNA V-region (VD), 5' end, clone mu-3A1A316.75 269.61 254.92 352.61 342.4 327.12 366.39 346 308.43 279.81 312.4 318.06 334.2731318_at Stem cell factor {alternatively spliced} [human, preimplantation embryos, blastocysts, mRNA Partial, 180 nt]32.68 19.79 27.45 29.56 28.34 26.55 38.04 41.05 31.91 22.76 23.58 28.29 22.6131319_at Cluster Incl. M20707:Human kappa-immunoglobulin germline pseudogene (Chr22.4) variable region (subgroup V kappa II) /cds=(0,320) /gb=M20707 /gi=185954 /ug=Hs.123030 /len=363252.78 441.07 143.32 400.01 373.4 105.06 105.72 87.02 110.75 161.69 84.88 240.91 210.5431320_at Cluster Incl. U18548:Human GPR12 G protein coupled-receptor gene, complete cds /cds=(15,1019) /gb=U18548 /gi=604499 /ug=Hs.123034 /len=1101101.42 89.07 79.51 100.69 120.06 116.74 121.41 134.74 131.36 137.4 114.15 119.89 126.7431321_at Cluster Incl. U41737:Human pancreatic beta cell growth factor (INGAP) mRNA, complete cds /cds=(5,520) /gb=U41737 /gi=1514681 /ug=Hs.123060 /len=586112.27 62.17 62.44 80.17 110.97 53.89 55.04 55.16 63.37 54.35 57.79 48.07 47.8931322_at Cluster Incl. X61079:Human mRNA for T cell receptor, clone IGRA24 /cds=(0,142) /gb=X61079 /gi=33521 /ug=Hs.123062 /len=23544.15 52.5 44.8 46.25 55.96 50.01 53.2 52.24 62.16 49.94 47.24 40.64 50.131323_r_atGlutamate transporter II variant B/HBGT IIB {5' region} [human, brain and spinal cord, mRNA Partial Mutant, 129 nt]141.44 177.7 138.58 142.61 167.28 169.49 199.64 185.22 218.79 196.56 150.14 185.24 226.3731324_at Cluster Incl. U82303:Homo sapiens unknown protein mRNA, partial cds /cds=(0,257) /gb=U82303 /gi=1938329 /ug=Hs.123080 /len=34470.87 57.8 61.61 65.93 84.05 106.41 106.73 87.01 112.12 78.47 111 89.08 100.5331325_at Cluster Incl. U82306:Homo sapiens unknown protein mRNA, partial cds /cds=(0,221) /gb=U82306 /gi=1938333 /ug=Hs.123081 /len=25368.63 167.66 69.04 112.84 120.46 126.72 107.04 100 116.83 207.5 125.65 155.19 102.5531326_at Cluster Incl. AF005081:Homo sapiens skin-specific protein (xp32) mRNA, partial cds /cds=(0,340) /gb=AF005081 /gi=2589189 /ug=Hs.123091 /len=416157.67 127.49 123.37 146.18 150.95 159.46 184.08 206.02 182.95 139.01 154.57 143.09 175.2731327_at Cluster Incl. AF015124:Homo sapiens IgG heavy chain variable region (Vh26) mRNA, partial cds /cds=(0,305) /gb=AF015124 /gi=2599349 /ug=Hs.123093 /len=34035.57 28.17 33.64 32.36 38.76 43.13 40.16 46.8 34.47 33.71 25.74 29.45 37.3731328_at solute carrier family 34 (sodium phosphate), member 161.61 48.23 50.76 58 57.58 57.91 69.91 72.34 70.29 54.98 59.74 45.55 63.0231329_at Human putative opioid receptor mRNA, complete cds12.23 18.91 15.36 19.99 21.15 15.9 20.76 22.26 16.15 28.86 13.59 16.06 23.8531330_at ribosomal protein S19108.87 133.3 89.84 113.02 147.61 169.87 156.81 136.47 153.07 220.54 220.96 332.11 18331331_at surfactant protein A binding protein28.21 17.99 23.56 26.37 30.35 28.84 31.54 35.06 22.53 24.45 23 21.37 30.8631332_at RIG-like 14-1 20.77 18.58 19.03 18.29 20.86 23.56 25.11 24.43 19.3 28.72 17.27 23.18 25.3531333_at tolloid-like 1 22.97 52.9 26.95 41.22 48.38 48.85 42.09 40.13 40.73 89.86 46.96 59.74 40.8731334_at G protein-coupled receptor 4598.57 100.16 78.76 119.09 118.58 97.42 110.67 104.95 143.47 111.28 102.88 115.9 133.1731335_at clone 1900 unknown protein65.79 54.3 54.79 57.23 60.75 66.98 72.89 86.97 76.34 57.65 59.83 49.21 70.8331336_at Cluster Incl. AC004076:Homo sapiens chromosome 19, cosmid R30217 /cds=(0,2075) /gb=AC004076 /gi=2822142 /ug=Hs.129709 /len=207640.97 26.15 32.55 26.26 33.73 36.15 36.03 34.93 24.12 26.26 22.55 23.64 28.11

5 normal sample and 9 myeloma (MM) samples 12558 genes (rows)

Genes of Interest

• Statistical significance: that the observed differential expression is unlikely to be due to chance.

• Scientific significance: that the observed level of differential expression is of sufficient magnitude to be of biological relevance.

Group 1 (N samples): X1, X2, … XN

Group 2 (M samples): Y1, Y2, … YM

Statistical significance in the two group problem

Assume

Yj ~ Normal (μ2, σ2)

Xi ~ Normal (μ1, σ2)

Null hypothesis: Group 1 is the “same” to Group 2

(i.e., μ1= μ2)

Parametric Test: t-test

Statistical significance in the two group problem

Yj ~ Normal (μ2, σ2)Xi ~ Normal (μ1, σ2)

Null hypothesis: μ1= μ2

M

s

N

s

YXt

22

* )(

2

)1()1( 22

212

MN

sMsNs

N

iiX

NX

1

1

M

iiY

MY

1

1

Test null hypothesis with test statistics:

N

ii XX

Ns

1

221 )(

1

1

M

ii YY

Ms

1

222 )(

1

1

)2(~* MNtt

Parametric Test: t-test

M

s

N

s

YXt

22

21

)('

1

)/(

1

)/()//(

' 222

221

222

21

M

Ms

N

NsMsNs

df

If variances are unequal

(1) When N+M > 30, this is approximately normal

(2) When 1 >> 2, this is approximately t(df = N–1)

(3) In general, Welch approximation: t’ ~ t(df’), where

Yj ~ Normal (μ2, σ22)

Xi ~ Normal (μ1, σ12)

σ1 σ2

Wilcoxon rank sum test

Consider row 7 of MM study

16 253 633 1008 708 36 72 28 14 33 19 49 58 23

13 4 3 1 2 8 5 10 14 9 12 7 6 11

---------------------------

rank sum = 23

This test is more appropriate than the t-tests when the underlying distribution is far from normal. (But it requires large group sizes)

P-value

• p-value = P(|T|>|t|) is calculated based on the distribution of T under the null hypothesis.

• p-value is a function of the test statistics and can be viewed as a random variable.– e.g. p-value = 2(1 - F(|t*|), F = cdf of t(N+M – 2).

• A small p-value represents evidence against the null hypothesis differentially expressed in our case.

Permutation test• A non-parametric way of computation p-value for any

test statistics.– In the MM-study, each gene has (14 choose 5) = 2002

different test values obtainable from permuting the group labels.

• Under the null hypothesis that the distribution for the two groups are identical, all these test values are equally probable. What is the probability of getting a test value at least as extreme as the observed one? This is the permutation p-value.

Permutation technique

Condition 0 Condition 1

Patient 4 Patient 2 Patient 3 Patient 1 Patient 5 Patient 6






Patient 1 Patient 2 Patient 3 Patient 4 Patient 5 Patient 6Compute TS0

Compute TS1

Compute TS2

Compute TS3

The set of TSi form the empirical distribution of the test statistic TS

Scientific Significance

• Fold change FC =

• May not be high when statistical significance is high.

• Not an appropriate measure if the dispersion is not taken into consideration.

/X Y

Conservative fold change

Conservative fold change (CFC) =

Max (25th percentile of sample 1 / 75th percentile of sample 2,

25th percentile of sample 2 / 75th percentile of sample 1)

xaxis

de

n1

98 100 102 104 106

0.0

0.1

0.2

0.3

0.4

Sample 1: Normal (100, 1)

Sample 2: Normal (103, 1)

CFC = 1.0164

90

10

011

01

20

10

02

00

30

04

00

50

04

06

08

01

00

12

01

40

16

0

CFC=3.53

50

10

01

50

20

02

50

30

0

CFC=1.07

CFC=2.89

CFC=1.45

P-values and FC contains different information

- log10(p.value.unequal.log)[idx]

fold

.qu

an

tile

[idx]

3 4 5 6 7

24

68

10

12

14

Gene Selection and Ranking

• A high threshold of statistical significance Select genes with p-values smaller than a threshold

• The selected genes are ordered according to their scientific significance (i.e. ranked by fold-changes)

The False Positive Rate (FPR)

• If we select genes with p-value < 0.01, then the probability of making a positive call when the gene is in fact not differential is less than 0.01. Thus selection by p-value controls the FPR.

• However, if we have 12,000 genes in a microarray, then a FPR = 0.01 still allows up to 120 false positives. To make sensible decision, we must take multiple comparisons into consideration.

Dealing with Multiple Comparison

• Bonferroni inequality: To control the family-wise error rate for testing m hypotheses at level α, we need to control the FPR for each individual test at α/m

• Then P(false rejection at least one hypothesis) < α

or P(no false rejection) > 1- α

• This is appropriate for some applications (e.g. testing a new drug versus several existing ones), but is too conservative for our task of gene selection.

Documents

Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005