Detecting Differentially Expressed Genes
Pengyu Hong, 09/13/2005
Background (Microarray)
Cells → Extract RNA
10^4+ genes
Background
Biological sample
RNA extraction (total RNA or mRNA)
Amplification (in vitro transcription)
Label samples
Hybridization
Washing and staining
Scanning
• Microarrays are highly noisy
• Use replicated experiments to make inferences about differential expression for the population from which the biological samples originate
biological variability
technical variability
Background
Normalization
Calculate Gene Expression Index
An Example
probe set | gene | Normal m412a | Normal m414a | Normal m416a | Normal m426a | Normal m430a | MM m282 | MM m331a | MM m332a | MM m333a | MM m334a | MM m353a | MM m408a | MM m423a
31307_at | pre-T/NK cell associated protein | 28.53 | 32.61 | 29.56 | 36.55 | 33.19 | 25.1 | 32.79 | 34.3 | 35.44 | 28.48 | 29.55 | 22.28 | 28.77
31308_at | pre-T/NK cell associated protein | 69.14 | 53.69 | 52.78 | 62.07 | 58.74 | 67.88 | 85.82 | 83.54 | 85.91 | 60.93 | 62.82 | 47.17 | 77.07
31309_r_at | Human breast cancer suppressor element Ishmael Upper CP1 mRNA, partial cds | 16.9 | 67.7 | 27.61 | 46.16 | 51.46 | 45.62 | 35.57 | 32.62 | 35.14 | 96.18 | 45.94 | 63.2 | 38.27
31310_at | glycine receptor, alpha 1 (startle disease/hyperekplexia, stiff man syndrome) | 67.42 | 49.55 | 55.51 | 59.57 | 68.42 | 91.06 | 91.23 | 83.66 | 76.37 | 71.23 | 74.95 | 74.04 | 100.77
31311_at | Homo sapiens cDNA FLJ40594 fis, clone THYMU2010671, highly similar to Homo sapiens T-cell receptor | 78.73 | 62.91 | 60.84 | 72.98 | 72.9 | 79.39 | 85.52 | 82.57 | 69.69 | 63.72 | 64.29 | 62.85 | 67.58
31312_at | potassium voltage-gated channel, Shab-related subfamily, member 2 | 66.65 | 59.46 | 55.47 | 61.75 | 69.92 | 75.28 | 85.53 | 97.91 | 69.92 | 74.77 | 71.83 | 58.17 | 72.15
31313_at | mannosyl (alpha-1,6-)-glycoprotein beta-1,6-N-acetyl-glucosaminyltransferase | 115.33 | 95.51 | 84.48 | 94.99 | 109.04 | 105.05 | 118.68 | 106.76 | 142.88 | 103.72 | 106.19 | 98.58 | 104.13
31314_at | bone morphogenetic protein 3 (osteogenic) | 71.89 | 36.24 | 41.86 | 46.99 | 45.94 | 46.67 | 67.56 | 66.14 | 53.95 | 40.97 | 47.96 | 43.63 | 53.54
31315_at | immunoglobulin lambda locus | 103.99 | 88.27 | 83.81 | 81.81 | 254.63 | 87.12 | 99.11 | 109.56 | 86.37 | 75.03 | 74.97 | 69.02 | 97.22
31316_at | Human vacuolar ATPase (isoform HO68) mRNA, complete cds | 16.79 | 10.08 | 9.53 | 16.48 | 11.98 | 12.8 | 16.7 | 18.76 | 11.25 | 12.09 | 18.89 | 10.81 | 19.49
31317_r_at | Human unproductively rearranged Ig mu-chain mRNA V-region (VD), 5' end, clone mu-3A1A | 316.75 | 269.61 | 254.92 | 352.61 | 342.4 | 327.12 | 366.39 | 346 | 308.43 | 279.81 | 312.4 | 318.06 | 334.27
31318_at | Stem cell factor {alternatively spliced} [human, preimplantation embryos, blastocysts, mRNA Partial, 180 nt] | 32.68 | 19.79 | 27.45 | 29.56 | 28.34 | 26.55 | 38.04 | 41.05 | 31.91 | 22.76 | 23.58 | 28.29 | 22.61
31319_at | Cluster Incl. M20707:Human kappa-immunoglobulin germline pseudogene (Chr22.4) variable region (subgroup V kappa II) /cds=(0,320) /gb=M20707 /gi=185954 /ug=Hs.123030 /len=363 | 252.78 | 441.07 | 143.32 | 400.01 | 373.4 | 105.06 | 105.72 | 87.02 | 110.75 | 161.69 | 84.88 | 240.91 | 210.54
31320_at | Cluster Incl. U18548:Human GPR12 G protein coupled-receptor gene, complete cds /cds=(15,1019) /gb=U18548 /gi=604499 /ug=Hs.123034 /len=1101 | 101.42 | 89.07 | 79.51 | 100.69 | 120.06 | 116.74 | 121.41 | 134.74 | 131.36 | 137.4 | 114.15 | 119.89 | 126.74
31321_at | Cluster Incl. U41737:Human pancreatic beta cell growth factor (INGAP) mRNA, complete cds /cds=(5,520) /gb=U41737 /gi=1514681 /ug=Hs.123060 /len=586 | 112.27 | 62.17 | 62.44 | 80.17 | 110.97 | 53.89 | 55.04 | 55.16 | 63.37 | 54.35 | 57.79 | 48.07 | 47.89
31322_at | Cluster Incl. X61079:Human mRNA for T cell receptor, clone IGRA24 /cds=(0,142) /gb=X61079 /gi=33521 /ug=Hs.123062 /len=235 | 44.15 | 52.5 | 44.8 | 46.25 | 55.96 | 50.01 | 53.2 | 52.24 | 62.16 | 49.94 | 47.24 | 40.64 | 50.1
31323_r_at | Glutamate transporter II variant B/HBGT IIB {5' region} [human, brain and spinal cord, mRNA Partial Mutant, 129 nt] | 141.44 | 177.7 | 138.58 | 142.61 | 167.28 | 169.49 | 199.64 | 185.22 | 218.79 | 196.56 | 150.14 | 185.24 | 226.37
31324_at | Cluster Incl. U82303:Homo sapiens unknown protein mRNA, partial cds /cds=(0,257) /gb=U82303 /gi=1938329 /ug=Hs.123080 /len=344 | 70.87 | 57.8 | 61.61 | 65.93 | 84.05 | 106.41 | 106.73 | 87.01 | 112.12 | 78.47 | 111 | 89.08 | 100.53
31325_at | Cluster Incl. U82306:Homo sapiens unknown protein mRNA, partial cds /cds=(0,221) /gb=U82306 /gi=1938333 /ug=Hs.123081 /len=253 | 68.63 | 167.66 | 69.04 | 112.84 | 120.46 | 126.72 | 107.04 | 100 | 116.83 | 207.5 | 125.65 | 155.19 | 102.55
31326_at | Cluster Incl. AF005081:Homo sapiens skin-specific protein (xp32) mRNA, partial cds /cds=(0,340) /gb=AF005081 /gi=2589189 /ug=Hs.123091 /len=416 | 157.67 | 127.49 | 123.37 | 146.18 | 150.95 | 159.46 | 184.08 | 206.02 | 182.95 | 139.01 | 154.57 | 143.09 | 175.27
31327_at | Cluster Incl. AF015124:Homo sapiens IgG heavy chain variable region (Vh26) mRNA, partial cds /cds=(0,305) /gb=AF015124 /gi=2599349 /ug=Hs.123093 /len=340 | 35.57 | 28.17 | 33.64 | 32.36 | 38.76 | 43.13 | 40.16 | 46.8 | 34.47 | 33.71 | 25.74 | 29.45 | 37.37
31328_at | solute carrier family 34 (sodium phosphate), member 1 | 61.61 | 48.23 | 50.76 | 58 | 57.58 | 57.91 | 69.91 | 72.34 | 70.29 | 54.98 | 59.74 | 45.55 | 63.02
31329_at | Human putative opioid receptor mRNA, complete cds | 12.23 | 18.91 | 15.36 | 19.99 | 21.15 | 15.9 | 20.76 | 22.26 | 16.15 | 28.86 | 13.59 | 16.06 | 23.85
31330_at | ribosomal protein S19 | 108.87 | 133.3 | 89.84 | 113.02 | 147.61 | 169.87 | 156.81 | 136.47 | 153.07 | 220.54 | 220.96 | 332.11 | 183
31331_at | surfactant protein A binding protein | 28.21 | 17.99 | 23.56 | 26.37 | 30.35 | 28.84 | 31.54 | 35.06 | 22.53 | 24.45 | 23 | 21.37 | 30.86
31332_at | RIG-like 14-1 | 20.77 | 18.58 | 19.03 | 18.29 | 20.86 | 23.56 | 25.11 | 24.43 | 19.3 | 28.72 | 17.27 | 23.18 | 25.35
31333_at | tolloid-like 1 | 22.97 | 52.9 | 26.95 | 41.22 | 48.38 | 48.85 | 42.09 | 40.13 | 40.73 | 89.86 | 46.96 | 59.74 | 40.87
31334_at | G protein-coupled receptor 45 | 98.57 | 100.16 | 78.76 | 119.09 | 118.58 | 97.42 | 110.67 | 104.95 | 143.47 | 111.28 | 102.88 | 115.9 | 133.17
31335_at | clone 1900 unknown protein | 65.79 | 54.3 | 54.79 | 57.23 | 60.75 | 66.98 | 72.89 | 86.97 | 76.34 | 57.65 | 59.83 | 49.21 | 70.83
31336_at | Cluster Incl. AC004076:Homo sapiens chromosome 19, cosmid R30217 /cds=(0,2075) /gb=AC004076 /gi=2822142 /ug=Hs.129709 /len=2076 | 40.97 | 26.15 | 32.55 | 26.26 | 33.73 | 36.15 | 36.03 | 34.93 | 24.12 | 26.26 | 22.55 | 23.64 | 28.11
5 normal samples and 9 myeloma (MM) samples; 12,558 genes (rows)
Genes of Interest
• Statistical significance: that the observed differential expression is unlikely to be due to chance.
• Scientific significance: that the observed level of differential expression is of sufficient magnitude to be of biological relevance.
Group 1 (N samples): X1, X2, … XN
Group 2 (M samples): Y1, Y2, … YM
Statistical significance in the two group problem
Assume
Xi ~ Normal (μ1, σ²)
Yj ~ Normal (μ2, σ²)
Null hypothesis: Group 1 is the “same” as Group 2
(i.e., μ1 = μ2)
Parametric Test: t-test
Statistical significance in the two group problem
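Xi ~ Normal (μ1, σ²), Yj ~ Normal (μ2, σ²)
Null hypothesis: μ1 = μ2
Sample means:
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad \bar{Y} = \frac{1}{M}\sum_{j=1}^{M} Y_j$$
Sample variances:
$$s_1^2 = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^2, \qquad s_2^2 = \frac{1}{M-1}\sum_{j=1}^{M} (Y_j - \bar{Y})^2$$
Pooled variance:
$$s^2 = \frac{(N-1)\,s_1^2 + (M-1)\,s_2^2}{N+M-2}$$
Test the null hypothesis with the test statistic:
$$t^* = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s^2}{N} + \dfrac{s^2}{M}}}$$
Under the null hypothesis:
$$t^* \sim t(N+M-2)$$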
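As an illustration of the equal-variance t statistic defined on this slide, here is a minimal Python sketch. The function name `pooled_t_statistic` and the toy numbers are assumptions made for this example; they are not taken from the MM study.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Return (t_star, df) for the equal-variance two-sample t-test."""
    n, m = len(x), len(y)
    s1_sq = variance(x)  # sample variance of group 1, divisor N - 1
    s2_sq = variance(y)  # sample variance of group 2, divisor M - 1
    # Pooled variance estimate s^2.
    pooled = ((n - 1) * s1_sq + (m - 1) * s2_sq) / (n + m - 2)
    t_star = (mean(x) - mean(y)) / math.sqrt(pooled / n + pooled / m)
    return t_star, n + m - 2


# Toy data (hypothetical values, chosen so the arithmetic is easy to check).
group1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group2 = [2.0, 3.0, 4.0, 5.0, 6.0]
t_star, df = pooled_t_statistic(group1, group2)
print(t_star, df)
```

Both toy groups have sample variance 2.5, so the pooled variance is 2.5 and the denominator is 1, which makes the statistic easy to verify by hand.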
Parametric Test: t-test
If variances are unequal
Xi ~ Normal (μ1, σ1²), Yj ~ Normal (μ2, σ2²), σ1 ≠ σ2
$$t' = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s_1^2}{N} + \dfrac{s_2^2}{M}}}$$
(1) When N+M > 30, t' is approximately normal
(2) When σ1 >> σ2, t' is approximately t(df = N–1)
(3) In general, Welch approximation: t' ~ t(df'), where
$$df' = \frac{\left(s_1^2/N + s_2^2/M\right)^2}{\dfrac{(s_1^2/N)^2}{N-1} + \dfrac{(s_2^2/M)^2}{M-1}}$$
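The unequal-variance statistic and the Welch degrees of freedom can be sketched in Python as follows. The function name `welch_t_statistic` and the toy data are hypothetical, introduced only for illustration.

```python
import math
from statistics import mean, variance


def welch_t_statistic(x, y):
    """Return (t_prime, df_prime) for the unequal-variance (Welch) t-test."""
    n, m = len(x), len(y)
    a = variance(x) / n  # s1^2 / N
    b = variance(y) / m  # s2^2 / M
    t_prime = (mean(x) - mean(y)) / math.sqrt(a + b)
    # Welch approximation to the degrees of freedom.
    df_prime = (a + b) ** 2 / (a ** 2 / (n - 1) + b ** 2 / (m - 1))
    return t_prime, df_prime


# Toy data (hypothetical). Equal variances and equal sizes: df' equals N + M - 2.
t_eq, df_eq = welch_t_statistic([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
# Very unequal variances: df' drops toward the size of the noisier group minus 1.
t_uneq, df_uneq = welch_t_statistic([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0])
print(t_eq, df_eq)
print(t_uneq, df_uneq)
```

In the second call the second group dominates the variance, so df' falls close to M − 1 = 4. This is case (2) of the slide with the roles of the two groups swapped.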
Wilcoxon rank sum test
Consider row 7 of the MM study
Expression: 16 253 633 1008 708 36 72 28 14 33 19 49 58 23
Rank (1 = largest): 13 4 3 1 2 8 5 10 14 9 12 7 6 11
---------------------------
rank sum of the first five samples = 13 + 4 + 3 + 1 + 2 = 23
This test is more appropriate than the t-tests when the underlying distribution is far from normal (but it requires large group sizes).
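The ranks and the rank sum for this row can be reproduced with a few lines of Python. This is a sketch that assumes the first five values belong to one group (the slide's rank sum of 23 matches that reading) and that ranks run from largest (1) to smallest (14), as on the slide.

```python
# Row 7 of the MM study, as listed on the slide.
values = [16, 253, 633, 1008, 708, 36, 72, 28, 14, 33, 19, 49, 58, 23]
n_first_group = 5  # assumption: the first five columns form one group

# Rank 1 goes to the largest value; there are no ties in this row.
descending = sorted(values, reverse=True)
ranks = [descending.index(v) + 1 for v in values]

rank_sum = sum(ranks[:n_first_group])
print(ranks)
print(rank_sum)
```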
P-value
• p-value = P(|T| > |t|), calculated from the distribution of T under the null hypothesis.
• The p-value is a function of the test statistic and can be viewed as a random variable.
– e.g. p-value = 2(1 − F(|t*|)), where F is the cdf of t(N+M–2).
• A small p-value represents evidence against the null hypothesis, i.e., evidence that the gene is differentially expressed in our case.
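The two-sided formula 2(1 − F(|t|)) can be sketched with the Python standard library. The standard library has no t-distribution cdf, so this example substitutes the standard normal cdf for F, which is only reasonable as a large-sample approximation. The function name `two_sided_p_normal` is hypothetical.

```python
from statistics import NormalDist


def two_sided_p_normal(t):
    """Two-sided p-value 2 * (1 - F(|t|)) with F taken as the standard normal cdf.

    Assumption: a normal approximation replaces the t(N + M - 2) cdf,
    which is only adequate when the groups are large.
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


print(two_sided_p_normal(1.96))  # close to 0.05
print(two_sided_p_normal(-1.96))  # same value: the test is two-sided
```

For the small groups of the MM study (5 versus 9 samples), the t cdf should be used instead, and the normal version will understate the p-value.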
Permutation test
• A non-parametric way of computing a p-value for any test statistic.
– In the MM study, each gene has (14 choose 5) = 2002 different test values obtainable from permuting the group labels.
• Under the null hypothesis that the distributions for the two groups are identical, all these test values are equally probable. What is the probability of getting a test value at least as extreme as the observed one? This is the permutation p-value.
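An exact permutation p-value can be computed by enumerating all (14 choose 5) = 2002 labelings. The sketch below uses the row-7 values quoted on the Wilcoxon slide and assumes the first five belong to one group. The choice of test statistic (difference of group means) is an assumption made for illustration; any statistic could be used.

```python
from itertools import combinations
from statistics import mean

# Row 7 of the MM study (values from the Wilcoxon slide).
values = [16, 253, 633, 1008, 708, 36, 72, 28, 14, 33, 19, 49, 58, 23]
n_group1 = 5  # assumption: the first five columns form group 1


def mean_difference(group1_indices):
    """Test statistic: mean of group 1 minus mean of the remaining samples."""
    chosen = set(group1_indices)
    g1 = [values[i] for i in group1_indices]
    g2 = [values[i] for i in range(len(values)) if i not in chosen]
    return mean(g1) - mean(g2)


observed = mean_difference(tuple(range(n_group1)))

# Every way of assigning the label "group 1" to five of the 14 samples.
labelings = list(combinations(range(len(values)), n_group1))
statistics_under_null = [mean_difference(c) for c in labelings]

# Two-sided: count labelings at least as extreme as the observed one.
n_extreme = sum(abs(s) >= abs(observed) - 1e-9 for s in statistics_under_null)
p_perm = n_extreme / len(labelings)
print(len(labelings), n_extreme, p_perm)
```

The observed labeling is itself one of the 2002, so the permutation p-value can never be smaller than 1/2002.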
Permutation technique
Original labels: Condition 0 = Patient 1, Patient 2, Patient 3; Condition 1 = Patient 4, Patient 5, Patient 6 → Compute TS0
Permutation 1: Condition 0 = Patient 4, Patient 2, Patient 3; Condition 1 = Patient 1, Patient 5, Patient 6 → Compute TS1
Permutation 2: Condition 0 = Patient 1, Patient 2, Patient 5; Condition 1 = Patient 4, Patient 3, Patient 6 → Compute TS2
Permutation 3: Condition 0 = Patient 1, Patient 6, Patient 3; Condition 1 = Patient 4, Patient 5, Patient 2 → Compute TS3
The set of TSi forms the empirical distribution of the test statistic TS
Scientific Significance
• Fold change FC = $\bar{X} / \bar{Y}$
• May not be high when statistical significance is high.
• Not an appropriate measure if the dispersion is not taken into consideration.
Conservative fold change
Conservative fold change (CFC) =
Max (25th percentile of sample 1 / 75th percentile of sample 2,
25th percentile of sample 2 / 75th percentile of sample 1)
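A minimal sketch of the plain fold change and the conservative fold change follows. Percentiles can be estimated in several ways; this example uses `statistics.quantiles` with its default method, which is an assumption and not necessarily what the slides used. The function names and toy data are hypothetical.

```python
from statistics import mean, quantiles


def fold_change(x, y):
    """Plain fold change: ratio of the two group means."""
    return mean(x) / mean(y)


def conservative_fold_change(x, y):
    """CFC = max(Q1(x) / Q3(y), Q1(y) / Q3(x)), using sample quartiles."""
    q1_x, _, q3_x = quantiles(x, n=4)  # 25th, 50th, 75th percentiles of x
    q1_y, _, q3_y = quantiles(y, n=4)  # 25th, 50th, 75th percentiles of y
    return max(q1_x / q3_y, q1_y / q3_x)


# Toy data (hypothetical): sample_b is roughly twice sample_a.
sample_a = [9.0, 10.0, 10.0, 11.0, 12.0]
sample_b = [18.0, 20.0, 21.0, 22.0, 25.0]
print(fold_change(sample_b, sample_a))
print(conservative_fold_change(sample_a, sample_b))
```

Because the CFC compares the low end of one group with the high end of the other, it is large only when the two groups are well separated relative to their spread.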
[Figure: density curves of the two samples listed below; x-axis from 98 to 106, density from 0.0 to 0.4]
Sample 1: Normal (100, 1)
Sample 2: Normal (103, 1)
CFC = 1.0164
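The value CFC = 1.0164 can be checked from the theoretical quartiles of the two normal distributions. This sketch assumes the quoted figure was computed from population percentiles, not from simulated samples.

```python
from statistics import NormalDist

sample1 = NormalDist(mu=100, sigma=1)
sample2 = NormalDist(mu=103, sigma=1)

# 25th percentile of one distribution over the 75th percentile of the other.
cfc = max(
    sample1.inv_cdf(0.25) / sample2.inv_cdf(0.75),
    sample2.inv_cdf(0.25) / sample1.inv_cdf(0.75),
)
print(round(cfc, 4))  # 1.0164
```

The two distributions are three standard deviations apart, so the difference is statistically clear, yet the fold change is under 2%.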
[Figure: example expression profiles, with panels labeled CFC=3.53, CFC=1.07, CFC=2.89, and CFC=1.45]
P-values and FC contain different information
[Figure: scatter plot of fold.quantile[idx] (y-axis) against -log10(p.value.unequal.log)[idx] (x-axis)]
Gene Selection and Ranking
• Apply a high threshold of statistical significance: select genes with p-values smaller than a threshold
• The selected genes are ordered according to their scientific significance (i.e. ranked by fold-changes)
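The two-step procedure above (filter by p-value, then rank by fold change) can be sketched as follows. The gene records, field names, and threshold are hypothetical. Ranking by the larger of FC and 1/FC, so that up- and down-regulation are treated alike, is an assumption of this sketch.

```python
def select_and_rank(genes, p_threshold):
    """Keep genes with p < p_threshold, then order them by fold-change magnitude."""
    selected = [g for g in genes if g["p"] < p_threshold]
    # A fold change of 0.25 is as large an effect as 4.0, so compare max(fc, 1 / fc).
    return sorted(selected, key=lambda g: max(g["fc"], 1.0 / g["fc"]), reverse=True)


# Hypothetical results for four probe sets (made-up numbers).
genes = [
    {"probe": "probe_A", "p": 0.0004, "fc": 1.2},
    {"probe": "probe_B", "p": 0.0300, "fc": 5.0},
    {"probe": "probe_C", "p": 0.0009, "fc": 0.25},
    {"probe": "probe_D", "p": 0.0001, "fc": 2.5},
]

ranked = select_and_rank(genes, p_threshold=0.001)
print([g["probe"] for g in ranked])
```

Here probe_B has the largest fold change but is dropped because it fails the significance threshold.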
The False Positive Rate (FPR)
• If we select genes with p-value < 0.01, then the probability of making a positive call when the gene is in fact not differential is less than 0.01. Thus selection by p-value controls the FPR.
• However, if we have 12,000 genes in a microarray, then an FPR of 0.01 still allows up to 12,000 × 0.01 = 120 false positives on average. To make sensible decisions, we must take multiple comparisons into consideration.
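A small simulation illustrates the point. Under the null hypothesis, p-values from a continuous test statistic are uniformly distributed on [0, 1], so drawing 12,000 uniform numbers mimics an array in which no gene is differential. The seed and variable names are arbitrary choices for this sketch.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n_genes = 12000
alpha = 0.01

# Simulated p-values for genes that are all truly non-differential.
null_p_values = [rng.random() for _ in range(n_genes)]

false_positives = sum(p < alpha for p in null_p_values)
expected = n_genes * alpha
print(false_positives, expected)
```

The simulated count fluctuates around the expected 120 from run to run, even though every one of those calls is a false positive.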
Dealing with Multiple Comparison
• Bonferroni inequality: to control the family-wise error rate at level α when testing m hypotheses, we need to control the FPR of each individual test at α/m
• Then P(at least one false rejection) ≤ α
or P(no false rejection) ≥ 1 − α
• This is appropriate for some applications (e.g. testing a new drug versus several existing ones), but is too conservative for our task of gene selection.
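The Bonferroni rule reduces to a one-line threshold. The sketch below shows how strict it becomes for an array of about 12,000 genes. The function names, the choice α = 0.05, and the toy p-values are hypothetical.

```python
def bonferroni_threshold(alpha, m):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / m


def bonferroni_select(p_values, alpha):
    """Indices of hypotheses rejected under the Bonferroni correction."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < cutoff]


# With 12,000 genes and alpha = 0.05, each gene must reach p < about 4.2e-6.
per_gene_cutoff = bonferroni_threshold(0.05, 12000)
print(per_gene_cutoff)

# Toy example with four tests: the cutoff is 0.05 / 4 = 0.0125.
rejected = bonferroni_select([1e-7, 0.02, 0.04, 0.5], alpha=0.05)
print(rejected)
```

With only 5 and 9 samples per group, few genes can reach a cutoff this small, which is one way to see why the slide calls the correction too conservative for gene selection.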