An Analysis of MicroArray Quality Control Data

James J. Chen, Ph.D.
Division of Biometry and Risk Assessment
National Center for Toxicological Research
U.S. Food and Drug Administration

2006 FDA and Industry Workshop
September 29, 2006

The views expressed in this presentation do not represent those of the U.S. Food and Drug Administration

Outline

Background: MAQC experimental design and data

Microarray platform comparisons
  Inter-platform analysis
  Intra-platform analysis and platform's performance
    concordance, site effects, consistency, discriminability
    sensitivity, specificity, and accuracy in gene selection
    self-consistency of titration mixture

TaqMan and microarray platforms comparability

Conclusion

MicroArray Quality Control Project

Objective: To compare expression data generated at multiple test sites (labs) using several microarray-based and alternative technology platforms

Microarray platforms:
  Applied Biosystems  ABI (1)
  Affymetrix  AFX (1)
  Agilent  AGI (1, 2)
  Eppendorf  EPP (1)
  GE Healthcare  GEH (1)
  Illumina  ILM (1)
  NCI_Operon  NCI (2)

Alternative platforms:
  Applied Biosystems (TAQ)
  Panomics (QGN)
  Gene Express (GEX)

Nature Biotechnology 24(9), September 2006

MAQC Experimental Design

Four RNA samples:
  Sample A: Universal human reference RNA (Stratagene)
  Sample B: Human brain reference RNA (Ambion)
  Sample C: 75% A + 25% B
  Sample D: 25% A + 75% B

Three sites for each microarray platform (NCI: 2 sites)
One site each for TAQ, QGN, and GEX
Five technical replicates for each microarray platform
Four replicates for TAQ; three replicates for QGN and GEX

EPP: 294 target genes; QGN: 245; GEX: 205

MAQC Data Used for Comparisons

Platform   Probes   Sites   Arrays (2)   Replicates (1)   Samples
ABI        32,878   3       58           5                4
AFX        54,675   3       60           5                4
AGI        43,931   3       56           5                4
GEH        54,359   3       60           5                4
ILM        47,293   3       59           5                4
TAQ         1,004   1       N/A          4                4

(1) Technical replicates. (2) A total of 293 arrays.

12,091 common genes among microarray platforms.
906 TAQ genes are among the 12,091 genes.

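Hierarchical clustering of 293 arrays on 12,091 genes, based on all pairwise correlations between two arrays.

[Figure: clustering dendrogram with arrays labeled by platform (AFX, ABI, AGL, GEH, ILM), sample (A, B, C, D), and site (Site 1, Site 2, Site 3).]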

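Concordance: all pairwise inter-platform sample correlation coefficients between two arrays from different platforms.

Up to 2250 (10 x 15 x 15) correlations computed for each sample.

[Figure: box plots of the correlation coefficients for samples A, B, C, and D; y-axis from 0.5 to 0.8. Values annotated on the plot: .74, .70, .71, .68, .82, .45.]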
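The pairing behind these counts can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names and the input layout (a mapping from platform name to a list of arrays, each array a list of values over the common genes) are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def inter_platform_correlations(arrays_by_platform):
    """Correlate every array on one platform with every array on another.

    arrays_by_platform (hypothetical layout) maps a platform name to the
    list of arrays for ONE sample; each array is a list of expression
    values over the common genes, in the same gene order.
    """
    results = []
    for p, q in combinations(sorted(arrays_by_platform), 2):
        for x in arrays_by_platform[p]:
            for y in arrays_by_platform[q]:
                results.append((p, q, pearson(x, y)))
    return results

# Count check: 5 platforms give 10 platform pairs; 3 sites x 5 replicates
# give up to 15 arrays per platform for each sample.
n_pairs = len(list(combinations(["ABI", "AFX", "AGI", "GEH", "ILM"], 2)))
max_correlations = n_pairs * 15 * 15
```

The count is an upper bound ("up to") because some platforms contributed fewer than 15 arrays per sample.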

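Concordance: all pairwise inter-platform fold-change correlation coefficients between two arrays from different platforms.

90 (10 x 3 x 3) correlations for each fold-change.

[Figure: box plots of the correlation coefficients for the fold changes B/A, C/A, D/A, C/B, D/B, and C/D; y-axis from 0.6 to 0.9. Values annotated on the plot: .85, .75, .82, .78, .84, .78, .92, .53.]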

Cross Platform Consistency

Proportion of genes showing a significant platform*sample interaction in the gene-by-gene ANOVA:

y = m + P + Sample + P*Sample + e

Significant interaction: the patterns of expression of the four samples are inconsistent across the platforms.
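For a single gene, the interaction test in this model can be sketched as below. The sketch assumes a balanced layout (the same number of replicates in every platform-by-sample cell), which is a simplification of the actual data, and it stops at the F statistic: the p-value would come from the F distribution with the returned degrees of freedom, which is left to a statistics library. The function name and the toy values are hypothetical.

```python
def interaction_f(cells):
    """F statistic for the platform*sample interaction, balanced layout.

    cells[i][j] is the list of replicate values for platform i and
    sample j; every cell must hold the same number (>= 2) of replicates.
    Returns (F, df_interaction, df_error).
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    cell_mean = [[sum(c) / n for c in row] for row in cells]
    row_mean = [sum(r) / b for r in cell_mean]
    col_mean = [sum(cell_mean[i][j] for i in range(a)) / a for j in range(b)]
    grand = sum(row_mean) / a

    # Interaction sum of squares: what remains of each cell mean after
    # removing the additive platform and sample effects.
    ss_int = n * sum(
        (cell_mean[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    # Error sum of squares: replicate scatter around each cell mean.
    ss_err = sum(
        (y - cell_mean[i][j]) ** 2
        for i in range(a) for j in range(b) for y in cells[i][j]
    )
    df_int, df_err = (a - 1) * (b - 1), a * b * (n - 1)
    return (ss_int / df_int) / (ss_err / df_err), df_int, df_err

# Hypothetical gene, 2 platforms x 2 samples x 2 replicates.
# Same sample pattern on both platforms (only a platform shift):
consistent = [[[1.0, 1.2], [3.0, 3.2]], [[2.0, 2.2], [4.0, 4.2]]]
# Sample pattern reversed on the second platform:
inconsistent = [[[1.0, 1.2], [3.0, 3.2]], [[4.0, 4.2], [2.0, 2.2]]]

f_consistent, _, _ = interaction_f(consistent)
f_inconsistent, df_int, df_err = interaction_f(inconsistent)
```

A platform that reads uniformly higher or lower than another does not inflate this statistic; only a change in the pattern across samples does, which is what the slide calls inconsistency.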

Plot of the p-values versus ranking proportions

[Figure: proportion of significant genes (y-axis, 0.2 to 1.0) against log10 p (x-axis, -10 to 0).]

The proportion of significant genes is 30% at α = 0.01.

Examples of inconsistency (p < 0.01) and consistency (p > 0.01)

[Figure: four panels showing expression (y-axis, -1 to 3) of samples A, B, C, and D on each platform (AFX, ABI, AG1, ILM, GEH).]

Inconsistency: Gene5, p-value < 10^-11; Gene21, p-value = 0.001
Consistency: Gene312, p-value = 0.11; Gene15, p-value = 0.991

Intra-Platform Analysis

Concordance: all pairwise correlations between two arrays from different sites for samples A, B, C, and D (3 x 5 x 5 correlations).

Site effects: ANOVA: y = m + sample + site + sample*site + e
  Site effect: the variance ratio, F = MSE_site / MSE_e.
  Consistency: the proportion of genes shown to have a significant sample*site interaction.

Discriminability: ANOVA: y = m + sample + e
  Variability: the residual mean square (total variation other than sample differences).
  Discriminability: the proportion of genes shown to have significant sample effects.
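The two quantities from the one-way model can be computed for one gene as sketched below. This is an illustration of the standard one-way ANOVA decomposition with made-up values; the function name is hypothetical, and the significance call would still require the F distribution with (k - 1, N - k) degrees of freedom.

```python
def one_way_anova(groups):
    """Return (F, residual mean square) for the model y = m + sample + e.

    groups is a list of lists: one list of replicate values per sample.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-sample and within-sample sums of squares.
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    mse = ss_within / (n_total - k)
    return (ss_between / (k - 1)) / mse, mse

# Hypothetical values for samples A, B, C, D (three arrays each).
f_stat, mse = one_way_anova([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
```

A low residual mean square together with a large F corresponds to the situation the slide describes: little variation beyond the sample differences, and clear separation of the samples.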

Individual Platform’s Performance

Reproducibility and consistency: median correlation (A, B, C, D), Site Fm, Cons'y
Performance: MSE, Discr'ty

Platform   A      B      C      D      Site Fm   Cons'y   MSE    Discr'ty
AFX        .988   .988   .991   .992   24.       .012     .066   .618
ABI        .968   .964   .972   .969   15.       .008     .107   .620
AG1        .978   .982   .982   .981   28.       .063     .090   .633
ILM        .980   .979   .980   .981   242.      .020     .266   .441
GEH        .925   .904   .872   .862   64.       .097     .267   .453

Gold Standard Set

A gene is differentially expressed if it was shown to be significant in at least 2 of the 5 platforms at α = 10^-5.

H0: A - B = 0 versus H1: A - B ≠ 0 (8265 genes were selected)

A gene is non-differentially expressed if its fold change was shown to be between 0.90 and 1/0.90 in at least 2 of the 5 platforms at α = 10^-3. Let δ = -log2(0.90).

Equivalence test: H0: |A - B| > δ versus H1: |A - B| < δ (498 genes were selected)

Gold Standard: 8607 genes (delete 78 overlaps)
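The numbers on this slide can be checked with a few lines of arithmetic. The sketch below assumes the equivalence margin is the log2 of the fold-change bound, and that each of the 78 overlapping genes was dropped from both lists; the slide does not state the second point, but it is the reading under which the totals agree. The variable names are illustrative.

```python
from math import log2

# Equivalence margin on the log2 scale for a fold change
# between 0.90 and 1/0.90.
delta = -log2(0.90)          # about 0.152

n_de = 8265                  # genes selected as differentially expressed
n_non_de = 498               # genes selected as non-differentially expressed
n_overlap = 78               # genes that met both criteria

# Assumption: an overlapping gene is removed from both lists,
# so it is subtracted twice.
gold_standard = n_de + n_non_de - 2 * n_overlap
```

Under that assumption the gold standard holds 8187 differentially expressed genes and 420 non-differentially expressed genes.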

Accuracy (AC), sensitivity (SN), specificity (SP), and FDR, using FWE = 0.05* and FDR = 0.05 as thresholds.

           FWE = 0.05*               FDR = 0.05
Platform   AC    SN    SP    FDR     AC    SN    SP    FDR
AFX        .77   .76   .95   .004    .92   .94   .55   .024
ABI        .74   .73   .95   .004    .89   .91   .59   .023
AG1        .81   .80   .80   .003    .92   .94   .55   .024
ILM        .55   .53   1.0   .000    .88   .88   .95   .023
GEH        .54   .52   .95   .005    .82   .82   .69   .019

* α = 0.05/8607 = 5.8 x 10^-6
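The four summary measures and the FWE threshold can be sketched as follows. The slide does not spell out its formulas, so the standard definitions are assumed here, with the gold standard set supplying the truth; the function name and the confusion counts in the usage lines are made up for illustration.

```python
def selection_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and FDR from confusion counts.

    tp / fn: gold-standard differentially expressed genes that were
             selected / missed.
    fp / tn: gold-standard non-differentially expressed genes that were
             selected / not selected.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, sensitivity, specificity, fdr

# Bonferroni-style per-gene level for FWE = 0.05 over 8607 genes.
alpha_fwe = 0.05 / 8607      # about 5.8e-6

# Hypothetical counts, for illustration only.
ac, sn, sp, fdr = selection_metrics(tp=80, fp=2, tn=18, fn=20)
```

Because the gold standard contains far more differentially expressed genes than non-differentially expressed ones, accuracy tracks sensitivity closely, which matches the pattern in the table.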

Comment on MAQC: Gene Selection

The MAQC project used technical replicates (small variance) with two distinct biological samples (large difference).

The number of differentially expressed genes is much larger than in typical microarray experiments.

Generating a gene list is not a problem; the problem is determining the number of genes in the list.

General principle: to identify a list of differentially expressed genes as accurately as possible.

Reproducibility of lists of differentially expressed genes – Percentage of Overlapping Genes (POG)

For AFX, 6319 genes have p < 10^-5; 4370 genes have FC > 2.

For ABI, 6127 genes have p < 10^-5; 4835 genes have FC > 2.

More than 4,000 genes can be selected with an FDR estimate of less than 2/4,000.

Source: Fig. S2 of the MAQC supplements.
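A percentage of overlapping genes between two selected lists can be computed as sketched below. The slide does not define the denominator, so this sketch assumes the overlap is expressed relative to the shorter list (the two choices coincide when the lists have equal length). The function name and gene identifiers are hypothetical.

```python
def pog(list_a, list_b):
    """Percentage of overlapping genes between two non-empty gene lists.

    Assumption: the overlap is divided by the size of the shorter list.
    """
    a, b = set(list_a), set(list_b)
    return 100.0 * len(a & b) / min(len(a), len(b))

# Hypothetical gene lists from two sites or platforms.
site1 = ["g1", "g2", "g3", "g4"]
site2 = ["g2", "g3", "g4", "g5"]
overlap = pog(site1, site2)
```

The value depends on how many genes are placed in each list, which is why the choice of list length matters more here than the ability to produce a list.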

Assessment of Titration Trend

Titration correlations:
  0.75A + 0.25B and C
  0.25A + 0.75B and D

Titration model (a two-step test):

The titration relationship can be modelled by M1t: y = m + Conc + Site + e

Full ANOVA model M1: y = m + Sample + Site + e

S1: Test for sample difference under M1: H0t1: A = B = C = D

S2: Test for goodness of fit: H0t2: M1t = M1

Proportion of genes that reject H0t1 and accept H0t2.
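The titration correlation can be sketched as the correlation between the intensities expected from the mixing proportions and the intensities observed in the mixture. The sketch assumes mixing is applied on the raw intensity scale and uses made-up values for four genes; the function names are illustrative.

```python
from math import sqrt

def expected_mixture(a, b, frac_a):
    """Per-gene expected intensity of a mixture with fraction frac_a of A."""
    return [frac_a * x + (1 - frac_a) * y for x, y in zip(a, b)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical intensities for four genes.
sample_a = [100, 200, 400, 800]
sample_b = [300, 100, 400, 200]
observed_c = [160, 170, 390, 640]      # measured on sample C (75% A + 25% B)

expected_c = expected_mixture(sample_a, sample_b, 0.75)
titration_corr_c = pearson(expected_c, observed_c)
```

The same call with a fraction of 0.25 gives the expected intensities for sample D.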

Linear Titration Model

[Figure: three example panels of intensity (y-axis, 2 to 12) against the proportion of sample A (x-axis: B_0, D_0.25, C_0.75, A_1).]

H0t1 accepted: p1 = 0.316
H0t1 rejected, H0t2 accepted: p1 < 0.0001, p2 = 0.108
H0t1 rejected, H0t2 rejected: p1 < 0.0001, p2 < 0.0001

Titration correlation for samples C and D, and the proportions of the genes that follow the titration relationship.

           Correlation             Titration model
Platform   Sample C   Sample D     (5%, 5%)   (1%, 1%)
AFX        .909       .911         .963       .976
ABI        .916       .928         .954       .967
AG1        .930       .939         .923       .944
ILM        .930       .936         .937       .954
GEH        .923       .934         .988       .988

TaqMan and microarray platform concordance: box plots of all pairwise sample correlation coefficients.

60 (4 x 15) correlations computed in each sample.

[Figure: "Corre. of TAQ v.s. microarrays"; box plots by platform (AFX, ABI, AG1, ILM, GEH); y-axis from 0.50 to 0.80. Values annotated on the plot: .78, .62, .77, .75, .74, .76, .66, .71, .74, .71, .52, .80.]

TaqMan and microarray platform concordance: box plots of fold-change (B/A) correlation coefficients.

[Figure: "Corre. of TAQ v.s. microarrays: B/A"; box plots by platform (AFX, ABI, AG1, ILM, GEH); y-axis from 0.82 to 0.90. Values annotated on the plot: .86, .88, .89, .86, .89, .82, .90.]

Consistency of TaqMan and Microarray platforms

Proportions of significant genes, TaqMan and microarray, for the five platforms: 0.72, 0.57, 0.49, 0.65, 0.39
Proportion of significant genes among microarray platforms: 0.30

[Figure: two example panels, "microarray platforms" and "TaqMan and microarray", showing expression (y-axis, 0.0 to 2.0) of samples A, B, C, and D on AFX, ABI, AG1, ILM, and GEH. Annotated p-values: 0.74; 10^-10, 10^-8, 10^-4, 10^-9, 10^-7.]

Conclusion (1)

Inter-platform (microarray and TaqMan):

Concordance
  Sample correlations: 0.45 (D) to 0.82 (A)
  FC correlations: higher for B/A; lower for C/A

Inconsistency
  Microarray platforms: Thirty percent (30%) of genes show inconsistent expression patterns at α = 0.01.
  TaqMan and microarray platforms: The proportions are between 0.34 and 0.74 for the five platforms.

Comparability
  Intensities measured by different microarray platforms, and measured between microarray and TaqMan platforms, are different.

Conclusion (2)

Titration trend
  Titration correlation: The correlations between observed intensity and expected intensity are more than 90%.
  Titration trend: All five platforms follow the linear titration relationship well.

Intra-microarray platforms' performance
  Concordance: Intra-platform correlations are high.
  Site effect: All platforms show site effects.
  Consistency: The patterns of expression are consistent across three sites.
