3. Microarrays: experimental design, statistical analysis and gene clustering

John Bennett, International Rice Research Institute

Los Baños, Philippines

WUEMED – Drought Course

3. Microarrays: experimental design, statistical analysis and gene clustering

IRRI

3.1: Experimental overview

Photolithography for oligonucleotide synthesis

              | Applied Biosystems | Agilent | TaqMan Gene Expression Assays
Platforms     | Human Genome Survey Microarray | Human Whole Genome Arrays | TaqMan Expression
Technology    | Hybridization | Comparative Hybridization | 5' Nuclease Chemistry & Real Time PCR
Probe (bases) | 60mer | 60mer | TaqMan Primer & Probes
Substrate     | Nylon | Glass slide | -
Deposition    | Contact Spotting | In-situ Ink Jet Printing | -
Detection     | One-color Chemilumin. | Two-color Cy3/Cy5 Fluorescence | One-color FAM Fluorescence
Software      | 1700 Chemilumin. Microarray Analyzer Software v 1.1 | Feature Extraction A 7.5.1 | SDS 2.1
Total Probes  | 33,096 probes | 44,000 probes | 1375 Selected Targets

Overview of the two microarray platforms and TaqMan® Gene Expression Assay based real-time PCR

Wang et al. (2006). Large scale real-time PCR validation on gene expression measurements from two commercial long-oligonucleotide microarrays. BMC Genomics 7: 59.

Management of library of cDNA clones with Biomek 2000 liquid transferring system

PCR and library replication use Biomek

Slide printing using GeneTAC printer

Microtiter plates Glass slides

3000 spots per slide

PCR products from >9000 genes

UV cross-linking after printing

Reverse transcription and labeling with fluorescent dyes

Smears showing labeled reverse transcripts

Un-incorporated dyes

GeneTAC Hyb station

Automated and manual hybridization chambers

Manual hyb chamber in water bath

Slide scanning with ScanArray

10K rice panicle cDNA library printed at IRRI

59K oligo array from BGI, Beijing

22K chips from Agilent

Images captured by scanner

Quantification: gridding

3.2: Experimental design

Steps in microarray analysis

Steps
• Biological experiment
• Sample collection
• RNA extraction
• RNA labeling
• Array printing
• Hybridization
• Scanning
• Data acquisition
• Data analysis
• Data interpretation

Sources of error
• Plant growth and stress conditions
• Tissue variation
• RNA quality
• Efficiency of labeling (esp. Cy3 vs Cy5)
• Reproducibility, pin effects
• Non-uniformity, background, cross-hybridization
• Variable scanner performance
• Inaccurate gridding
• Inconsistent background subtraction
• Faulty annotation

Types of replication

• Technical replication (on the same RNA sample, to gauge the effects of different arrays, hybridization conditions, etc.)
• Dye swap (to gauge the effect of using different dyes)
• Biological replication (different plants from the same treatment in the same experiment)
• Experimental replication (same experimental design but conducted at different times)

Log-transformed gene expression signal:

log(y) = µ + A + D + V + G + (AG) + (VG) + ε    (1)

where:
µ = average expression signal
A = array effect
D = dye effect
V = sample variety effect
G = gene effect
(AG) = combination of array and gene
(VG) = combination of variety and gene
ε = independent noise
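
As a rough illustration of equation (1), the Python sketch below generates noise-free log signals for a two-array dye-swap layout and recovers the variety-by-gene contrast, which is the quantity of interest for differential expression. The layout, the function names and the sum-to-zero constraint on the (VG) terms are assumptions made for this example; they are not taken from the slides.

```python
# Hypothetical dye-swap layout: (array, dye, variety) for each of the four channels.
LAYOUT = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def log_signal(mu, A, D, V, G, AG, VG):
    """Return {(array, dye, variety, gene): log(y)} from equation (1), with noise set to 0."""
    data = {}
    for a, d, v in LAYOUT:
        for g in range(len(G)):
            data[(a, d, v, g)] = mu + A[a] + D[d] + V[v] + G[g] + AG[a][g] + VG[v][g]
    return data

def variety_by_gene_contrast(data, n_genes):
    """Estimate VG[0][g] - VG[1][g] for every gene.

    In a dye swap each variety is seen once on each array and once with each dye,
    so array, dye, gene and array-by-gene terms cancel in the per-gene difference.
    Subtracting the mean difference over genes removes the overall variety effect
    (this assumes the VG terms sum to zero over genes).
    """
    diffs = []
    for g in range(n_genes):
        v0 = [y for (a, d, v, gg), y in data.items() if gg == g and v == 0]
        v1 = [y for (a, d, v, gg), y in data.items() if gg == g and v == 1]
        diffs.append(sum(v0) / len(v0) - sum(v1) / len(v1))
    overall = sum(diffs) / n_genes  # estimates V[0] - V[1]
    return [d - overall for d in diffs]
```

With real data the noise term ε does not cancel, which is why replication is needed to judge whether a contrast is significant.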

Random factors contributing to technical variance include:
• variation among replicate spots within a slide hybridization (corr > 95%)
• variation among replicate spots between slides (corr ~60-80%)
• variation introduced by scratches or dust or local hybridization effects
• variation introduced by subtraction of background from spot signal intensities
• variation introduced by tissue sampling
• variation introduced by RNA extraction

Sources of error

Systematic sources of variation include:
• different dyes (corr <60–80%) – include dye swaps
• multiple print tips (print group effects) – local data normalization

In contrast to the early days of microarray studies, most journals will no longer accept manuscripts without adequate sampling.

• Replication requires more resources, but an appropriate experimental design can increase the efficiency of resource utilization and optimize statistical power.

• Reference and balanced are the two basic designs.

• In reference designs, all experimental samples are labelled with one dye and each co-hybridized with a common reference sample that is labelled with a second dye.

• In balanced designs such as loops, experimental samples are labelled with both dyes and hybridized to each other.

• For the same number of slides, twice the number of experimental samples can be included in a balanced design compared to a reference design, leading to improved precision and increased statistical power.

• Furthermore, error due to technical variability is highest for reference designs.

Reference and balanced designs
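
The slide-count argument above can be checked with a small sketch. The Python below lays out a reference design and a simple loop design for the same samples and counts how many channels carry experimental material. The function names, the "reference" label and the convention that the first element of each pair is Cy3 and the second Cy5 are assumptions for illustration only.

```python
REFERENCE = "reference"  # hypothetical label for the common reference sample

def reference_design(samples):
    """One slide per sample: (experimental sample, common reference)."""
    return [(s, REFERENCE) for s in samples]

def loop_design(samples):
    """Each sample is hybridized against the next one, closing the loop."""
    n = len(samples)
    return [(samples[i], samples[(i + 1) % n]) for i in range(n)]

def experimental_channels(slides):
    """Count the channels that carry an experimental sample rather than the reference."""
    return sum(1 for pair in slides for s in pair if s != REFERENCE)

samples = ["A", "B", "C", "D"]
ref = reference_design(samples)
loop = loop_design(samples)
print(len(ref), experimental_channels(ref))    # 4 slides, 4 experimental channels
print(len(loop), experimental_channels(loop))  # 4 slides, 8 experimental channels
```

In the loop every sample appears once in each position, so each sample is labelled with both dyes, as the balanced design requires.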

Two treatments × two replicates × two dyes (dye swap) ÷ two scan wavelengths (λ) per slide = 4 slides

Simple two-treatment design

Churchill GA. 2002. Fundamentals of experimental design for cDNA microarrays. Nature Genetics 32: 490-495.

Design with and without reference samples

3.3: Statistical analysis

The TM4 suite of tools consists of four major applications:

1. Microarray Data Manager (MADAM)
2. TIGR_Spotfinder
3. Microarray Data Analysis System (MIDAS)
4. Multiexperiment Viewer (MeV)

Plus

5. A (MIAME*)-compliant MySQL database

Freely available at http://www.tigr.org/software

TM4 from The Institute for Genomic Research (TIGR)

*Minimal Information About a Microarray Experiment

Saeed et al. (2003). TM4: a free, open-source system for microarray data management and analysis. Biotechniques. 34: 374-378.

• After the spot intensity values are measured in TIGR Spotfinder, they must be normalized to help compensate for variability between slides and fluorescent dyes, as well as other systematic sources of error, by appropriately adjusting the measured array intensities.

• Data filtering can reduce the dataset by removing poor or questionable data. TIGR’s MIDAS, a Java application, provides an interface to design analysis protocols combining one or more normalization and filtering steps. MIDAS reads “.tav” files generated by TIGR Spotfinder or retrieved from the database via MADAM.

• Normalization modules include locally weighted linear regression [lowess] and total intensity normalization. These can be linked with filters, including low-intensity cutoff, intensity-dependent Z-score cutoffs, and replicate consistency trimming, creating a highly customizable method for preparing expression data for subsequent comparison and analysis. When the normalization and filtering steps are complete, MIDAS outputs the data in “.tav” format.

Data normalization and filtering via MIDAS
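
To make two of the steps named above concrete, here is a minimal Python sketch of total intensity normalization followed by a low-intensity cutoff. It is not MIDAS code: the function names, the choice to rescale the green channel and the cutoff value are assumptions for illustration.

```python
def total_intensity_normalize(red, green):
    """Rescale the green channel so that both channels have the same total intensity."""
    factor = sum(red) / sum(green)
    return list(red), [g * factor for g in green]

def low_intensity_filter(red, green, cutoff):
    """Return the indices of spots whose intensity is at least `cutoff` in both channels."""
    return [i for i, (r, g) in enumerate(zip(red, green)) if r >= cutoff and g >= cutoff]

red = [100.0, 200.0, 300.0, 50.0]
green = [50.0, 100.0, 150.0, 25.0]
norm_red, norm_green = total_intensity_normalize(red, green)
kept = low_intensity_filter(red, green, cutoff=60.0)  # filter on the raw intensities
```

Total intensity normalization assumes that most genes do not change between the two samples, so a global imbalance between the channels is treated as a labeling or scanning artifact.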

Quackenbush J. 2002. Microarray data normalization and transformation. Nature Genetics 32: 496-501.

Global versus local normalization. Most normalization algorithms, including lowess, can be applied either globally (to the entire data set) or locally (to some physical subset of the data). For spotted arrays, local normalization is often applied to each group of array elements deposited by a single spotting pen (sometimes referred to as a 'pen group' or 'subgrid').
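
The difference between global and local application can be sketched in a few lines of Python. For brevity this example swaps in simple median centering of log ratios instead of lowess; the function names and the pen-group labels are hypothetical.

```python
from collections import defaultdict
from statistics import median

def normalize_globally(log_ratios):
    """Centre all log ratios on the median of the whole array."""
    m = median(log_ratios)
    return [x - m for x in log_ratios]

def normalize_by_pen_group(log_ratios, pen_groups):
    """Centre the log ratios of each pen group (subgrid) on that group's own median."""
    members = defaultdict(list)
    for i, group in enumerate(pen_groups):
        members[group].append(i)
    out = [0.0] * len(log_ratios)
    for indices in members.values():
        m = median(log_ratios[i] for i in indices)
        for i in indices:
            out[i] = log_ratios[i] - m
    return out

log_ratios = [1.0, 1.2, 0.8, -0.5, -0.3, -0.7]
pen_groups = [0, 0, 0, 1, 1, 1]
global_norm = normalize_globally(log_ratios)
local_norm = normalize_by_pen_group(log_ratios, pen_groups)
```

In this toy data the two pen groups have different offsets, so only the local version brings both groups onto a common baseline.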

• TIGR Spotfinder was designed for the rapid, reproducible, and computer-aided analysis of microarray images and the quantification of gene expression. It reads paired 16-bit TIFF image files generated by most microarray scanners.

• Automatic and manual grid adjustments help to ensure that each rectangular grid cell is centered on a spot. Spot intensities are calculated as an integral of non-saturated pixels. Local background is subtracted from each intensity value.

• These calculated intensities, along with each spot’s position on the array, spot area, background values, and quality control flags, are written to a TIGR ArrayViewer (“.tav”) file format, a Microsoft Excel® workbook, or the database.

• In noisy areas of the slide, the user may manually identify or discard spots. Quality-control views allow the user to assess systematic biases in the data.

TIGR Spotfinder for image analysis
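
The intensity calculation described above might look roughly like the following Python sketch. It is not Spotfinder's actual algorithm: the use of the median of the surrounding pixels as the local background, the scaling of that background by the number of usable pixels and the function name are all assumptions.

```python
from statistics import median

SATURATION = 65535  # largest value a 16-bit TIFF pixel can hold

def spot_intensity(spot_pixels, background_pixels):
    """Integrate the non-saturated spot pixels and subtract the local background."""
    usable = [p for p in spot_pixels if p < SATURATION]
    integral = sum(usable)
    # Assumed background model: median background pixel times the number of usable pixels.
    background = median(background_pixels) * len(usable)
    return max(integral - background, 0)

value = spot_intensity([1000, 1200, 65535, 800], [100, 110, 90])
```

Clamping at zero is one possible convention for spots that are dimmer than their surroundings; flagging such spots for filtering would be an equally reasonable choice.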

• ANOVA of log-transformed gene expression signal (Kerr et al., 2000)
• mixture models for gene effect (Lee et al., 2000)
• multiplicative model (not logarithm-transformed) (Yang et al., 2001; Sasik et al., 2002)
• ratio-distribution model (Chen et al., 1997, 2002)
• binary model (Shmulevich and Zhang, 2002)
• rank-based models not sensitive to noise distributions (Ben-Dor et al., 2000)
• replicates using mixed models (Wernisch et al., 2003)
• quantitative noise analysis (Tu et al., 2002; Fathallah-Shaykh et al., 2002)
• design of reverse dye microarrays (Dobbin et al., 2003).

Proposed models for statistical analysis of microarray expression data

Pan (2002) compared different microarray statistical analysis methods:
• log-linear ANOVA mixed model (Pan et al., 2001; based on Tusher et al., 2001)
• two-sample t-test (Devore & Peck, 1997)
• regression (Thomas et al., 2001)

Pan W. 2002. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18: 546-554.

Quantification: spot quality

Spot scanning for control of spot quality

For 16 positions across each spot, determine intensity and calculate p value from t-test.

Spots with low p values are acceptable.

High p values could result from poor printing, damage, poor hybridization, or poor gridding.

LOWESS normalized data in GPR format

Distribution plot view

Data loaded into TMeV (TIGR) for statistical analysis

Comparison of biological replicates

Quackenbush J. 2002. Microarray data normalization and transformation. Nature Genetics 32: 496 – 501.

Lowess - Locally Weighted Linear Regression

[Figure: two panels plotting log10(R/G) against log10(R*G)]

http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/Norm_Lowess1.htm
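
A bare-bones version of lowess normalization can be written in pure Python, as sketched below. Each spot's log10(R/G) is corrected by a local straight-line fit against log10(R*G), using tricube weights over the nearest fraction `frac` of spots. This is a simplified sketch without the robustness iterations of full lowess; the function names and the default `frac` are assumptions, and it is not the MIDAS implementation.

```python
import math

def lowess_fit(x, y, frac=0.4):
    """Return the locally weighted linear fit of y on x, evaluated at every x."""
    n = len(x)
    k = max(2, math.ceil(frac * n))        # number of neighbours per local fit
    fitted = []
    for xi in x:
        dists = sorted(abs(xj - xi) for xj in x)
        h = dists[min(k, n) - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(xj - xi) / h, 1.0) ** 3) ** 3 for xj in x]  # tricube weights
        sw = sum(w)
        mx = sum(wi * xj for wi, xj in zip(w, x)) / sw
        my = sum(wi * yj for wi, yj in zip(w, y)) / sw
        sxx = sum(wi * (xj - mx) ** 2 for wi, xj in zip(w, x))
        sxy = sum(wi * (xj - mx) * (yj - my) for wi, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (xi - mx))
    return fitted

def lowess_normalize(red, green, frac=0.4):
    """Subtract the intensity-dependent trend from log10(R/G)."""
    a = [math.log10(r * g) for r, g in zip(red, green)]
    m = [math.log10(r / g) for r, g in zip(red, green)]
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

The cost grows with the square of the number of spots, so for full arrays a library implementation would be the practical choice.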

Replicated determination of ratio of two treatments

[Figure: scatter plot of log2(A/B) in replicate 1 versus log2(A/B) in replicate 2]

What was wrong here?

Intensity-dependent Z scores for identifying differential expression

[Figure: log10(R/G) plotted against log10(R*G), with spots classed as Z>2, 1<Z<2 and Z<1]
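
One way to compute intensity-dependent Z scores is sketched below in Python: spots are ranked by log10(R*G), and each spot's log ratio is compared with the mean and standard deviation of its neighbours in a sliding window. The window size, the function names and the use of the local mean as the centre are assumptions, not the MIDAS implementation.

```python
from statistics import mean, stdev

def intensity_dependent_z(a_values, m_values, window=50):
    """Z score of each log ratio relative to spots of similar intensity.

    a_values: log10(R*G) per spot; m_values: log10(R/G) per spot (at least two spots).
    """
    order = sorted(range(len(a_values)), key=lambda i: a_values[i])
    half = max(1, window // 2)
    z = [0.0] * len(a_values)
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        local = [m_values[j] for j in order[lo:hi]]
        sd = stdev(local)
        z[i] = (m_values[i] - mean(local)) / sd if sd > 0 else 0.0
    return z

def z_class(z):
    """Assign the three classes used on the slide."""
    size = abs(z)
    if size > 2:
        return "Z>2"
    if size > 1:
        return "1<Z<2"
    return "Z<1"
```

Because the standard deviation is estimated locally, a given fold change counts as more significant at high intensity, where replicate spots agree closely, than at low intensity, where the scatter is wider.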

ExpressConverter

ExpressConverter is a file transformation tool that reads microarray data files in a variety of file formats and generates .mev or .tav files as output, for uploading to the database with MADAM and for analysis with MIDAS and MeV. The supported formats include GenePix, ImaGene, ScanArray, ArrayVision and Agilent files. Affymetrix data files cannot be converted with ExpressConverter, but can be loaded directly into MeV.

TM4 utilities: SlideMap and ExpressConverter

SlideMap

SlideMap.pm is a Perl module used for conversion of spots to wells and wells to spots. It is useful when the array is custom-printed from PCR products presented to the arrayer in microtiter plates. SlideMap currently supports several commercial arrayers and 'generic' arrayers.

FAQ: http://www.tm4.org/faq.html

Normalized and filtered expression data in “.tav” files are analyzed using TIGR MeV, which generates informative and interrelated displays of expression and annotation data from single or multiple experiments.

Analysis modules currently implemented in MeV include:
• hierarchical clustering (8)
• k-means clustering (18)
• self-organizing maps (15)
• principal components analysis (17)
• cluster affinity search technique (3)
• self-organizing trees (13)
• template matching
• between-groups tests (including t-tests)
• bootstrapping and jackknifing, which resample the dataset to generate consensus clusters.

Data analysis via TIGR MeV
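
To give a feel for one of the modules listed above, here is a minimal k-means sketch in pure Python. It is not MeV's implementation: the squared Euclidean distance, the caller-supplied starting centroids and the function name are assumptions chosen to keep the example deterministic.

```python
def kmeans(profiles, initial_centroids, iterations=20):
    """Group expression profiles around k centroids; returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        labels = []
        for p in profiles:
            # Assign the profile to the closest centroid (squared Euclidean distance).
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            j = dists.index(min(dists))
            labels.append(j)
            clusters[j].append(p)
        # Move each centroid to the mean of its members; keep empty clusters in place.
        updated = [
            [sum(col) / len(members) for col in zip(*members)] if members else c
            for members, c in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

profiles = [[0.0, 0.1, 0.2], [0.1, 0.2, 0.3], [2.0, 2.1, 1.9], [2.2, 2.0, 2.1]]
labels, centroids = kmeans(profiles, [profiles[0], profiles[2]])
```

The result of k-means depends on the starting centroids and on the chosen number of clusters, which is one reason resampling methods such as bootstrapping are used to check cluster stability.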

3.4: Gene clustering

Datta S, Datta S. 2003. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19: 459-466.

• At first a mainly visual analysis was used for clustering of genes into similar groups (e.g., DeRisi et al., 1997)

• Subsequently, simple sorting of expression ratios and some form of ‘correlation distance’ were used to identify genes (Spellman et al., 1998; Eisen et al., 1998).

• Datta & Datta (2003) compared six different clustering methods:
(i) Hierarchical clustering with correlation (e.g., UPGMA) (Eisen et al., 1998)
(ii) Clustering by K-means
(iii) Diana (divisive clustering)
(iv) Fanny (fuzzy logic)
(v) Model-based clustering
(vi) Hierarchical clustering with partial least squares

They used microarray data of Chu et al. (1998) for yeast sporulation: 6118 genes, seven time points during the onset of sporulation (0-12 h) [http://cmgm.stanford.edu/pbrown/sporulation]

Comparison of clustering methods
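
The 'correlation distance' mentioned above is commonly taken as one minus the Pearson correlation between two expression profiles. The Python sketch below uses that definition; the exact distance used in the cited papers may differ, and the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    """0 for perfectly co-regulated profiles, 2 for perfectly opposed ones."""
    return 1.0 - pearson(x, y)

def distance_matrix(profiles):
    """Pairwise correlation distances, the usual input to hierarchical clustering."""
    return [[correlation_distance(p, q) for q in profiles] for p in profiles]

genes = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
matrix = distance_matrix(genes)
```

Because correlation ignores scale, the first two profiles are at distance zero even though one has twice the amplitude of the other; this suits time-course data where the shape of the response matters more than its magnitude.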

Chung et al. (2002). Molecular portraits and the family tree of cancer. Nature Genetics 32: 533-540.

Use of microarray data to cluster cancer types

Cluster analysis requires a suitable co-variable

Examples:
• Time (e.g., duration of treatment)
• Genotypes
• Stress level (e.g., salt concentration, temperature, water status)
• Any other suitable independent variable, or dummy independent variable, or co-variable
• Certain suitable combinations (e.g., temperature and water status)

Using FTSW (fraction of transpirable soil water) as the co-variable for cluster analysis

[Figure: NTR (y-axis, 0.0 to 1.4) plotted against FTSW (x-axis, 0.0 to 1.0)]

Each point on the curve represents a stage in stress development and can be related to changes in other physiological and molecular factors (such as photosynthesis and transcript levels).

NTR = normalized transpiration rate

Ermolaeva et al. (1998). Data management and analysis for gene expression arrays. Nature Genetics 20: 19-23.

Data management and analysis for gene expression arrays
