Gene Expression - Microarrays Misha Kapushesky European Bioinformatics Institute, EMBL St. Petersburg, Russia May 2010


Page 1: Gene Expression - Microarrays

Gene Expression - Microarrays

Misha Kapushesky
European Bioinformatics Institute, EMBL

St. Petersburg, Russia
May 2010

Page 2: Gene Expression - Microarrays
Page 3: Gene Expression - Microarrays
Page 4: Gene Expression - Microarrays
Page 5: Gene Expression - Microarrays
Page 6: Gene Expression - Microarrays

Compare gene expression in this cell type…

…after drug treatment

…at a later developmental time

…in a different body region

…after viral infection

…in samples from patients

…relative to a knockout

Page 7: Gene Expression - Microarrays

Gene expression is context-dependent, and is regulated in several basic ways:

• by region (e.g. brain versus kidney)

• in development (e.g. fetal versus adult tissue)

• in dynamic response to environmental signals (e.g. immediate-early response genes)

• in disease states

• by gene activity

Page 297

Page 8: Gene Expression - Microarrays

Outline: microarray data analysis

Gene expression

Microarrays

Preprocessing
  normalization
  scatter plots

Inferential statistics
  t-test
  ANOVA

Exploratory (descriptive) statistics
  distances
  clustering
  principal components analysis (PCA)

Page 9: Gene Expression - Microarrays

Microarrays: tools for gene expression

A microarray is a solid support (such as a membrane or glass microscope slide) on which DNA of known sequence is deposited in a grid-like array.

Page 312

Page 10: Gene Expression - Microarrays

Microarrays: tools for gene expression

The most common form of microarray is used to measure gene expression. RNA is isolated from matched samples of interest. The RNA is typically converted to cDNA, labeled with fluorescence (or radioactivity), then hybridized to microarrays in order to measure the expression levels of thousands of genes.

Page 11: Gene Expression - Microarrays


Measuring RNA abundances

Page 12: Gene Expression - Microarrays


How it works

Complementary hybridization:
- Put a part of the gene sequence on the array
- Convert mRNA to cDNA using reverse transcriptase

Page 13: Gene Expression - Microarrays


Spotted Arrays

• Robot puts little spots of DNA on glass slides
• Each spot is a DNA analog of the mRNA we want to detect

Page 14: Gene Expression - Microarrays


Spotted Arrays

• Two-channel technology for comparing two samples: relative measurements
• Two mRNA samples (reference, test) are reverse transcribed to cDNA, labeled with fluorescent dyes (Cy3, Cy5), and allowed to hybridize to the array

Page 15: Gene Expression - Microarrays


Spotted Arrays

• Read out two images by scanning array with lasers, one for each dye

Page 16: Gene Expression - Microarrays


Oligonucleotide Arrays

• One-channel technology: absolute measurements
• Instead of putting entire genes on the array, put multiple oligonucleotide probes: short, fixed-length DNA sequences (25-60 nucleotides)
• Oligos are synthesized in situ
• Affymetrix uses a photolithography process, similar to that used to make semiconductor chips
• Other technologies are available (e.g. mirror arrays)

Page 17: Gene Expression - Microarrays


Oligonucleotide Arrays

• For each gene, construct a probeset: a set of n-mers specific to this gene

Page 18: Gene Expression - Microarrays

Advantages of microarray experiments

Fast: Data on >20,000 transcripts within weeks

Comprehensive: Entire yeast or mouse genome on a chip

Flexible: Custom arrays can be made to represent genes of interest

Easy: Submit RNA samples to a core facility

Cheap? Chip representing 20,000 genes for $300

Page 19: Gene Expression - Microarrays

Disadvantages of microarray experiments

Cost
■ Some researchers can’t afford to do appropriate numbers of controls, replicates

RNA significance
■ The final product of gene expression is protein
■ “Pervasive transcription” of the genome is poorly understood (ENCODE project)
■ There are many noncoding RNAs not yet represented on microarrays

Quality control
■ Impossible to assess elements on array surface
■ Artifacts with image analysis
■ Artifacts with data analysis
■ Not enough attention to experimental design
■ Not enough collaboration with statisticians

Page 20: Gene Expression - Microarrays

Biological insight

Sample acquisition

Data acquisition

Data analysis

Data confirmation

Page 21: Gene Expression - Microarrays

Stage 1: Experimental design

Stage 2: RNA and probe preparation

Stage 3: Hybridization to DNA arrays

Stage 4: Image analysis

Stage 5: Microarray data analysis

Stage 6: Biological confirmation

Stage 7: Microarray databases

Page 22: Gene Expression - Microarrays

Stage 1: Experimental design

[1] Biological samples: technical and biological replicates: determine the data analysis approach at the outset

[2] RNA extraction, conversion, labeling, hybridization: except for RNA isolation, routinely performed at core facilities

[3] Arrangement of array elements on a surface: randomization can reduce spatially-based artifacts

Page 314

Page 23: Gene Expression - Microarrays

Stage 2: RNA preparation

For Affymetrix chips, need total RNA (about 5 ug)

Confirm purity by running agarose gel

Measure the A260/A280 absorbance ratio to confirm purity, quantity

One of the greatest sources of error in microarray experiments is artifacts associated with RNA isolation; appropriately balanced, randomized experimental design is necessary.

Page 24: Gene Expression - Microarrays

Stage 3: Hybridization to DNA arrays

The array consists of cDNA or oligonucleotides

Oligonucleotides can be deposited by photolithography

The sample is converted to cRNA or cDNA

(Note that the terms “probe” and “target” may refer to the element immobilized on the surface of the microarray, or to the labeled biological sample; for clarity, it may be simplest to avoid both terms.)

Page 25: Gene Expression - Microarrays

Stage 4: Image analysis

RNA transcript levels are quantitated

Fluorescence intensity is measured with a scanner.

Page 26: Gene Expression - Microarrays

Rett

Control

Differential Gene Expression on a cDNA Microarray

αB-crystallin is over-expressed in Rett syndrome

Page 27: Gene Expression - Microarrays
Page 28: Gene Expression - Microarrays

Fig. 8.21, page 319

Page 29: Gene Expression - Microarrays
Page 30: Gene Expression - Microarrays

Fig. 8.21, page 319

Page 31: Gene Expression - Microarrays

Stage 5: Microarray data analysis

Page 318

Hypothesis testing
• How can arrays be compared?
• Which RNA transcripts (genes) are regulated?
• Are differences authentic?
• What are the criteria for statistical significance?

Clustering
• Are there meaningful patterns in the data (e.g. groups)?

Classification
• Do RNA transcripts predict predefined groups, such as disease subtypes?

Page 32: Gene Expression - Microarrays

Stage 6: Biological confirmation

Page 320

Microarray experiments can be thought of as “hypothesis-generating” experiments.

The differential up- or down-regulation of specific RNA transcripts can be measured using independent assays such as

-- Northern blots
-- reverse transcription polymerase chain reaction (RT-PCR)
-- in situ hybridization

Page 33: Gene Expression - Microarrays

Stage 7: Microarray databases

There are two main repositories:

Gene Expression Omnibus (GEO) at NCBI

ArrayExpress at the European Bioinformatics Institute (EBI)

Page 34: Gene Expression - Microarrays

Microarray Overview I

[Diagram labels]
Microbial ORFs: design PCR primers; PCR products
Eukaryotic genes: select cDNA clones; PCR products
Microtiter plate: many different plates containing different genes
Microarray slide (with 60,000 or more spotted genes): for each plate set, many identical replicas

Page 35: Gene Expression - Microarrays

Microarray Overview II

[Diagram labels]
Prepare fluorescently labeled probes (control, test)
Hybridize, wash
Measure fluorescence in 2 channels (red/green)
Analyze the data to identify patterns of gene expression

Page 36: Gene Expression - Microarrays

Affymetrix GeneChip™ Expression Analysis

[Diagram labels]
Obtain RNA samples
Prepare fluorescently labeled probes (control, test)
Hybridize and wash chips
Scan chips
Analyze (PM, MM)

Page 37: Gene Expression - Microarrays

Microarray Expression Analysis

[Diagram labels]
Tissue selection
Differential state/stage selection
RNA preparation and labeling
Competitive hybridization
Gene spots on an array
Fluorescence intensity
Expression measurement

Page 38: Gene Expression - Microarrays

Steps in the Process

Select array elements and annotate them

Build a database to manage stuff

Print arrays and manage the lab

Hybridize and analyze images; manage data

Analyze hybridization data and get results

Page 39: Gene Expression - Microarrays

MIAME

In an effort to standardize microarray data presentation and analysis, Alvis Brazma and colleagues at 17 institutions introduced Minimum Information About a Microarray Experiment (MIAME). The MIAME framework standardizes six areas of information:

► experimental design
► microarray design
► sample preparation
► hybridization procedures
► image analysis
► controls for normalization

Visit http://www.mged.org

Page 40: Gene Expression - Microarrays

Interpretation of RNA analyses

The relationship of DNA, RNA, and protein: DNA is transcribed to RNA. RNA quantities and half-lives vary. There tends to be a low positive correlation between RNA and protein levels.

The pervasive nature of transcription: The Encyclopedia of DNA Elements (ENCODE) project identified functional features of genomic DNA, initially in 30 megabases (1% of the human genome). One of its observations was the “pervasive nature of transcription”: the vast majority of DNA is transcribed, although the function is unknown.

Page 41: Gene Expression - Microarrays

Outline: microarray data analysis

Gene expression

Microarrays

Preprocessing
  normalization
  scatter plots

Inferential statistics
  t-test
  ANOVA

Exploratory (descriptive) statistics
  distances
  clustering
  principal components analysis (PCA)

Page 42: Gene Expression - Microarrays

Microarray data analysis

• begin with a data matrix (gene expression values versus samples)

genes (RNA transcript levels)

Page 43: Gene Expression - Microarrays

Microarray data analysis

• begin with a data matrix (gene expression values versus samples)

Typically, there are many genes (>> 20,000) and few samples (~ 10)

Fig. 9.1, page 333

Page 44: Gene Expression - Microarrays

Microarray data analysis

• begin with a data matrix (gene expression values versus samples)

Preprocessing

Inferential statistics Descriptive statistics

Page 45: Gene Expression - Microarrays

Microarray data analysis: preprocessing

Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:

• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency

Page 46: Gene Expression - Microarrays

Microarray data analysis: preprocessing

The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription.

A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment.

Page 47: Gene Expression - Microarrays

Data analysis: global normalization

Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope.

Page 48: Gene Expression - Microarrays

Data analysis: global normalization

Global normalization is used to correct two or more data sets

Example: total fluorescence in the Cy3 channel = 4 million units; in the Cy5 channel = 2 million units

Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.

Page 49: Gene Expression - Microarrays

Data analysis: global normalization

Global normalization procedure

Step 1: subtract background intensity values (use a blank region of the array)

Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
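The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any particular package: the function names are hypothetical, and "average ratio = 1" is read here as equalizing the total intensity of the two channels, which matches the 4 million versus 2 million example on the previous slide.

```python
def subtract_background(intensities, background):
    """Step 1: subtract a background estimate (e.g. from a blank region),
    clamping at zero so no intensity becomes negative."""
    return [max(x - background, 0.0) for x in intensities]


def global_normalize(cy3, cy5):
    """Step 2: rescale the Cy5 channel so both channels have the same total.

    Assumption: equal channel totals stand in for 'average ratio = 1'.
    Returns the rescaled Cy5 values and the scaling factor.
    """
    factor = sum(cy3) / sum(cy5)
    return [x * factor for x in cy5], factor


# Toy data: every Cy3 value is twice the Cy5 value, as in the slide's
# 2,000 versus 1,000 units example.
cy3 = [2000.0, 400.0, 1600.0]
cy5 = [1000.0, 200.0, 800.0]

cy5_scaled, factor = global_normalize(cy3, cy5)
ratios = [g / r for g, r in zip(cy3, cy5_scaled)]
print(factor)   # scaling factor applied to Cy5
print(ratios)   # per-gene Cy3/Cy5 ratios after normalization
```

With these toy values the apparent 2-fold difference disappears after scaling, which is the artifact the slide warns about.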

Page 50: Gene Expression - Microarrays

Scatter plots

Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)

Each dot corresponds to a gene expression value

Most dots fall along a line

Outliers represent up-regulated or down-regulated genes

Page 51: Gene Expression - Microarrays

Brain

Astrocyte Astrocyte

Fibroblast

Differential Gene Expressionin Different Tissue and Cell Types

Page 52: Gene Expression - Microarrays

[Scatter plot: expression level (sample 1) on the x-axis versus expression level (sample 2) on the y-axis. Annotations: expression level (low to high); up; down.]

Page 53: Gene Expression - Microarrays

Log-log transformation

Page 54: Gene Expression - Microarrays

Scatter plots

Typically, data are plotted on log-log coordinates

Visually, this spreads out the data and offers symmetry

time    behavior       raw ratio value    log2 ratio value
t=0     basal          1.0                 0.0
t=1h    no change      1.0                 0.0
t=2h    2-fold up      2.0                 1.0
t=3h    2-fold down    0.5                -1.0
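The table above can be reproduced directly. The short sketch below (variable names are illustrative) shows why the log2 scale is symmetric: a 2-fold increase and a 2-fold decrease sit at +1 and -1, whereas the raw ratios 2.0 and 0.5 are not equidistant from 1.0.

```python
import math

# Raw ratios from the table above, keyed by time point.
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# log2 transform: no change maps to 0, n-fold up/down map to +/- log2(n).
log2_ratios = {time: math.log2(ratio) for time, ratio in raw_ratios.items()}

for time, value in log2_ratios.items():
    print(f"{time:5s} raw={raw_ratios[time]:.1f} log2={value:+.1f}")
```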

Page 55: Gene Expression - Microarrays

[Plot: log ratio (y-axis) versus mean log intensity (x-axis). Annotations: expression level (low to high); up; down.]

Page 56: Gene Expression - Microarrays

You can make these plots in Excel…

…but for many bioinformatics applications use R. Visit http://www.r-project.org to download it.

Page 57: Gene Expression - Microarrays
Page 58: Gene Expression - Microarrays

There are limits to what you can measure

Page 59: Gene Expression - Microarrays

The Limits of log-ratios: The space we explore

Page 60: Gene Expression - Microarrays

The Limits of log-ratios: The space we explore

Page 61: Gene Expression - Microarrays

The Limits of log-ratios: The space we explore

Page 62: Gene Expression - Microarrays

Good Data

Page 63: Gene Expression - Microarrays

Bad Data from Parts Unknown

Gary Churchill

Each “pin group” is colored differently

Page 64: Gene Expression - Microarrays

Lowess Normalization

Why LOWESS?

[Plot A: y-axis ticks from -3 to 3; x-axis log(Cy3*Cy5), ticks from 7 to 14; SD = 0.346]

1. Intensity-dependent structure
2. Data not mean-centered at log2(ratio) = 0

Page 65: Gene Expression - Microarrays

11ab Raw Ratios

[Plot: series 11a:ratio and 11b:ratio; y-axis ticks from 0 to 2]

Ratio Cy3/Cy5 for the same RNA, sorted from least to most expressed

Page 66: Gene Expression - Microarrays

LOWESS Results

Page 67: Gene Expression - Microarrays

Affymetrix Chips

Page 68: Gene Expression - Microarrays

Mismatch (MM) probes

• MM probes are used to measure background signals due to non-specific sources and scanner offset.

• Using a MM probe as an estimate of background seems wrong and often the MM signal >= the PM signal

• Some would claim that subtraction of the mismatch probe adds noise for little gain.

Page 69: Gene Expression - Microarrays

Computing expression summaries: a three-step process

• Background/signal adjustment
• Normalization (can happen at the probe-pair or the probe-set level)
• Summarization of probe pairs into probe-set or gene-level information

Page 70: Gene Expression - Microarrays

Background/Signal Adjustment

• A method which does some or all of the following:
  - Corrects for background noise, processing effects
  - Adjusts for cross-hybridization
  - Adjusts estimated expression values to fall on the proper scale

• Probe intensities are used in background adjustment to compute the correction (unlike cDNA arrays, where the area surrounding a spot might be used)

Page 71: Gene Expression - Microarrays

Normalization Methods

• Complete data (no reference chip; information from all arrays used)
  - Quantile normalization (Bolstad et al., 2003)

• Baseline (normalized using a reference chip)
  - Scaling (Affymetrix)
  - Nonlinear (Li-Wong)
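As an illustration of the "complete data" approach, quantile normalization can be sketched as follows. This is a simplified version under stated assumptions: each array is a plain list of intensities of equal length, ties are broken by position rather than averaged, and the function name is hypothetical. It is not the Bioconductor implementation.

```python
def quantile_normalize(arrays):
    """Force every array to share the same intensity distribution.

    arrays: list of arrays, each a list of intensities for the same probes.
    Each value is replaced by the mean, across arrays, of the values that
    hold the same rank.
    """
    n_probes = len(arrays[0])
    n_arrays = len(arrays)

    # Mean intensity at each rank, taken across all arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[rank] for s in sorted_arrays) / n_arrays
        for rank in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized


arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
result = quantile_normalize(arrays)
print(result)
```

After normalization every array contains exactly the same set of values; only the assignment of values to probes differs, which preserves each array's rank order.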

Page 72: Gene Expression - Microarrays

Summarization

• Reduce the 11-20 probe intensities on each array to a single number for gene expression

• Main approaches

  Single chip
  - AvDiff (Affymetrix): no longer recommended for use due to many flaws
  - MAS 5.0 (Affymetrix): uses a one-step Tukey biweight to combine the probe intensities on the log scale

  Multiple chip
  - MBEI (Li-Wong dChip): a multiplicative model
  - RMA: a robust multi-chip linear model fit on the log scale

Page 73: Gene Expression - Microarrays

Robust multi-array analysis (RMA)

• Developed by Rafael Irizarry (Dept. of Biostatistics), Terry Speed, and others
• Available at www.bioconductor.org as an R package
• Also available in various software packages (including Partek, www.partek.com, and Iobion Gene Traffic)
• See Bolstad et al. (2003) Bioinformatics 19; Irizarry et al. (2003) Biostatistics 4

There are three steps:

[1] Background adjustment based on a normal plus exponential model (no mismatch data are used)
[2] Quantile normalization (nonparametric fitting of signal intensity data to normalize their distribution)
[3] Fitting a log scale additive model robustly. The model is additive: probe effect + sample effect

Page 74: Gene Expression - Microarrays

GCRMA

• GC-RMA is a modified version of RMA that models intensity of probe level data as a function of GC-content

• expect to see higher intensity values for probes that are GC rich due to increased binding

Page 75: Gene Expression - Microarrays
Page 76: Gene Expression - Microarrays

[Two scatter plots, each with axes A (x-axis) and M (y-axis)]

After RMA (a normalization procedure), the median is near zero, and skewing is corrected.

Scatterplots display the effects of normalization.

Page 77: Gene Expression - Microarrays

vsn: variance stabilizing normalization

• Variance depends on signal intensity in microarray data

• A transformation can be found after which the variance is approximately constant

• The transformation behaves like the logarithm at the upper end of the intensity range and is approximately linear at the lower end

• Also incorporates the estimation of "normalization" parameters (shift and scale)

• Assumes that less than half of the genes on the arrays are differentially transcribed across the experiment.
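The "logarithm at the upper end, linear at the lower end" behaviour can be illustrated with the inverse hyperbolic sine. The sketch below assumes a transformation of the form arsinh(a + b·x) with shift a and scale b; the function name and default parameters are hypothetical, and the way vsn actually estimates these parameters from the data is not shown.

```python
import math


def glog(x, shift=0.0, scale=1.0):
    """Generalized-log style transform: arsinh(shift + scale * x).

    shift and scale stand in for the per-array "normalization" parameters;
    here they are fixed by hand purely for illustration.
    """
    return math.asinh(shift + scale * x)


# High intensities: arsinh(x) approaches log(2x), i.e. log(x) plus a constant.
high = 1e4
print(glog(high), math.log(2 * high))

# Low intensities: arsinh(x) is approximately x itself (linear).
low = 0.01
print(glog(low), low)

# Unlike the plain logarithm, the transform is defined at zero and below.
print(glog(0.0), glog(-5.0))
```

Being defined for zero and negative values matters after background subtraction, where a plain log transform would fail.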

Page 78: Gene Expression - Microarrays

vsn: post-normalization plot

Page 79: Gene Expression - Microarrays

[Two plots, each with axes: array (x-axis); log signal intensity (y-axis)]

Histograms of raw intensity values for 14 arrays (plotted in R) before and after RMA was applied.

Page 80: Gene Expression - Microarrays

RMA can adjust for the effect of GC content

[Plot: log intensity (y-axis) versus GC content (x-axis)]

Page 81: Gene Expression - Microarrays

Robust multi-array analysis (RMA)

RMA offers a large increase in precision (relative to Affymetrix MAS 5.0 software).

[Plot (precision): log expression SD (y-axis) versus average log expression (x-axis), for RMA and MAS 5.0]

Page 82: Gene Expression - Microarrays

Robust multi-array analysis (RMA)

RMA offers comparable accuracy to MAS 5.0.

[Plot (accuracy): observed log expression (y-axis) versus log nominal concentration (x-axis)]

Page 83: Gene Expression - Microarrays

Outline: microarray data analysis

Gene expression

Microarrays

Preprocessing
  normalization
  scatter plots

Inferential statistics
  t-test
  ANOVA

Exploratory (descriptive) statistics
  distances
  clustering
  principal components analysis (PCA)

Page 84: Gene Expression - Microarrays

Inferential statistics

Inferential statistics are used to make inferences about a population from a sample.

Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference.

We use a test statistic to decide whether to reject the null hypothesis or fail to reject it. For many applications, we set the significance level to 0.05 and reject the null hypothesis when p < 0.05.

Page 85: Gene Expression - Microarrays

Analyzing expression data

Question: for each of my 20,000 transcripts, decide whether it is significantly regulated in some disease.

[1] Obtain a matrix of genes (rows) and expression values (columns). Here there are 20,000 rows of genes of which the first six are shown. There are three control samples and three disease samples. Calculate the mean value for each gene (transcript) for the controls and the disease (experimental) samples.

[Table column groups: control, disease]

Page 86: Gene Expression - Microarrays

[2] Calculate the ratios of control versus disease.

Also note that some ratios, such as 2.00, appear to be dramatic while others are not. Some researchers set a cut-off for changes of interest such as two-fold.

Analyzing expression data

Page 87: Gene Expression - Microarrays

A significant difference

Probably not

Page 88: Gene Expression - Microarrays

Inferential statistics

A t-test is commonly used to assess the difference in mean values between two groups.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$

Questions

• Is the sample size (n) adequate?
• Are the data normally distributed?
• Is the variance of the data known?
• Is the variance the same in the two groups?
• Is it appropriate to set the significance level to p < 0.05?

Page 89: Gene Expression - Microarrays

Inferential statistics

A t-test is commonly used to assess the difference in mean values between two groups.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$

Notes

• t is a ratio (it thus has no units)
• We assume the two populations are Gaussian
• The two groups may be of different sizes
• Obtain a P value from t using a table
• For a two-sample t test, the degrees of freedom is N - 2
• For any value of t, P gets smaller as df gets larger
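A minimal sketch of the calculation for a single transcript follows. It assumes the equal-variance (pooled) form of the two-sample t-test, which is the form consistent with N - 2 degrees of freedom; the function name and the toy expression values are illustrative. The p value would then be looked up from t and df, as the notes say.

```python
import math
from statistics import mean, variance


def two_sample_t(group1, group2):
    """Return (t, df) for a pooled-variance two-sample t-test.

    t  = (mean1 - mean2) / SE, where SE is the standard error of the
         difference between the two means.
    df = N - 2, with N the total number of observations.
    """
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / se, df


# Toy values for one transcript: three disease and three control samples.
disease = [4.0, 5.0, 6.0]
control = [1.0, 2.0, 3.0]

t, df = two_sample_t(disease, control)
print(round(t, 3), df)
```

In a microarray experiment this calculation is repeated for every row of the data matrix, which is what motivates the multiple-testing discussion later in the deck.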

Page 90: Gene Expression - Microarrays

[3] Perform a t-test. Hypothesis is that the transcript in the disease group is up (or down) relative to controls.

Analyzing expression data

Page 91: Gene Expression - Microarrays

[4] Note the results: you can have…

• a small p value (<0.05) with a big ratio difference
• a small p value (<0.05) with a trivial ratio difference
• a large p value (>0.05) with a big ratio difference
• a large p value (>0.05) with a trivial ratio difference

Analyzing expression data

Page 92: Gene Expression - Microarrays

Inferential statistics

Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05.

You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes:

p < (0.05)/10,000 or p < 5 x 10^-6

The Bonferroni correction is generally considered to be too conservative.
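The arithmetic on this slide can be checked with a short sketch (function names are illustrative):

```python
def expected_false_positives(alpha, n_tests):
    """Number of tests expected to pass the threshold by chance alone."""
    return alpha * n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


n_genes = 10_000
print(expected_false_positives(0.05, n_genes))  # about 500 genes by chance
print(bonferroni_threshold(0.05, n_genes))      # about 5e-06
```

The corrected threshold is four orders of magnitude stricter than 0.05, which is why the correction is usually regarded as too conservative for expression data.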

Page 93: Gene Expression - Microarrays

Inferential statistics: false discovery rate

The false discovery rate (FDR) is a popular multiple comparisons correction. A false positive (also called a type I error) is sometimes called a false discovery.

The expected number of false positives equals the p value threshold of the t-test times the number of genes measured (e.g. for 10,000 genes and a p value of 0.01, there are 100 expected false positives). The FDR is the expected fraction of false discoveries among the transcripts reported as regulated. You can adjust the false discovery rate. For example:

FDR     # regulated transcripts    # false discoveries
0.1     100                        10
0.05    45                         3
0.01    20                         1

Would you report 100 regulated transcripts of which 10 are likely to be false positives, or 20 transcripts of which one is likely to be a false positive?
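The slides do not say which FDR procedure is meant. One widely used choice is the Benjamini-Hochberg step-up procedure, sketched below as an assumption rather than as the method behind the table; the function name and the toy p values are illustrative.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests declared significant at the given FDR.

    Benjamini-Hochberg: sort the m p values, find the largest rank k with
    p_(k) <= fdr * k / m, and report the k smallest p values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Toy p values for eight transcripts.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]

significant = benjamini_hochberg(p_values, fdr=0.05)
print(significant)

# For comparison: Bonferroni at 0.05 would require p < 0.05 / 8 = 0.00625.
bonferroni = [i for i, p in enumerate(p_values) if p < 0.05 / len(p_values)]
print(bonferroni)
```

On this toy input the FDR procedure reports two transcripts where Bonferroni reports one, illustrating the trade-off between the number of reported transcripts and the number of expected false discoveries.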

Page 94: Gene Expression - Microarrays

Inferential statistics: other methods used

• t-test for two sample groups, SAM and t-tests with permutation testing

• ANOVA for multiple factors

• Linear models with Bayesian moderation of variance
  Smyth G. (2004) “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments”

• Simultaneous inference: multivariate t-distributions for simultaneous confidence intervals
  Hsu et al. (1996) “Multiple Comparisons: Theory and Methods”
  Hsu et al. (2006) “Screening for Differential Gene Expressions from Microarray Data”

Page 95: Gene Expression - Microarrays

[Volcano plot: p value (treated versus control) on the y-axis versus log fold change (treated/untreated) on the x-axis]

A volcano plot displays both p values and fold change

Page 96: Gene Expression - Microarrays

Outline: microarray data analysis

Gene expression

Microarrays

Preprocessing
  normalization
  scatter plots

Inferential statistics
  t-test
  ANOVA

Exploratory (descriptive) statistics
  distances
  clustering
  principal components analysis (PCA)

Page 97: Gene Expression - Microarrays
Page 98: Gene Expression - Microarrays
Page 99: Gene Expression - Microarrays
Page 100: Gene Expression - Microarrays

Descriptive statistics

Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples.

Descriptive (exploratory) statistics help you to find meaningful patterns in the data.

A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:

-- Euclidean distance
-- Pearson coefficient of correlation
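Both metrics are easy to compute for a pair of expression profiles. The sketch below uses illustrative function names and toy profiles; it shows that the two metrics can disagree: profiles with the same shape but different magnitude are far apart in Euclidean terms yet perfectly correlated.

```python
import math
from statistics import mean


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson coefficient of correlation between two profiles (-1 to +1)."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - my) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Two toy profiles over three time points: same trend, different magnitude.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]

print(euclidean_distance(gene_a, gene_b))   # sqrt(14), about 3.742
print(pearson_correlation(gene_a, gene_b))  # about 1.0
```

Which metric is appropriate depends on whether absolute expression level or the pattern of change across samples is of interest.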

Page 101: Gene Expression - Microarrays

What is a cluster?

A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation). The relationships between objects being studied are assessed by similarity or dissimilarity measures.

Page 102: Gene Expression - Microarrays

Data matrix (20 genes and 3 time points from Chu et al., 1998)

Software: S-PLUS package

genes

samples (time points)

Page 103: Gene Expression - Microarrays

3D plot (using S-PLUS software)

[3D plot axes: t=0, t=0.5, t=2.0]

Page 104: Gene Expression - Microarrays

Descriptive statistics: clustering

Clustering algorithms offer useful visual descriptions of microarray data.

Genes may be clustered, or samples, or both.

We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first).

In each case, we end up with a tree having branches and nodes.

Page 355

Page 105: Gene Expression - Microarrays

Distance Is Defined by a Metric

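[Figure: expression profiles plotted as log2(cy5/cy3), y-axis from -3 to 3, with the distance D for two pairs of profiles]

Distance metric    Euclidean    Pearson*
D                  6.0          +1.00
D                  1.4          -0.05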

Page 106: Gene Expression - Microarrays

Distance is Defined by a Metric

[Figure: expression profiles plotted as log2(cy5/cy3), y-axis from -2 to 2, with the distance D for two pairs of profiles]

Distance metric    Euclidean    Pearson (r*-1)
D                  4.2          -1.00
D                  1.4          -0.90

Page 107: Gene Expression - Microarrays

Once a distance metric has been selected, the starting point for all clustering methods is a “distance matrix”

Distance Matrix

         Gene1   Gene2   Gene3   Gene4   Gene5   Gene6
Gene1    0       1.5     1.2     0.25    0.75    1.4
Gene2    1.5     0       1.3     0.55    2.0     1.5
Gene3    1.2     1.3     0       1.3     0.75    0.3
Gene4    0.25    0.55    1.3     0       0.25    0.4
Gene5    0.75    2.0     0.75    0.25    0       1.2
Gene6    1.4     1.5     0.3     0.4     1.2     0

The elements of this matrix are the pair-wise distances. Note that the matrix is symmetric about the diagonal.

Page 108: Gene Expression - Microarrays

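Agglomerative clustering

[Figure, step 1: objects a, b, c, d, e; a and b are joined into (a,b). Axis ticks: 4 3 2 1 0]

Adapted from Kaufman and Rousseeuw (1990)

Page 109: Gene Expression - Microarrays

Agglomerative clustering

[Figure, step 2: clusters (a,b) and (d,e) have formed. Axis ticks: 4 3 2 1 0]

Page 110: Gene Expression - Microarrays

Agglomerative clustering

[Figure, step 3: clusters (a,b), (d,e), and (c,d,e) have formed. Axis ticks: 4 3 2 1 0]

Page 111: Gene Expression - Microarrays

Agglomerative clustering

[Figure, step 4: clusters (a,b), (d,e), (c,d,e), and (a,b,c,d,e) have formed. Axis ticks: 4 3 2 1 0]

…tree is constructed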

Page 112: Gene Expression - Microarrays

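Divisive clustering

[Figure, step 1: all objects in one cluster (a,b,c,d,e). Axis ticks: 4 3 2 1 0]

Page 113: Gene Expression - Microarrays

Divisive clustering

[Figure, step 2: (c,d,e) is split from (a,b,c,d,e). Axis ticks: 4 3 2 1 0]

Page 114: Gene Expression - Microarrays

Divisive clustering

[Figure, step 3: (d,e) is split from (c,d,e). Axis ticks: 4 3 2 1 0]

Page 115: Gene Expression - Microarrays

Divisive clustering

[Figure, step 4: clusters (a,b), (d,e), (c,d,e), and (a,b,c,d,e) are shown. Axis ticks: 4 3 2 1 0]

Page 116: Gene Expression - Microarrays

Divisive clustering

[Figure, step 5: individual objects a, b, c, d, e with clusters (a,b), (d,e), (c,d,e), and (a,b,c,d,e). Axis ticks: 4 3 2 1 0]

…tree is constructed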

Page 117: Gene Expression - Microarrays

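[Figure: the same tree over objects a, b, c, d, e with clusters (a,b), (d,e), (c,d,e), and (a,b,c,d,e), read in the agglomerative direction and in the divisive direction; each direction has its own axis with ticks 4 3 2 1 0]

Adapted from Kaufman and Rousseeuw (1990)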

Page 118: Gene Expression - Microarrays
Page 119: Gene Expression - Microarrays
Page 120: Gene Expression - Microarrays

[Figure with labels 1 and 12]

Agglomerative and divisive clustering sometimes give conflicting results, as shown here

Page 121: Gene Expression - Microarrays

Agglomerative Linkage Methods

Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.

Three linkage methods that are commonly used are:

• Single Linkage
• Average Linkage
• Complete Linkage

(HCL-6)

Page 122: Gene Expression - Microarrays

Single Linkage

Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.

$$ D_{AB} = \min_{i,j} \, d(u_i, v_j) $$

where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$

(HCL-7)

Page 123: Gene Expression - Microarrays

Average Linkage

Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.

$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$

where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$

[Figure: clusters A and B with the distance $D_{AB}$ marked]

(HCL-8)

Page 124: Gene Expression - Microarrays

Complete Linkage

Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.

$D_{AB} = \max_{i,j} \, d(u_i, v_j)$

where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$

[Figure: clusters A and B with the distance $D_{AB}$ marked]

(HCL-9)
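The three linkage rules (single, average, complete) differ only in how they summarize the pairwise distances between two clusters. A minimal sketch follows, assuming Euclidean distance for d(u, v); the clusters A and B are hypothetical points invented for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def pairwise(A, B, d=euclidean):
    """All N_A * N_B distances d(u_i, v_j) with u_i in A and v_j in B."""
    return [d(u, v) for u in A for v in B]


def single_linkage(A, B):
    return min(pairwise(A, B))


def average_linkage(A, B):
    return sum(pairwise(A, B)) / (len(A) * len(B))


def complete_linkage(A, B):
    return max(pairwise(A, B))


# Hypothetical clusters (two genes each, two conditions per gene).
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(A, B))    # 3.0
print(average_linkage(A, B))   # 4.5
print(complete_linkage(A, B))  # 6.0
```

For any pair of clusters the single-linkage value is never larger than the average, and the average is never larger than the complete-linkage value.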

Page 125: Gene Expression - Microarrays

Comparison of Linkage Methods

[Figure: three panels labeled Single, Average, Complete]

Page 126: Gene Expression - Microarrays

Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000)

Page 127: Gene Expression - Microarrays

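[Figure: points A and B, and points A’ and B’, plotted in the (x1, x2) plane with axis ticks at 0.5, 1 and 1.5. Their coordinates are marked as a1, a2, b1, b2 and a’1, a’2, b’1, b’2. Three measures are illustrated: Euclidean distance, chord distance, angle distance.]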
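The slide only names the three measures. A small sketch follows, assuming the usual definitions: chord distance is the Euclidean distance between the two vectors after each is scaled to unit length, and angle distance is the angle between them in radians. The points A and B are hypothetical, not the coordinates of the figure.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def unit(v):
    """Scale a vector to length 1 (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def chord_distance(a, b):
    """Euclidean distance between the unit-length versions of a and b."""
    return euclidean_distance(unit(a), unit(b))


def angle_distance(a, b):
    """Angle between a and b in radians."""
    cosine = sum(x * y for x, y in zip(unit(a), unit(b)))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.acos(max(-1.0, min(1.0, cosine)))


# Hypothetical points in the (x1, x2) plane.
A = (1.0, 0.0)
B = (0.0, 2.0)

print(euclidean_distance(A, B))  # sqrt(5)
print(chord_distance(A, B))      # sqrt(2)
print(angle_distance(A, B))      # pi / 2
```

Chord and angle distance ignore the length of the vectors, so two genes with the same expression pattern but different amplitude are treated as close; Euclidean distance treats them as far apart.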

Page 128: Gene Expression - Microarrays

K-Means/Medians Clustering – 1

1. Specify number of clusters, e.g., 5.

2. Randomly assign genes to clusters.

[Figure: genes G1 to G13 assigned at random to the clusters]

Page 129: Gene Expression - Microarrays

K-Means/Medians Clustering – 2

3. Calculate mean/median expression profile of each cluster.

4. Shuffle genes among clusters such that each gene is now in the cluster whose mean expression profile (calculated in step 3) is the closest to that gene’s expression profile.

[Figure: genes G1 to G13 regrouped among the clusters]

5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.

k-means is most useful when the user has an a priori hypothesis about the number of clusters the genes should belong to.
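Steps 1 to 5 can be sketched with NumPy as below. This is a minimal illustration, not the implementation used by any particular package: the function name `k_means`, the Euclidean distance, the handling of empty clusters and the toy expression matrix are all assumptions made for the example.

```python
import numpy as np


def k_means(profiles, k, max_iter=100, seed=0):
    """profiles: genes x conditions array. Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    n_genes = profiles.shape[0]
    # Step 2: random initial assignment. Taking a random permutation
    # modulo k keeps every cluster non-empty when n_genes >= k.
    labels = rng.permutation(n_genes) % k
    for _ in range(max_iter):
        # Step 3: mean expression profile of each cluster.
        centroids = np.empty((k, profiles.shape[1]))
        for c in range(k):
            members = profiles[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
            else:
                # Assumption: an empty cluster is reseeded with a random gene.
                centroids[c] = profiles[rng.integers(n_genes)]
        # Step 4: move each gene to the cluster with the closest mean profile.
        dists = np.linalg.norm(profiles[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no gene changes cluster (or max_iter is reached).
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: 6 genes x 3 conditions, two obvious groups.
profiles = np.array([
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0], [5.1, 5.0, 4.9], [4.9, 5.1, 5.0],
])
labels, centroids = k_means(profiles, k=2)  # step 1: k chosen by the user
print(labels)
```

Replacing `members.mean(axis=0)` with `np.median(members, axis=0)` gives the K-Medians variant. The cluster numbers themselves are arbitrary; only the grouping is meaningful.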

Page 130: Gene Expression - Microarrays

K-Means / K-Medians Support (KMS)

Because of the random initialization of K-Means/K-Medians, clustering results may vary somewhat between successive runs on the same dataset. KMS helps us validate the clustering results obtained from K-Means/K-Medians.

Run K-Means / K-Medians multiple times.

The KMS module generates clusters in which the member genes frequently group together in the same clusters (“consensus clusters”) across multiple runs of K-Means / K-Medians.

The consensus clusters consist of genes that clustered together in at least x% of the K-Means / K-Medians runs, where x is the threshold percentage input by the user.
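The consensus idea can be illustrated by counting how often each pair of genes lands in the same cluster across runs. The sketch below is not the KMS module itself: the greedy grouping rule (a gene joins a group only if it co-clusters with every member at least the threshold fraction of the time) is an assumption, and the cluster labels in `runs` are hypothetical.

```python
import numpy as np


def co_occurrence(runs):
    """Fraction of runs in which each pair of genes shares a cluster.

    runs: array of shape (n_runs, n_genes) holding one cluster label per
    gene per run. Label values need not match between runs.
    """
    runs = np.asarray(runs)
    same = runs[:, :, None] == runs[:, None, :]  # (n_runs, n_genes, n_genes)
    return same.mean(axis=0)


def consensus_clusters(runs, threshold):
    """Greedy grouping (an assumption, see the text above)."""
    freq = co_occurrence(runs)
    groups = []
    for gene in range(freq.shape[0]):
        for group in groups:
            if all(freq[gene, member] >= threshold for member in group):
                group.append(gene)
                break
        else:
            groups.append([gene])
    return groups


# Hypothetical labels for 5 genes over 4 runs.
runs = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],  # same partition as the first run, labels swapped
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
print(consensus_clusters(runs, threshold=0.75))  # [[0, 1], [2, 3], [4]]
```

Comparing pairs of genes rather than raw labels is what makes the count insensitive to label swaps between runs. The greedy pass depends on gene order, so it is only a rough stand-in for whatever rule a real tool applies.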

Page 131: Gene Expression - Microarrays

Principal components analysis (PCA)

An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D

For a matrix of m genes x n samples, create a new covariance matrix of size n x n

Thus transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
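The covariance route described on this slide can be sketched with NumPy. This is a minimal illustration under stated assumptions: genes are rows and samples are columns, the function name `pca` is invented for the example, and the data are random numbers rather than real expression values.

```python
import numpy as np


def pca(data, n_components=2):
    """data: m genes x n samples. Returns (scores, explained_variance_ratio)."""
    centered = data - data.mean(axis=0)       # center each sample (column)
    cov = np.cov(centered, rowvar=False)      # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs[:, :n_components]   # m x n_components
    return scores, eigvals / eigvals.sum()


# Hypothetical data: 50 genes x 4 samples.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))

scores, ratio = pca(data, n_components=2)
print(scores.shape)   # (50, 2): each gene is now a point in 2D
print(ratio)          # share of total variance carried by each PC
```

Plotting the two columns of `scores` against each other gives the usual 2D view. The columns are uncorrelated by construction, which is the property the slide refers to.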

Page 132: Gene Expression - Microarrays

Principal components analysis (PCA): objectives

• to reduce dimensionality

• to determine the linear combination of variables

• to choose the most useful variables (features)

• to visualize multidimensional data

• to identify groups of objects (e.g. genes/samples)

• to identify outliers

Page 133: Gene Expression - Microarrays

http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

Page 134: Gene Expression - Microarrays


Page 135: Gene Expression - Microarrays


Page 136: Gene Expression - Microarrays


Page 137: Gene Expression - Microarrays
Page 138: Gene Expression - Microarrays
Page 139: Gene Expression - Microarrays


Page 140: Gene Expression - Microarrays


High-throughput methods beyond microarrays

Page 141: Gene Expression - Microarrays


RNA-seq

• Sequencing technology is making fast progress

• Idea: sequencing is so cheap that we can sequence mRNA molecules directly

“Digital Gene Expression”

Page 142: Gene Expression - Microarrays


RNA-seq

(a) After two rounds of poly(A) selection, RNA is fragmented to an average length of 200 nt by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming. The cDNA is then converted into a molecular library for Illumina/Solexa 1G sequencing, and the resulting 25-bp reads are mapped onto the genome. Normalized transcript prevalence is calculated with an algorithm from the ERANGE package.

(b) Primary data from mouse muscle RNAs that map uniquely in the genome to a 1-kb region of the Myf6 locus, including reads that span introns. The RNA-Seq graph above the gene model summarizes the quantity of reads, so that each point represents the number of reads covering each nucleotide, per million mapped reads (normalized scale of 0–5.5 reads).

(c) Detection and quantification of differential expression. Mouse poly(A)-selected RNAs from brain, liver and skeletal muscle for a 20-kb region of chromosome 10 containing Myf6 and its paralog Myf5, which are muscle specific. In muscle, Myf6 is highly expressed in mature muscle, whereas Myf5 is expressed at very low levels from a small number of cells. The specificity of RNA-Seq is high: Myf6 expression is known to be highly muscle specific, and only 4 reads out of 71 million total liver and brain mapped reads were assigned to the Myf6 gene model.
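The "per million mapped reads" scale in this caption is a simple rescaling of raw counts. The sketch below shows only that rescaling, applied to the 4-in-71-million figure quoted in (c); it is not the ERANGE algorithm, and the function name is invented for the example.

```python
def reads_per_million(read_count, total_mapped_reads):
    """Scale a raw read count to reads per million mapped reads."""
    return read_count * 1_000_000 / total_mapped_reads


# Figures quoted in panel (c): 4 reads assigned to the Myf6 gene model
# out of 71 million mapped liver and brain reads.
myf6_background = reads_per_million(4, 71_000_000)
print(round(myf6_background, 3))  # 0.056
```

Dividing by the total number of mapped reads makes counts comparable between libraries sequenced to different depths.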

Page 143: Gene Expression - Microarrays


RNA-seq

Page 144: Gene Expression - Microarrays

Acknowledgements

• This presentation uses slides/graphics from: J. Pevsner (Johns Hopkins, http://www.bioinfbook.org)

J. Quackenbush (DFCI, Harvard)

C. Dewey (Wisconsin, http://www.biostat.wisc.edu/bmi576)
