Introduction to Microarray Data Analysis

Sajid Khan

Chapter 6

GPGCM Abbottabad


Outline

1. Introduction
2. Microarrays Technology
3. Types and Uses of Microarrays
4. Microarrays for the Study of Gene Expression
5. Fabrication
   1. Spotted microarrays
   2. Oligonucleotide microarrays
6. Experiments with Microarrays
   1. Flow chart of an experiment with microarrays
7. Software for Microarray Data Analysis

Introduction (1): Brief review of molecular biology...

• Most life forms are made of cells. A multicellular individual has a very large number of cells.
• Each cell contains chromosomes (e.g. human cells contain 23 pairs of chromosomes). These organized structures of DNA and proteins are the carriers of inherited information.
• A chromosome is a single piece of coiled DNA containing many genes, regulatory elements and other nucleotide sequences.

Introduction (2): What else?

• The genome of an organism is encoded in DNA (or in RNA, in some viruses).
• A gene is the basic unit of heredity in a living organism; it is the portion of the DNA that codes for a protein or an RNA.
• Each protein-coding gene is transcribed into mRNA molecules and, in turn, the mRNA is translated into at least one protein in some cells.

Introduction (3): The Central Dogma of Molecular Biology

• Information flow from DNA to RNA to protein occurs in four stages
1. Replication: the DNA replicates its information in a process that involves many enzymes.
2. Transcription: the DNA codes for the production of messenger RNA (mRNA).
3. Splicing: in eukaryotic cells, the mRNA is processed and migrates from the nucleus to the cytoplasm.
4. Translation: messenger RNA carries coded information to ribosomes. The ribosomes read this information and use it for protein synthesis.
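The transcription and translation stages can be illustrated with a minimal Python sketch. It assumes the standard genetic code, takes the DNA coding strand as input (so the mRNA is the same sequence with U in place of T), and ignores splicing and replication entirely. The function names and the toy sequence are hypothetical, chosen only for illustration.

```python
# Standard genetic code, listed with bases in the order U, C, A, G
# for each codon position ('*' marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand):
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA from its first AUG up to the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")  # hypothetical toy gene: Met, Ala, stop
protein = translate(mrna)
print(mrna, protein)
```

The protein is written in one-letter amino acid codes; the sketch expects sequences containing only A, C, G and T/U.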

Introduction (4): Techniques in Molecular Biology

• Molecular biology has developed multiple techniques to measure levels of RNA, DNA, proteins or metabolites, such as
  – Southern blot
  – Northern blot
  – Differential display
  – SAGE
• The post-genomics era is characterized by its capability to perform large-scale experiments and to analyze their data sets simultaneously

Introduction (5): The paradigm shift

• With the same resources we obtain a picture with lower resolution but with a view of the whole context

Based on “The paradigm shift” slide from J. Dopazo (CNIO)

Introduction (6): An analogy with the pre-genomics era

• Biology used to “spy” on genes in depth and individually (i.e. gene by gene)

Introduction (7): An analogy with the post-genomics era

• Nowadays, a lot of genes can be “spied” on at the same time... but...
• … how can we separate the wheat from the chaff?

Microarrays Technology (1): Broadly speaking...

• Microarrays are a variety of platforms in which high-density assays are performed in parallel on a solid support.
• This technology has changed the way biologists approach problems, and it introduces new challenges for statisticians because of the huge quantity of data generated in each experiment.
• They have been used for all kinds of biological problems.
• The literature contains almost 8000 papers with the word “microarray” in the title.

[Figure: bar chart of publications in PubMed with the word “microarray” in the title. x-axis: year (1998 to 2010); y-axis: number of publications, with yearly counts ranging from 83 to 1091.]

Microarrays Technology (2): The biological principle involved in microarrays...

• It is the same one that allows DNA double helices to provide the basis for heredity.
• Sequences of DNA or RNA molecules containing complementary base pairs have a natural tendency to bind together.

...AAAAAGCTAGTCGATGCTAG...
...TTTTTCGATCAGCTACGATC...

• If we know the mRNA sequence, we can build a probe for it using the complementary sequence.
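Building a probe from a known sequence can be sketched in Python as below. The sketch assumes plain Watson-Crick pairing, returns a DNA probe (so A and U in the target both pair with T or A as appropriate), and uses hypothetical function names. `complement` pairs base by base, as in the two aligned strands shown on the slide; `reverse_complement` gives the same strand read in the opposite direction, which is how sequences are usually written.

```python
# Watson-Crick pairing; "U" is included so that an mRNA target can be given.
COMPLEMENT = {"A": "T", "T": "A", "U": "A", "G": "C", "C": "G"}


def complement(sequence):
    """Return the complementary DNA strand, aligned base by base."""
    return "".join(COMPLEMENT[base] for base in sequence.upper())


def reverse_complement(sequence):
    """Return the complementary strand read in the opposite direction."""
    return complement(sequence)[::-1]


target = "AAAAAGCTAGTCGATGCTAG"  # the example sequence from the slide
probe = complement(target)
print(probe)
```

For the slide's example, `probe` equals the second strand shown there, `TTTTTCGATCAGCTACGATC`.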

Microarrays Technology (3): But... What is a microarray?

• It consists of a large set (thousands to tens of thousands) of specific sequences (known as probes or features) attached in order (array) to microscopic spots on a solid support (nylon, silicon, glass, ...).
• A probe (that can be a gene, a protein, a metabolite, ...) is used to hybridize a molecule of a nucleic acid sample (called target) under high-stringency conditions.
• Probe-target hybridization is used to determine the relative abundance of nucleic acid sequences in the targets (e.g. to determine sequences, to detect variations in gene sequences, expression levels, gene mapping, ...).

[Figure: schematic of a microarray. Probes 1 to k, designed for gene 1, gene 2, gene 3, gene 4, ..., occupy the spots of the array, which is hybridized with molecules from samples 1 to r.]

Types and Uses of Microarrays (1): Types of microarrays

• Microarrays spatially arranged on a solid surface are the most widely used.
• Bead arrays are created by
  – either impregnating beads with different concentrations of fluorescent dye,
  – or some type of barcoding technology.
• The beads are addressable and used to identify specific binding events that occur on their surface.

Types and Uses of Microarrays (2): Uses of Microarrays (1)

• Expression analysis
  – The process of measuring gene expression via RNA (or cDNA after reverse transcription) is called expression analysis or expression profiling.
  – In these experiments the expression levels of thousands of genes are monitored simultaneously to study the effects of certain treatments, diseases, and developmental stages on gene expression.
• Comparative Genomic Hybridization
  – Comparative genomic hybridization (CGH), or Chromosomal Microarray Analysis (CMA), is used to analyze copy number changes (increases or decreases) of important chromosomal fragments harboring genes involved in diseases.
• Mutation analysis
  – A single base difference between two sequences is known as a Single Nucleotide Polymorphism (SNP), and detecting them is known as SNP detection.
  – Using gDNA, arrays of this kind try to detect genes that might differ from each other by as little as a single nucleotide base.

Types and Uses of Microarrays (2): Uses of Microarrays (2)

[Figure: example images of array types]
• Expression arrays
• cDNA nylon membrane array
• cDNA Agilent array
• GeneChip Affymetrix array
• SNP array (Affymetrix)
• CNV array (Illumina)
• CGH arrays
• Protein array
• Tissue array

Types and Uses of Microarrays (3): Applications of Microarrays

• Gene discovery
  – Identification of new genes, and study of their function and expression levels under different conditions.
• Molecular classification of complex diseases
  – To classify the types of cancer on the basis of the patterns of gene activity in the tumor cells, in order to develop more effective drugs.
• Drug discovery
  – Comparative analysis of the genes from a diseased and a normal cell helps to identify the biochemical constitution of the proteins synthesized by the diseased genes. This information can be used to synthesize drugs that combat these proteins and reduce their effect.
• Toxicological research
  – Microarray technology provides a robust platform for research on the impact of toxins on cells and on their transmission to the progeny.

Microarrays for the Study of Gene Expression (1): What is gene expression?

• Gene expression is the presence of the products of a gene, in the form of mRNA (or protein), in a cell
• To put it plainly: since all cells contain the same genetic information, what makes brain cells different from heart cells is gene expression.

Microarrays for the Study of Gene Expression (2): Finding Differentially Expressed Genes (DEG)

• Goal: to find genes that display a large difference in gene expression between two conditions and are homogeneous within them
  – Typically statistical tests (t-test, Wilcoxon test) are used
• If there are more than two conditions, or if conditions are nested, the appropriate statistical method is ANOVA
• p-values from these tests have to be corrected for multiple testing

[Figure: nested design with Control and Treatment samples within Cell Line 1 and Cell Line 2]
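The multiple-testing correction mentioned above can be sketched as follows. The slide does not name a method; this sketch assumes the Benjamini-Hochberg procedure, one common choice for controlling the false discovery rate. The p-values and the 0.05 threshold are hypothetical illustration values, and the per-gene tests that would produce them are not shown.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per gene (e.g. from per-gene t-tests).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)

# Genes called differentially expressed at a 5% false discovery rate.
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
print(adjusted, significant)
```

Note that the third and fourth genes have raw p-values below 0.05 but are no longer called significant after adjustment, which is the point of the correction when thousands of genes are tested at once.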

Microarrays for the Study of Gene Expression (3): Exploratory data analysis (1)

• Goal: to find groups that are not defined yet (e.g. novel disease subtypes)
• Methods from this field
  – were the first to be used for microarray data
  – should be used only if no prior knowledge exists that could be incorporated
  – find patterns in the data, but any patterns, whether they are meaningful or not
  – include
    • Clustering (hierarchical and partitioning)
    • Projection (PCA, MDS)

Alizadeh et al. Nature 403:503–511 (2000)

Microarrays for the Study of Gene Expression (4): Time series, partitioning clustering and correlation

• Usually used to find patterns of co-expressed genes
• The meaning of time series is different for
  – Biologists: 2-10 time points
  – Statisticians: >200 time points
• “Non-optimal” solution: to use clustering methods to find such patterns
  – Note that they are by no means exhaustive, and that no significance measure can be attached to them
  – In contrast to exploratory data analysis (EDA), partitioning cluster methods are more popular here (e.g. K-means or self-organizing maps)
• To seek genes whose expression profile is similar to that of a paradigmatic gene, correlations can be calculated and the genes sorted by them. There is no need for clustering.
• Special methods exist for periodic changes (⇒ cell cycle), e.g. Fourier analysis
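Ranking genes by correlation with a reference gene, as described above, can be sketched in plain Python. The sketch assumes Pearson correlation (the slide does not say which correlation is meant); the gene names and expression profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression profiles over four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
reference = "geneA"  # the "paradigmatic" gene

# Correlate every other gene with the reference and sort, best match first.
ranking = sorted(
    ((name, pearson(profiles[reference], values))
     for name, values in profiles.items() if name != reference),
    key=lambda item: item[1],
    reverse=True,
)
for name, r in ranking:
    print(name, round(r, 3))
```

Genes at the top of the ranking follow the reference profile most closely; genes at the bottom (correlation near -1) show the opposite pattern, which can be equally interesting.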

Microarrays for the Study of Gene Expression (5): Classification

• When information about grouping of the samples is available, it can (and should) be used to get improved results
• Groupings may be:
  – Treatment and Control
  – Disease and Normal
  – Disease stage 1, 2, 3
  – Mutant and Wild Type
  – Good and Poor Outcome
  – Therapy success or failure
  – ...
• One learns characteristic patterns from a training set and evaluates them by predicting the classes of a test set
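The train-then-test idea can be sketched with a deliberately simple classifier. The slide does not prescribe a method; this sketch assumes a nearest-centroid rule (each class is summarized by its mean expression profile, and a new sample gets the label of the closest centroid). All profiles and labels are hypothetical.

```python
def centroid(samples):
    """Mean expression profile of a list of samples."""
    n = len(samples)
    return [sum(values) / n for values in zip(*samples)]


def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(training_set):
    """Learn one centroid per class from (profile, label) pairs."""
    by_label = {}
    for profile, label in training_set:
        by_label.setdefault(label, []).append(profile)
    return {label: centroid(group) for label, group in by_label.items()}


def predict(centroids, profile):
    """Return the label whose centroid is closest to the profile."""
    return min(centroids,
               key=lambda label: squared_distance(centroids[label], profile))


# Hypothetical data: each profile holds the expression of three genes.
training_set = [
    ([5.1, 1.0, 3.2], "control"),
    ([4.9, 1.2, 3.0], "control"),
    ([2.0, 4.1, 3.1], "treatment"),
    ([2.2, 3.9, 2.9], "treatment"),
]
test_set = [
    ([5.0, 0.9, 3.1], "control"),
    ([1.9, 4.0, 3.0], "treatment"),
]

centroids = train(training_set)
predictions = [predict(centroids, profile) for profile, _ in test_set]
accuracy = sum(p == label
               for p, (_, label) in zip(predictions, test_set)) / len(test_set)
print(predictions, accuracy)
```

The test set is kept apart from training so that the accuracy reflects how well the learned patterns generalize, not how well they memorize the training samples.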

Microarrays for the Study of Gene Expression (6): Survival Analysis

• Goal: to find patterns that are associated with prolonged patient survival time
• Instead of treating outcome as a binary variable, one can use
  – the overall survival time or
  – the event-free survival time
  as continuous variables, and try to estimate them by regression
• Since the risk of suffering a relapse decreases with time, linear regression models are almost always inappropriate; specialized models would be better
  – Cox regression
  – Regression trees

Microarrays for the Study of Gene Expression (7): Pharmacogenomics

• Goal: to find molecular predictors that tell about the probable success (or failure) of a certain therapy, e.g.
  – estrogen receptor status for tamoxifen (antihormone) therapy
  – HER2/NEU status for herceptin therapy in breast cancer
• One may regard treatment outcome as a discrete variable and use classification methods
• Sometimes it is convenient not to wait for the final endpoint (which may be years away), but to use surrogate variables, e.g.
  – the drop in the blood level of a certain protein
  – reduction in tumor volume

Fabrication: Two main technologies

• There are many types of technologies, but the principles are the same
• The most used are spotted arrays and in situ arrays
• Spotted arrays (aka cDNA arrays or Stanford arrays)
  – Previously synthesized cDNAs or oligonucleotides are deposited on the chip
  – Based on “printing-like” technologies
• In situ arrays (aka oligo arrays or Affy arrays)
  – Probes are synthesized directly on the chip
  – Based on photolithographic techniques
  – Affymetrix arrays are the best-known... but not the only ones!

Spotted Arrays (1): From the chips to the images

1. Chip Design and Production
2. Sample Preparation
3. Hybridization
4. Scanning and Capturing Images
5. Image Analysis
6. Quantification

Spotted Arrays (2): Chip design and construction

• Production begins with the selection of the “probes” to be printed on the array
• In general they are chosen from
  – GenBank (http://www.ncbi.nlm.nih.gov/)
  – dbEST (http://www.ncbi.nlm.nih.gov/UniGene/index.html)
• cDNAs are printed on the array
  – Each spot can contain unique sequences
  – “Printing” means adhering sequences to the spots

A movie of the printing process accompanies the original slides.

Spotted Arrays (3): Sample preparation

1. RNA is extracted from the samples
2. This RNA is converted to fluorescently labeled cDNA by reverse transcription in the presence of fluorescently labeled nucleotide precursors
3. RNA from each sample is labelled with a different fluorescent dye (e.g. Cy-3, Cy-5) to allow direct comparison
4. After labeling, the samples are mixed and hybridized with the sequences (probes) on the array

Spotted Arrays (4): Hybridization with probes

• Targets are labeled and combined

A movie of the hybridization process accompanies the original slides.

Spotted Arrays (5): Scanning and capturing image

• After hybridization each DNA spot is illuminated and fluorescence measures are taken for each dye separately
• These measurements will be used, after the appropriate quality controls, to determine the relative abundance of the sequence of each specific gene in the two mRNA or DNA samples

Spotted Arrays (6): Image analysis (1)

• TIFF images are processed by image analysis programs
  – SPOT
  – GenePix
  – ...
  to acquire intensity values for each spot
• These measures will be used, after the appropriate quality controls, to determine the relative abundance of the sequence of each specific gene in the two mRNA or DNA samples

Spotted Arrays (7): Image analysis (2)

Steps in Image Processing
1. Addressing: estimate the location of spot centers
2. Segmentation: classify the pixels of each spot region as
   ● foreground (signal)
   ● background (noise)
3. Information extraction (quantification): for each spot $g$ on the array, and each dye, obtain
   – Signal measurements ($R_g$, $G_g$)
   – Background measurements ($bgR_g$, $bgG_g$)
   – Quality indicators

Spotted Arrays (8): Quantification

● Gene expression is measured from the intensity measures as the relative intensity: the (corrected) intensity of one dye vs the (corrected) intensity of the other
● Background correction may be needed, or not, according to the array quality

$$M = \frac{R_g}{G_g}, \qquad M_{\text{Corrected}} = \frac{R_g - bgR_g}{G_g - bgG_g}$$
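The two ratios on this slide, the raw red/green ratio and its background-corrected version, can be computed per spot as in the sketch below. The function name and the intensity values are hypothetical, and returning `None` when the corrected green signal is not positive is an assumption of this sketch, not something the slides specify.

```python
def intensity_ratios(R, G, bgR, bgG):
    """Return (M, M_corrected) for one spot.

    R, G     : signal intensities of the two dyes (G is assumed positive)
    bgR, bgG : background intensities of the two dyes
    M_corrected is None when the background-corrected green signal is not
    positive, because the ratio is then undefined or meaningless.
    """
    m = R / G
    corrected_green = G - bgG
    if corrected_green <= 0:
        return m, None
    return m, (R - bgR) / corrected_green


# Hypothetical spot measurements.
m, m_corrected = intensity_ratios(R=1300.0, G=600.0, bgR=100.0, bgG=200.0)
print(m, m_corrected)

# A weak spot whose green signal does not exceed its background.
weak_m, weak_corrected = intensity_ratios(R=300.0, G=150.0, bgR=100.0, bgG=150.0)
print(weak_m, weak_corrected)
```

In the first spot the background correction changes the ratio noticeably (from about 2.17 to 3.0), which illustrates why the choice of whether to correct depends on array quality.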

In situ Chips (1): From the chips to the images

1. Main Concepts
2. Synthesis of Oligos on the Chip
3. Sample Preparation
4. Hybridization Process
5. Scanning Images
6. Output Images
7. Quantification and Expression Measures

In situ Chips (1): Main concepts (1)

• More advanced design than spotted cDNA arrays
  – They are NOT based on competitive hybridization. That is, one chip, one sample
  – Probes are NOT added to the chip after being synthesized in vitro
• Main idea: probes are synthesized in situ (on the chip)
  – Sequences are built up on the chip surface by sequentially elongating a growing chain with a single nucleotide using photolithography
• The chemical yield of the stepwise elongation is limited
  – Sequences can NOT grow beyond 25-mers (oligos)
  – 16-20 different 25-mer sequences are needed to uniquely characterize a gene
• Probe = individual 25-mer sequence
• Probe set = set of 25-mers corresponding to a particular gene/EST

In situ Chips (2): Main concepts (2)

• Affymetrix (http://www.affymetrix.com) is the leading company for this kind of chip. They call them GeneChips
• Each gene is represented by a set of short sequences
• Some of these chips contain whole genomes, that is, > 50,000 probe sets
• A probe set (usually written probeset) is used to measure the mRNA levels of a single gene
• Each probeset is made up of multiple probe cells
  – with millions of copies of one oligo (25 bp)
  – organized in probe pairs with
    • a Perfect Match (PM): it matches a piece of the gene perfectly
    • a Mismatch (MM): it is the same as the PM but with the central nucleotide changed to its complement
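The relation between the two probes of a pair can be made concrete with a short sketch: the MM probe is derived from the PM probe by complementing the central base (position 13 of a 25-mer). The function name and the example 25-mer are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(perfect_match):
    """Return the MM probe: the PM sequence with its central base complemented."""
    if len(perfect_match) % 2 == 0:
        raise ValueError("expected an odd-length probe with a single central base")
    middle = len(perfect_match) // 2  # index 12 (13th base) for a 25-mer
    return (perfect_match[:middle]
            + COMPLEMENT[perfect_match[middle]]
            + perfect_match[middle + 1:])


pm = "ACGTACGTACGTACGTACGTACGTA"  # hypothetical 25-mer PM probe
mm = mismatch_probe(pm)

# Positions at which the two probes differ (only the central one).
differences = [i for i, (a, b) in enumerate(zip(pm, mm)) if a != b]
print(pm)
print(mm)
print(differences)
```

Because the two probes differ in a single base, the MM signal is intended as an estimate of non-specific hybridization for its PM partner.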


In situ Chips (4): GeneChip® expression array design

In situ Chips (5): One gene, one probe set

• Probes are selected to be specific to the represented gene
• They must have good hybridization properties

[Figure: gene sequence]

In situ Chips (6): Synthesis of oligos on the chip (1)

• GeneChip® probe arrays are manufactured through a unique and robust process, a combination of photolithography and combinatorial chemistry

Image courtesy of Affymetrix

In situ Chips (7): Synthesis of oligos on the chip (2)

[Figure: GeneChip synthesis with a succession of masks, each step adding single nucleotides (A, C, G, T) to the growing chains.]

Image from a course of Dan Nettleton

In situ Chips (8): Synthesis of oligos on the chip (3)

• Several copies of a single feature are deposited in each cell

Image courtesy of Affymetrix

In situ Chips (9): Sample preparation

In situ Chips (8): Hybridization process

• Once the oligos have been synthesized, hybridization is performed by adding to the chip the mRNA from the tissue to be analyzed

Image courtesy of Affymetrix

In situ Chips (9): Scanning Images

• Scanning of tagged and un-tagged probes on an Affymetrix GeneChip® microarray

Image courtesy of Affymetrix

In situ Chips (10): Output Image

• Data from an experiment showing the expression of thousands of genes on a single GeneChip® probe array

Image courtesy of Affymetrix

In situ Chips (11): Quantification

• Intensities from each element are extracted
• Quantitative analysis of the hybridization results is performed by analyzing the hybridization pattern of the set of PM and MM probes of every gene
• In contrast with spotted chips, the expression measures used here are absolute ones. That is, each chip is hybridized with only one tissue at a time

In situ Chips (12): Absolute expression measures

• Measures to determine the quantitative RNA abundance, i.e. the expression level, are based on the average of the differences PM minus MM for each probe family

$$\text{Avg.Diff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$$

where $A$ is the set of probe pairs used for the probe family.

• Many alternatives have been introduced
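The average difference for one probe set can be sketched as below. The sketch assumes that the set A contains every probe pair of the probe set (the slide does not say whether some pairs are excluded); the function name and intensities are hypothetical.

```python
def average_difference(pm, mm):
    """Average of PM - MM over the probe pairs of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical PM and MM intensities for a probe set with four probe pairs.
pm = [1200.0, 950.0, 1100.0, 800.0]
mm = [300.0, 250.0, 400.0, 500.0]

avg_diff = average_difference(pm, mm)
print(avg_diff)
```

A value near zero (or negative) means the PM probes are no brighter than their MM partners, i.e. there is little evidence of specific hybridization for that gene.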

Spotted vs In situ Arrays: Pros and Cons

cDNA microarrays (spotted)
• Pros
  – Cheaper
  – Flexibility with the experimental design
  – High signal intensity (large sequences)
• Cons
  – Low reproducibility
  – Cross-hybridization (low specificity)
  – High manipulation (possibility of contamination)

Oligo microarrays (in situ)
• Pros
  – Quick manufacture (automated)
  – High reproducibility
  – High specificity
  – A lot of probes/genes
• Cons
  – Requires more specialized equipment
  – Expensive
  – Low flexibility

In situ Chips (13): Overview of the process

A movie of the whole process accompanies the original slides.

Image courtesy of Affymetrix

Experiments with Microarrays: Flowchart of an experiment with microarrays

Software (1): Which software for the analysis?

• Microarray experiments generate huge quantities of data which have to be
  – stored, managed, visualized, processed...
• Many options are available. However... no tool satisfies all users' needs
• Trade-off. A tool must be
  – Powerful but user friendly
  – Complete but without too many options
  – Flexible but easy to start with and go further
  – Available, up to date and well documented, but affordable

Software (2): Some options are...

• Partek
  – Commercial market leader
  – Many useful facilities
  – Some infelicities
• Bioconductor
  – Open Source R software package
  – Vigorous development, new things show up here first
• GEPAS
  – Freely available
  – Based on Bioconductor and R

Software (3): So, is R what you need?

• R is an open-source system for statistical computation and graphics.
• It consists of
  – A language
  – A run-time environment with
    • Graphics
    • A debugger
    • Access to certain system functions
• It can be used
  – either interactively, through a command language
  – or by running programs stored in script files

Software (4): R and Microarrays

• R is a popular tool among statisticians
• Once they started to work with microarrays they continued using it
  – To perform the analysis
  – To implement new tools
• This very quickly gave rise to lots of free R-based software to analyze microarrays
• The Bioconductor project groups many (but not all) of these developments

Software (5): The Bioconductor project

http://bioconductor.org

• Open source and open development software project for the analysis and comprehension of genomic data
• Most early developments were released as R packages
• Extensive documentation and training material from short courses
• Has reached very good stability...
• But what is now a standard may not be so in the future
