1 Gene Expression and DNA Chips Based on slides by Ron Shamir http://www.bio.davidson.edu/courses/genomics/chip/chip.html


Gene Expression and DNA Chips

Based on slides by Ron Shamir

http://www.bio.davidson.edu/courses/genomics/chip/chip.html


Monitoring Gene Expression

• Goal: Simultaneous measurement of expression levels of all genes in one experiment.

• Two fundamental biological assumptions:
  – Transcription level indicates a gene’s regulation.
  – Only genes that contribute to organism fitness are expressed.

=> Detecting changes in a gene’s expression level provides clues on the function of its product

Factors controlling expression

[Figure: DNA → (transcription) → pre-mRNA → (splicing) → mature mRNA → (translation) → protein]

• Chromatin remodeling
• Alternative splicing
• RNA interference / degradation
• Post-translational modifications

Hybridization

• DNA double strands form by “gluing” of complementary single strands
• Complementarity rule: A-T, G-C

ACTCCG
||||||
TGAGGC

Use a probe to identify whether the target contains a particular sequence
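The complementarity rule can be sketched in a few lines of Python. This is only an illustration: the function names (`complement`, `reverse_complement`, `probe_binds`) are hypothetical, and the check demands a perfect match, whereas real hybridization also tolerates some mismatches.

```python
# Watson-Crick pairing rule from the slide: A-T, G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Base-by-base complement, read in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand):
    """Complement read in the opposite direction (the two strands are antiparallel)."""
    return complement(strand)[::-1]

def probe_binds(probe, target):
    """True if the probe can pair perfectly with some stretch of the target.

    Both sequences are assumed to be written 5' to 3'.
    """
    return reverse_complement(probe) in target

# The slide's example pair, ACTCCG over TGAGGC.
print(complement("ACTCCG"))
print(probe_binds("CGGAGT", "TTACTCCGAA"))
```

A probe written 5' to 3' pairs with the reverse complement of itself, which is why `probe_binds` reverses the strand before searching.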

DNA chips / Microarrays

• Perform thousands of hybridizations in a single experiment
• Variants:
  – Oligonucleotide arrays
  – cDNA microarrays
• Another distinction:
  – Single channel
  – Dual channel
• Allow a global view of cellular processes: monitor transcription levels of numerous/all genes simultaneously.


Oligonucleotide Arrays

A single feature on the chip

[Figure source: http://www.affymetrix.com/corporate/media/image_library/]


Hybridization on chip


Reading off a chip


Manufacturing a GeneChip Array


Many probes for a single gene


cDNA Microarrays

For a Flash animation of the technology, see http://www.bio.davidson.edu/Courses/genomics/chip/chip.html


cDNA Microarrays (2)

Affymetrix oligo arrays vs cDNA microarrays

Affymetrix oligo arrays   | cDNA microarrays
--------------------------|---------------------
Short oligos              | Long oligos
Low specificity           | High specificity
High density              | Lower density
Many probes per gene      | One probe per gene
Synthetic oligos          | Probes: cDNAs
Absolute exp values       | Relative exp values
Yield problems            | Spotting problems
“Turnkey” solutions       | Custom solutions
Price: +++                | Price: ++

…and other technologies

• Agilent:
  – In situ synthesized arrays using ink-jet technology
  – 60-mer arrays: more specific than Affy’s
  – Allows custom design without expensive masks
  – Differential measurements: target vs reference
• Nimblegen
• Illumina

Comparative genomic hybridization (CGH) microarrays

(almost identical to gene expression arrays, but genomic DNA is hybridized instead of mRNA)

[Figure: genomic DNA is isolated from the cells of interest and from a reference sample, fluorescently labeled, and hybridized to known DNA sequences on a glass slide]


Robert Lucito et al. Genome Res. 2003; 13: 2291-2305

Chromosomes with varying copy number fluctuations from analysis of the tumor cell line SK-BR-3 as compared with the normal reference

Single nucleotide polymorphism (SNP) detection
(Affymetrix Human Mapping 500K Array)

SNP: single base sequence variation

Target sequence:  GCCATGCANGAGTTACTACAGTAGC   (N = A/G)

PM +4 Allele A:   CGGTACGTTCTCAATGATGTCATCG
MM +4 Allele A:   CGGTACGTTCTCTATGATGTCATCG
PM +4 Allele B:   CGGTACGTCCTCAATGATGTCATCG
MM +4 Allele B:   CGGTACGTCCTCTATGATGTCATCG

Remember Gene Transcription?

• Transcription factors (proteins) recognize transcription factor binding sites and bind to them, forming a complex.
• RNA polymerase (a protein) binds the complex.

[Figure (eukaryotes): double-stranded DNA, G A T T A C A... paired with C T A A T G T..., strand ends labeled 5’ and 3’, with transcription factors and RNA polymerase bound]

Using microarrays to measure protein-DNA interactions

ChIP-chip: chromatin immunoprecipitation (ChIP) followed by hybridization to a chip (microarray)
• Antibodies bind the transcription factor of interest
• TF-bound sequences are hybridized to the microarray

Simon et al., Cell 2001


Mapping transcription factor binding sites in yeast with ChIP-chip

Harbison C., Gordon B., et al. Nature 2004


Dynamic role of transcription factors

Harbison C., Gordon B., et al. Nature 2004

Other microarray applications: Competitive growth assays

Deletion strain:   yfg1      yfg2      yfg3
Barcode:           CTAACTC   TCGCGCA   TCATAAT

• Growth for 6 hrs in minimal media or rich media
• Harvest and label genomic DNA


Measuring relative fitness with a barcode microarray

Oligo barcodes matching each strain are also spotted on a DNA microarray

Protein Microarrays

• Protein microarrays are lagging behind DNA microarrays
• Same idea, but the immobilized elements are proteins instead of nucleic acids
• The number of elements (proteins) on current protein microarrays is limited (approx. 500)
• Antibodies for high-density microarrays have limitations (cross-reactivities)
• Aptamers or engineered antibodies/proteins may be viable alternatives
  (Aptamers: RNAs that bind proteins with high specificity and affinity)

Applications

Screening for:
• Small molecule targets
• Post-translational modifications
• Protein-protein interactions
• Protein-DNA interactions
• Enzyme assays
• Epitope mapping

High-throughput proteomic analysis

Haab et al. Genome Biology 2000;1:1-22

Label all proteins in the mixture

Cytokine Specific Microarray (Microarray version of ELISA)

[Figure: capture MAb, antigen, biotinylated MAb and detection system; array spots: marker protein, cytokine, VEGF, IL-10, IL-6, IL-1, MIX]

Tissue Microarrays

• Printing tiny amounts of tissue on a slide
• Array many patients on one slide (e.g. 500)
• Process all at once (e.g. immunohistochemistry)
• Works with archival tissue (paraffin blocks)

Tissue Microarray

Alizadeh et al. J Pathol 2001;195:41-52

How Gene Expression Data Looks

[Figure: the “Raw Data” matrix of expression levels; rows = genes, columns = conditions]

Entries of the Raw Data matrix:
• Ratio values
• Absolute values
• Distributions…

• Row = gene’s expression pattern / fingerprint vector
• Column = experiment/condition’s profile

Normalization is important!!

Data Preprocessing

• Input: real-valued raw data matrix (rows = genes, columns = conditions).
• Compute the similarity matrix (dot product/correlation/…).

From the Raw Data matrix we compute the similarity matrix S. $S_{ij}$ reflects the similarity of the expression patterns of gene i and gene j.

[Figure: the “Raw Data” matrix of expression levels next to the resulting similarity matrix, with both axes running from 10 to 60]
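As a minimal sketch of this step, assuming the raw data is held as a list of rows (one row per gene) and using the dot product as the similarity measure, the matrix S could be built as follows. The names `dot` and `similarity_matrix` and the toy numbers are illustrative only, not part of the original slides.

```python
def dot(u, v):
    """Dot product of two equal-length expression vectors."""
    return sum(a * b for a, b in zip(u, v))

def similarity_matrix(raw, sim=dot):
    """Return S where S[i][j] = sim(row i, row j); rows are genes."""
    n = len(raw)
    return [[sim(raw[i], raw[j]) for j in range(n)] for i in range(n)]

# Toy raw data: 3 genes measured under 3 conditions (made-up values).
raw = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 2.0, 1.0],
]
S = similarity_matrix(raw)
for row in S:
    print(row)
```

Any other pairwise measure (for example a correlation function) can be passed through the `sim` parameter without changing the rest of the code.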

DNA chips: Applications

• Deducing functions of unknown genes (similar expression pattern → similar function)
• Identifying disease profiles
• Deciphering regulatory mechanisms (co-expression → co-regulation)
• Classification of biological conditions
• Genotyping
• Drug development
• …

Analysis requires clustering of genes/conditions.

Pearson Correlation Coefficient, r

• Values are in the interval [-1, 1].
• Gene expression over d experiments is a vector in $\mathbb{R}^d$, e.g. for gene C: (0, 3, 3.58, 4, 3.58, 3)
• Given two vectors X and Y that contain N elements, we calculate r as follows:

$$r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}}$$

where $\bar{X}$ and $\bar{Y}$ are the means of X and Y.

Cho & Won, 2003
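A small Python sketch of the standard Pearson definition, applied to the gene C vector from the slide. The function name `pearson` is illustrative, and the sketch assumes neither vector is constant (otherwise the denominator is zero).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

gene_c = [0, 3, 3.58, 4, 3.58, 3]          # example vector from the slide
scaled = [2 * v + 1 for v in gene_c]       # same shape, different scale
flipped = [-v for v in gene_c]             # mirror image

print(pearson(gene_c, scaled))    # close to 1
print(pearson(gene_c, flipped))   # close to -1
```

The scaled copy illustrates that r depends only on the shape of the expression pattern, not on its absolute level or amplitude.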

Intuition for Pearson Correlation Coefficient

• r(v1,v2) close to 1: v1, v2 highly correlated.
• r(v1,v2) close to -1: v1, v2 anti-correlated.
• r(v1,v2) close to 0: v1, v2 not correlated.

Pearson Correlation and p-Values

When the entries of v1, v2 are distributed according to a normal distribution, we can assign (and efficiently compute) p-values for a given result.

These p-values are determined by the Pearson correlation coefficient, r, and the dimension, d, of the vectors. For the same r, vectors of higher dimension will be assigned a more significant (smaller) p-value.
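One common way to obtain such p-values (an assumption here; the slides do not name the test) is through the statistic $t = r\sqrt{(d-2)/(1-r^2)}$, which follows a Student t distribution with d-2 degrees of freedom when the data are normal and the true correlation is zero. The sketch below computes only this statistic; `t_statistic` is a hypothetical helper, and turning t into a p-value needs a t-distribution tail function that the Python standard library does not provide.

```python
import math

def t_statistic(r, d):
    """t statistic for a Pearson coefficient r measured on vectors of dimension d.

    Assumes d > 2 and |r| < 1. A larger |t| means a smaller p-value.
    """
    return r * math.sqrt((d - 2) / (1 - r * r))

# Same r, growing dimension: the statistic grows, so the p-value shrinks.
for d in (6, 20, 100):
    print(d, round(t_statistic(0.5, d), 3))
```

This shows numerically why the same r is more significant for longer vectors: the statistic scales roughly with the square root of d.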

Spearman Rank Order Coefficient
(a close relative of Pearson, non-parametric)

• Replace each entry $x_i$ by its rank in vector x.
• Then compute the Pearson correlation coefficient of the rank vectors.
• Example:
  X = Gene C = (0, 3.00, 3.41, 4, 3.58, 3.01)
  Y = Gene D = (0, 1.51, 2.00, 2.32, 1.58, 1)
• Ranks(X) = (1,2,4,6,5,3)
• Ranks(Y) = (1,3,5,6,4,2)
• Ties should be taken care of, but: (1) they are rare; (2) they can be broken at random (small effect)
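The two steps can be sketched in Python as below, using the gene C and gene D vectors from the slide. The helper names are illustrative, and the `ranks` function assumes there are no ties, as in this example; tied values would need the extra handling the slide mentions.

```python
import math

def ranks(values):
    """Rank of each entry (1 = smallest). Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

gene_c = [0, 3.00, 3.41, 4, 3.58, 3.01]
gene_d = [0, 1.51, 2.00, 2.32, 1.58, 1]
print(ranks(gene_c), ranks(gene_d))
print(spearman(gene_c, gene_d))
```

Because only the ordering of the values matters, any monotone rescaling of the measurements leaves the Spearman coefficient unchanged.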

From Pearson Correlation Coefficients to a Gene Network

• Compute the correlation coefficient for all pairs of genes (what about missing data?)
• Choose a p-value threshold.
• Put an edge between gene i and gene j iff their correlation is significant, i.e. its p-value is below the threshold.
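A rough sketch of the network construction in Python. To stay within the standard library it thresholds on |r| instead of on the p-value; this is a stand-in, justified only when all vectors have the same dimension d (then the p-value is a decreasing function of |r|). The names `correlation_network` and `min_abs_r` and the toy profiles are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def correlation_network(profiles, min_abs_r):
    """Edges (gene1, gene2) for every pair whose |r| reaches the cutoff.

    profiles maps gene name -> expression vector; all vectors share one length.
    """
    edges = set()
    for g1, g2 in combinations(sorted(profiles), 2):
        if abs(pearson(profiles[g1], profiles[g2])) >= min_abs_r:
            edges.add((g1, g2))
    return edges

# Made-up profiles: geneA and geneB rise together, geneC does not.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = correlation_network(profiles, 0.9)
print(edges)
```

With missing data the effective dimension differs from pair to pair, so a real pipeline would have to convert each r to a p-value before thresholding.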


Clustering: Objective

Group elements (genes) into clusters satisfying:

• Homogeneity: Elements inside a cluster are highly similar to each other.

• Separation: Elements from different clusters have low similarity to each other.


The Clustering Bazaar


Hierarchical clustering

An Alternative View

Form a tree-hierarchy of the input elements satisfying:
• More similar elements are placed closer along the tree.
• Or: Tree distances reflect element similarity.
• Note: No explicit partition into clusters.


Hierarchical Representations

Hierarchical Representations (2)

[Figure: dendrogram over elements 1, 3, 4, 2 with merge heights 2.8, 4.5 and 5.0]

Dendrogram: rooted tree, usually binary, in which all root-leaf distances are equal

Neighbor Joining Algorithm
Saitou & Nei, 87

• Input: Distance matrix $D_{ij}$; initially each element is a cluster.
• Find the min element $D_{rs}$ in D; merge clusters r, s.
• Delete elts. r, s; add a new elt. t with $D_{it} = D_{ti} = (D_{ir} + D_{is} - D_{rs})/2$
• Repeat.
• Present the hierarchy as a tree with similar elements near each other.

Hierarchical Clustering: Average Linkage
Sokal & Michener 58, Lance & Williams 67

• Input: Distance matrix $D_{ij}$; initially each element is a cluster. $n_r$ = size of cluster r.
• Find the min element $D_{rs}$ in D; merge clusters r, s.
• Delete elts. r, s; add a new elt. t with $D_{it} = D_{ti} = \frac{n_r}{n_r + n_s} D_{ir} + \frac{n_s}{n_r + n_s} D_{is}$
• Repeat.
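The loop above can be sketched in Python as follows. This is an illustrative, unoptimized version: distances are stored in a dictionary keyed by `frozenset` pairs, merged clusters are represented as nested tuples, and the function name `average_linkage` and the toy distances are assumptions made for the example.

```python
def average_linkage(labels, distances):
    """Agglomerative clustering with the average-linkage update.

    distances maps frozenset({a, b}) -> distance for every pair of labels.
    Returns the merges in order as (cluster_r, cluster_s, distance).
    """
    D = dict(distances)
    size = {label: 1 for label in labels}
    merges = []
    while len(size) > 1:
        pair = min(D, key=D.get)              # closest pair of clusters
        r, s = sorted(pair, key=str)
        d_rs = D.pop(pair)
        t = (r, s)                            # new cluster replaces r and s
        n_r, n_s = size[r], size[s]
        for i in size:
            if i == r or i == s:
                continue
            d_ir = D.pop(frozenset((i, r)))
            d_is = D.pop(frozenset((i, s)))
            # Size-weighted average of the distances to the merged parts.
            D[frozenset((i, t))] = (n_r * d_ir + n_s * d_is) / (n_r + n_s)
        del size[r], size[s]
        size[t] = n_r + n_s
        merges.append((r, s, d_rs))
    return merges

labels = ["a", "b", "c", "d"]
distances = {
    frozenset(("a", "b")): 2.0,
    frozenset(("a", "c")): 6.0,
    frozenset(("a", "d")): 10.0,
    frozenset(("b", "c")): 6.0,
    frozenset(("b", "d")): 10.0,
    frozenset(("c", "d")): 4.0,
}
merges = average_linkage(labels, distances)
for merge in merges:
    print(merge)
```

The recorded merge distances are the heights at which the corresponding internal nodes would be drawn in a dendrogram.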

A General Framework
Lance & Williams 67

• Input: Distance matrix $D_{ij}$; initially each element is a cluster.
• Find the min element $D_{rs}$ in D; merge clusters r, s.
• Delete elts. r, s; add a new elt. t with $D_{it} = D_{ti} = \alpha_r D_{ir} + \alpha_s D_{is} + \gamma |D_{ir} - D_{is}|$
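A short sketch of how this update rule specializes to familiar linkage methods, assuming the coefficients are written $\alpha_r$, $\alpha_s$ and $\gamma$. Single and complete linkage are not discussed in the slides; they are included here only as well-known special cases, and all function names are illustrative.

```python
def lance_williams(d_ir, d_is, alpha_r, alpha_s, gamma):
    """Distance from cluster i to the merged cluster t = r + s."""
    return alpha_r * d_ir + alpha_s * d_is + gamma * abs(d_ir - d_is)

def average_link(d_ir, d_is, n_r, n_s):
    """Average linkage: size-weighted mean, gamma = 0."""
    total = n_r + n_s
    return lance_williams(d_ir, d_is, n_r / total, n_s / total, 0.0)

def single_link(d_ir, d_is):
    """Single linkage: alpha = 1/2, gamma = -1/2 gives min(d_ir, d_is)."""
    return lance_williams(d_ir, d_is, 0.5, 0.5, -0.5)

def complete_link(d_ir, d_is):
    """Complete linkage: alpha = 1/2, gamma = +1/2 gives max(d_ir, d_is)."""
    return lance_williams(d_ir, d_is, 0.5, 0.5, 0.5)

print(average_link(3.0, 7.0, n_r=1, n_s=3))
print(single_link(3.0, 7.0), complete_link(3.0, 7.0))
```

Swapping the coefficient choice is the only change needed to turn one agglomerative method into another, which is the point of the general framework.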

Hierarchical clustering of GE data
Eisen et al., PNAS 1998

• Growth response: starved human fibroblast cells, then added serum
• Monitored levels of 8600 genes over 13 time-points using cDNA microarrays
• $t_{ij}$ = fluorescence level of target gene i in condition j; $r_{ij}$ = same for the reference
• $D_{ij} = \log(t_{ij}/r_{ij})$
• $D^*_{ij} = [D_{ij} - E(D_i)]/\mathrm{std}(D_i)$
• Similarity of genes k, l: $S_{kl} = \left(\sum_j D^*_{kj} \cdot D^*_{lj}\right)/N_{cond}$
• Applied the average linkage method
• Ordered leaves by increasing subtree weight: average expression level, time of maximal induction, other criteria
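The normalization and similarity formulas can be sketched in Python as below. The log base (2 here) is an assumption, and it does not affect the result because the standardization removes any constant factor; the standard deviation is taken over the conditions (population form) so that a gene's similarity with itself is exactly 1. Function names and intensities are made up for illustration.

```python
import math

def standardized_log_ratios(target, reference):
    """D*_i for one gene: log ratios, centered and scaled to unit std."""
    d = [math.log2(t / r) for t, r in zip(target, reference)]
    mean = sum(d) / len(d)
    std = math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))
    return [(x - mean) / std for x in d]

def similarity(d_star_k, d_star_l):
    """S_kl: average product of standardized values over the conditions."""
    return sum(a * b for a, b in zip(d_star_k, d_star_l)) / len(d_star_k)

# Made-up fluorescence levels for two genes over three conditions.
reference = [1.0, 1.0, 1.0]
rising = standardized_log_ratios([2.0, 4.0, 8.0], reference)
falling = standardized_log_ratios([8.0, 4.0, 2.0], reference)

print(similarity(rising, rising))    # close to 1
print(similarity(rising, falling))   # close to -1
```

With this standardization, $S_{kl}$ coincides with the Pearson correlation coefficient of the two log-ratio vectors.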

Comments

• Distinct measurements of the same genes cluster together
• Genes of similar function cluster together
• Many cluster-function specific insights
• Interpretation is a REAL biological challenge

More on hierarchical methods

• All methods described above are agglomerative
• An alternative approach: divisive
• Advantages:
  – Gives a single coherent global picture
  – Intuitive for biologists (from phylogeny)
• Disadvantages:
  – No single partition; no specific clusters
  – Forces all elements to fit a tree hierarchy


Non-Hierarchical Clustering

Clustering: Objective

Group elements (genes) into clusters satisfying:
• Homogeneity: Elements inside a cluster are highly similar to each other.
• Separation: Elements from different clusters have low similarity to each other.
• Needs formal objective functions
• Most useful versions are NP-hard.

K-means clustering
MacQueen, 65

• Initialize an arbitrary partition P into k clusters $C_1, \ldots, C_k$.
• For cluster $C_j$ and element $i \notin C_j$, $E_P(i, C_j)$ = cost of the solution if i is moved to cluster $C_j$. Pick the $E_P(r, C_s)$ that is minimum; move r to cluster $C_s$ if the new partition is better than P.
• Repeat until no improvement is possible.
• Requires knowledge of k.

K-means variations

• Input: vector $v_i$ for each element i
• Compute a centroid $c_p$ for each cluster $C_p$, e.g., gravity center = average vector
• Solution cost: $\sum_{\text{clusters } p} \sum_{i \in \text{cluster } p} d(v_i, c_p)$
• $E_P(i, j)$ = change in solution cost if i is moved to cluster $C_j$
• Parallel version: move each elt. to the cluster with the closest centroid simultaneously
• Sequential version: one elt. each time
• “Moving centers” approach
• Objective = homogeneity only (k fixed)
• Variations for changing k
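A minimal sketch of the parallel version in Python, assuming squared Euclidean distance for d (the slides leave d unspecified) and an initial assignment supplied by the caller. All names and the toy points are illustrative; empty clusters are simply skipped rather than re-seeded.

```python
def centroid(points):
    """Gravity center: coordinate-wise average of the points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def sq_dist(u, v):
    """Squared Euclidean distance (the assumed choice of d)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroids_of(vectors, assignment, k):
    """Centroid of each cluster, or None for an empty cluster."""
    result = []
    for p in range(k):
        members = [v for v, a in zip(vectors, assignment) if a == p]
        result.append(centroid(members) if members else None)
    return result

def solution_cost(vectors, assignment, k):
    """Sum over all elements of the distance to their cluster centroid."""
    cents = centroids_of(vectors, assignment, k)
    return sum(sq_dist(v, cents[a]) for v, a in zip(vectors, assignment))

def k_means(vectors, assignment, k, max_rounds=100):
    """Parallel version: every element moves to its closest centroid at once."""
    assignment = list(assignment)
    for _ in range(max_rounds):
        cents = centroids_of(vectors, assignment, k)
        live = [p for p in range(k) if cents[p] is not None]
        moved = [min(live, key=lambda p: sq_dist(v, cents[p])) for v in vectors]
        if moved == assignment:
            break                      # no element changed cluster
        assignment = moved
    return assignment

vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
start = [0, 1, 0, 1]                   # deliberately poor initial partition
final = k_means(vectors, start, k=2)
print(final, solution_cost(vectors, start, 2), solution_cost(vectors, final, 2))
```

The sequential version would instead move a single element per step and recompute the affected centroids before considering the next move.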

Gene Expression Omnibus (GEO)

• NCBI’s repository for gene expression data
• The EU database is ArrayExpress
• Both databases exchange data (GenBank’s model)
• Basic entities:
  – Series: a deposited experiment that has not been processed yet, but whose data is available
  – Dataset: processed and manually curated
  – Platform: a microarray platform (e.g., Affymetrix HG-U133A chips)
  – Profiles: the expression of a gene in an experiment


Novartis GNF


Promoter analysis

Position Weight Matrix (PWM), a.k.a. Position Specific Scoring Matrix (PSSM)

Example:

      1     2     3     4     5     6
A    0.1   0.8   0     0.7   0.2   0
C    0     0.1   0.5   0.1   0.4   0.6
G    0     0     0.5   0.1   0.4   0.1
T    0.9   0.1   0     0.1   0     0.3

Sequence                  Score
ATGCAGGATACACCGATCGGTA    0.0605
GGAGTAGAGCAAGTCCCGTGA     0.0605
AAGACTCTACAATTATGGCGT     0.0151

Need to set a score threshold
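The scores in the example are consistent with sliding the 6-column matrix along each sequence, multiplying the per-position probabilities in every window, and reporting the best window. The sketch below reproduces them under that assumption; it scans the given strand only, and the names `window_score` and `best_hit` are illustrative.

```python
# The example matrix: PWM[base][position] for positions 1..6.
PWM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
WIDTH = 6

def window_score(window):
    """Product of the matrix entries for the bases of one window."""
    score = 1.0
    for position, base in enumerate(window):
        score *= PWM[base][position]
    return score

def best_hit(sequence):
    """Highest-scoring window of the sequence and its score."""
    windows = [sequence[i:i + WIDTH] for i in range(len(sequence) - WIDTH + 1)]
    best = max(windows, key=window_score)
    return best, window_score(best)

for seq in ("ATGCAGGATACACCGATCGGTA",
            "GGAGTAGAGCAAGTCCCGTGA",
            "AAGACTCTACAATTATGGCGT"):
    site, score = best_hit(seq)
    print(seq, site, round(score, 4))
```

A promoter would then be called a hit when its best score exceeds the chosen threshold; practical tools usually work with log-odds scores against a background model rather than raw products.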

Computational approaches to promoter analysis

• Look for overrepresented binding sites (BSs) in groups of promoters:
  – Obtained by clustering expression profiles
  – Of genes with a common known function (e.g. from GO annotations)
  – From ChIP-chip data; requires knowledge of the TF, and an antibody
  – Use a combination of sources
• De novo or using known TF signatures

Location analysis

Ren et al., Science 290:2306-2309 (2000).

ATM-dependent Transcriptional Response to Ionizing Radiation

• DNA damage response modulates many signaling pathways, including lesion processing, repair, cell cycle checkpoints and apoptotic pathways.
• ATM protein kinase is a master regulator of the cellular response to double strand breaks.

Goal: identify the transcriptional network.

Experimental Design

S. Rashi, R. Elkon, N. Weizman, C. Linhart, N. Amariglio, N. Orlev, G. Sternberg, A. Barzilai, Y. Shiloh

• Gene expression profiles: wild-type and Atm-/- mice ± ionizing radiation. Thymus tissue, time points: 0, 30 min, 120 min
• Filtering “responding genes”: 1206 genes whose expression level changed by >1.75 fold
• Clustering: 6 main clusters generated by the CLICK algorithm
• Promoter analysis: NF-κB and p53 found by PRIMA analysis


Atm-dependent responding genes:The genes respond to radiation only in wild type

Major Gene Clusters – Irradiated Thymus


Atm-dependent 2nd wave of responding genes

Major Gene Clusters – Irradiated Thymus


Major Gene Clusters – Irradiated Thymus

Similar response in both genotypes

[Figure: ATM at the top; a hidden layer of transcription factors (p53, TF-A, TF-B, TF-C, TF-D) linked by unknown (“?”) connections; an observed layer of genes g1 to g13]

Clues are in the promoters


PRIMA: PRomoter Integration in Microarray Analysis

• Assumption: Co-expression → Transcriptional co-regulation → common cis-regulatory promoter elements

• Step 1: Identification of co-expressed genes using microarray technology and clustering algorithms

• Step 2: Computational identification of transcription factors whose binding site signatures are significantly over-represented among promoters of co-expressed genes

R. Elkon, C. Linhart, Y. Shiloh

PRIMA - Results

Transcription factor   Enrichment factor   P-value
NF-κB                  5.1                 $3.8 \times 10^{-8}$
p53                    4.2                 $9.6 \times 10^{-7}$

Hypothesis: NF-κB and p53 mediate the late response to DNA damage.

Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Science 286 (Oct 1999) 531-537

Computational paper: Class Prediction and Discovery Using Gene Expression Data. Donna K. Slonim, Pablo Tamayo, Jill P. Mesirov, Todd R. Golub, Eric S. Lander. Proc. RECOMB 2000

ppt source: Elashof-Horvath UCLA course, Statistical Analysis of DNA Microarray Data, http://www.genetics.ucla.edu/horvathlab/Biostat278/Biostat278.htm

Background: Cancer Classification

• Cancer classification is central to cancer treatment
• Traditional cancer classification methods: by site, by morphology, etc.
• Limitations of morphology classification: tumors of similar histopathological appearance can have significantly different clinical courses and responses to therapy
• Traditionally, cancer classification relied on specific biological insights
• Challenges:
  – Finer classification of morphologically similar tumors at the molecular level
  – Systematic and unbiased approaches

Background: Cancer Classification (Continued)

Three challenges:
• Class prediction (classification): assignment of particular tumor samples to already-defined classes
• Feature selection: identify the most informative genes for prediction
• Class discovery: defining previously unrecognized tumor subtypes (= clusters)

Background: Leukemia

• Acute leukemia: variability in clinical outcome and subtle differences in nuclear morphology
• Subtypes: acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML)
• ALL subcategories: T-lineage ALL and B-lineage ALL
• Particular subtypes of acute leukemia have been found to be associated with specific chromosomal translocations
• No single test is currently sufficient to establish the diagnosis; it requires a combination of different tests in morphology, histochemistry, immunophenotyping, etc.
• Although usually accurate, leukemia classification remains imperfect and errors do occur

Objective

• Develop a systematic approach to cancer classification based on gene expression data from microarrays
• Use leukemia as a test case

Method: Biological Samples & microarrays

• Learning set: 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis
• Test set: 34 leukemia samples (24 bone marrow and 10 peripheral blood samples)
• RNA from cells hybridized to high-density Affymetrix oligo arrays (6817 human genes)

Feature selection

50 genes most highly correlated with the AML-ALL distinction:

Class predictor

• Prediction of new samples assigned 36 of 38 samples as either AML or ALL; the remaining 2 were uncertain.
• All predictions agree with the patients’ clinical diagnosis.