GeneExpression II: 1. Transcription Factor Binding Sites 2. Microarrays
26th May, 2010
Karsten Hokamp, Genetics Department
BI2010


TFBS prediction - Overview

• Introduction

• Methods

• Implementations

• Analyse 2kb upstream of eve


TFBS prediction - Introduction

• TFBS = DNA motifs
  = 5–20 bp long
  = variable
  = multiple occurrences/sites per gene
  = combination of activators and repressors

• cis-regulatory regions = clusters of TFBS, located from -20 kb to the first intron


TFBS prediction - Introduction

Example: MSE2 stripe for eve (D. melanogaster) (Janssens et al., 2006)

• understand transcriptional regulation
• infer regulatory networks


TFBS prediction - Methods

• De novo motif prediction (overrepresentation)

• Searching for known motifs

• Phylogenetic Footprinting/Shadowing

• Clustering of TFBSs

• Integration of external data sources (co-expression, structure)


TFBS prediction - Overview

(Figure: overview of TFBS prediction methods; Hannenhalli, 2008, Bioinformatics)


De novo motif prediction

• Search for over-represented motifs

• Frequency count

• Works well for yeast and prokaryotes

• Not so successful in higher organisms
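The frequency-count idea can be sketched as follows. This is a minimal illustration, not the method of any tool named in these slides: the function names, the pseudocount and the use of a separate background sequence set are assumptions made for the example.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def overrepresented(foreground, background, k, pseudocount=1.0):
    """Rank k-mers by relative frequency in the foreground vs. the background.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for k-mers that never occur in the background.
    """
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for kmer, n in fg.items():
        fg_freq = (n + pseudocount) / (fg_total + pseudocount)
        bg_freq = (bg.get(kmer, 0) + pseudocount) / (bg_total + pseudocount)
        scores[kmer] = fg_freq / bg_freq
    # Highest enrichment first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In practice the foreground would be the upstream regions of co-regulated genes and the background a matched set of other upstream regions; real motifs are variable, which is one reason plain counting struggles in higher organisms.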


Using motif databases

• Search for known motifs
• Position specific scoring matrix (PSSM) or position weight matrix (PWM)
• Databases:
  – TRANSFAC
  – JASPAR
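How a PWM is used to search a sequence can be sketched like this. The aligned sites in the usage note are made up for illustration and are not taken from TRANSFAC or JASPAR; the pseudocount and the uniform background of 0.25 per base are also assumptions.

```python
import math

def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Turn aligned binding sites of equal length into log2-odds weights per position."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        weights = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            weights[base] = math.log2(freq / background)
        pwm.append(weights)
    return pwm

def scan(sequence, pwm):
    """Score every window of an upper-case A/C/G/T sequence; return (offset, score) pairs."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits
```

For example, `scan("GGCTATAAAGG", build_pwm(["TATAAA", "TATAAT", "TATATA", "TACAAA"]))` gives its highest score to the window starting at offset 3. A real search would also scan the reverse strand and apply a score threshold.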


Phylogenetic-based methods

• Search for islands of highly conserved regions
• Footprinting: elements conserved across distant species
• Shadowing: elements conserved between closely related species
• Pros: increases specificity
• Cons: conservation is neither sufficient nor necessary


Practical:

• Try some tools on the 2 kb upstream sequence of D. melanogaster eve and compare with published results.
  – Alibaba (de novo)
  – Match (TRANSFAC)
  – MEME (de novo)
  – PROMO (TRANSFAC)
  – WeederH (phylogenetic footprinting)


Other tools:

• Many more tools available for download:
  – Sombrero
  – FootPrinter
  – PhyloGibbs

• Other Web-tools for groups of co-regulated genes:
  – RSAT
  – NestedMICA
  – WebMOTIFS


TFBS prediction - Conclusion:

• No single tool gives accurate results

• Combination of predictions from multiple tools might increase specificity

• Incorporate additional information for greater precision


Microarrays - Overview

• Introduction
• Data Generation
• Data Characteristics
• Diagnostic Plots
• Preprocessing
• Statistical Analysis


What is a microarray?

• A solid support onto which the sequences from thousands of different genes are immobilized

• Different probe types
  - short oligonucleotides
  - long oligonucleotides
  - cDNA

• Different array supports
  - glass slide
  - nylon membrane
  - silicon chip

• Each probe measures the expression of a single transcript


Microarrays – How do they work?

Affymetrix arrays: single colour

Workflow (from the slide diagram): uninfected cells + infected cells → RNA → reverse transcription, label with dye → cDNA → hybridize → Slide A, Slide B


Microarrays – How do they work?

Spotted arrays: two colour

Workflow (from the slide diagram): prepare sample (uninfected cells + infected cells) and prepare microarray → hybridize target to microarray


Microarray: Subgrids

• One pin per subgrid (printTip group, stratus)


Microarrays – Data Extraction

• How to get data from the slides into the computer?


Data Extraction – Scanning

Slide → Scanner → Images (TIFF)

Scanner settings:
- laser power
- sensitivity
- focus

Images (TIFF):
- channel 1 (green)
- channel 2 (red)
- composite (green, yellow, red)


Data Extraction – Quantification

Align grid, tag unreliable spots; the program assigns numbers representing the intensity of each spot.

Software:
- ImaGene
- GenePix
- ScanAlyze
- ...

Data file (FG = foreground, BG = background):

Spot ID   FG CH1   BG CH1   FG CH2   BG CH2   FL
GFP       1241     671      6707     713      1
PA0080    570      495      599      384      0
PA0080    691      632      667      651      0
PA0122    703      610      653      619      0
PA0122    708      598      695      602      0
...       ...      ...      ...      ...      ...


Quantification: Intensity Range

- area composed of pixels
- value range: 0 – 2^16 - 1
- value range: 0 – 65535
- saturation possible
- low intensities = noise


Data Generation – Summary

• RNA labelling and hybridization
• Array Scanning
• One image per channel
• Load into quantification software
• Flag flawed spots
• Extract values
• Text file with FG and BG intensities (per probe)


Microarrays – Sources of Variation

(Figure, source: www.tigr.org: Sample1 mRNA and Sample2 mRNA → RT, labelling with Cy3 and Cy5 → Cy3-cDNA and Cy5-cDNA → cDNA array → .tiff image files → raw data file with Cy3 intensity and Cy5 intensity.)

Systematic experimental error:
- wavelength dependent
- intensity dependent
- uneven hybridization gel
- print-tip variations
- background variations
- image processing algorithm-dependent


Microarrays – Sources of Variation

• Technical:
  – labelling
  – hybridization
  – slide quality
  – scanning
  – print-tip effect
  – quantification
  – experimenter

• Biological:
  – individual/strain/sample
  – environment
  – time point


Microarrays – Data Characteristics

• Intensities vs. ratios
• Natural scale vs. log scale


Intensities vs. Ratios

• Intensities:

        ch1     ch2
gene1   517     2100
gene2   3200    13000
gene3   3200    800
gene4   12000   3000

ratio = ch2 / ch1


Intensities vs. Ratios

• Ratios:

        ch1     ch2     ratio
gene1   517     2100    4.06
gene2   3200    13000   4.06
gene3   3200    800     0.25
gene4   12000   3000    0.25

ratio = ch2 / ch1
ratio > 0
ratio = 1 if ch1 = ch2


Intensities vs. Ratios

• Ratios
  – convey expression changes
  – hide base level differences

• But: absolute changes can be important, too!
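A small worked example, using the four genes from the intensity table, shows why ratios hide base-level differences: gene1 and gene2 share the same ratio although their absolute changes differ several-fold. The function name is chosen for this illustration only.

```python
# Intensities from the example table: (ch1, ch2) per gene.
intensities = {
    "gene1": (517, 2100),
    "gene2": (3200, 13000),
    "gene3": (3200, 800),
    "gene4": (12000, 3000),
}

def ratio_and_difference(ch1, ch2):
    """Return the ratio ch2 / ch1 and the absolute change ch2 - ch1."""
    return ch2 / ch1, ch2 - ch1

for gene, (ch1, ch2) in intensities.items():
    ratio, difference = ratio_and_difference(ch1, ch2)
    print(f"{gene}: ratio = {ratio:.2f}, absolute change = {difference}")
```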


Graphical Representation: Signal Scatter Plot

(Figure: scatter plot with CH1: Cy3 on the x-axis and CH2: Cy5 on the y-axis, axis ticks at 3000 and 18000, and a diagonal marking ratio = 1.)

Plotted spots:

        ch1     ch2
spot1   517     2100
spot2   3200    13000
spot3   3200    800
spot4   12000   3000


Graphical Representation: Signal Scatter Plot

(Figure: scatter plot of CH1: Cy3 against CH2: Cy5; the diagonal marks ratio = 1; annotation: ~10x.)


Graphical Representation: Histogram

(Figure: histogram of ratios; x-axis: ratios, with 1 marked; y-axis: frequency.)


Raw vs. Log ratios

• Log transformation of ratios

raw     log
0.1     -3.3
0.5     -1
1       0
2       1
10      3.3

x = base^y
8 = 2^3
0.125 = 2^-3
x = 2^y
y undefined for x <= 0
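The table above can be reproduced with a few lines of Python. This is only a sketch; the function name is an assumption, and base 2 is used as on the slide.

```python
import math

raw_ratios = [0.1, 0.5, 1, 2, 10]

def log_ratio(ratio):
    """Return log2 of a ratio; the logarithm is undefined for ratio <= 0."""
    if ratio <= 0:
        raise ValueError("log2 is undefined for ratios <= 0")
    return math.log2(ratio)

# Rounded to one decimal, as in the table.
log_ratios = [round(log_ratio(r), 1) for r in raw_ratios]
```

After the transformation a 4-fold increase and a 4-fold decrease lie at the same distance from 0 (log2(4) = 2 and log2(0.25) = -2), whereas on the raw scale they sit at 4 and 0.25.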


Log ratios: scatter plot

(Figure: two scatter plots. Natural scale: CH1: Cy3 against CH2: Cy5, diagonal at ratio = 1. Log scale: CH1: log2(Cy3) against CH2: log2(Cy5), diagonal at log-ratio = 0.)


Log ratios: histogram

(Figure: two histograms, one of ratios with 1 marked and one of log-ratios; y-axis: frequency.)


Microarrays – Data Characteristics

• ratios vs. intensities
  – convey expression changes
  – hide base level differences

• log ratios vs. raw ratios
  – reduce spread
  – provide symmetry


Diagnostic plots

• histogram
• scatter plot
• box plot
• MA plot
• chip visualization


Diagnostic plots – Histogram

(Figure: frequency histograms of log(CH1) and log(CH2) for a good and a bad slide.)


Diagnostic plots – Scatter plot

(Figure: scatter plots of an o.k. and a bad slide.)


Diagnostic plots – MA plot

• Rotate scatter plot by ~45 degrees:


Diagnostic plots – MA plot

• Mathematically:

M = log2(R) – log2(G)              (M for Minus)
A = 0.5 * ( log2(R) + log2(G) )    (A for Addition)
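The two formulas translate directly into code. In this sketch `red` and `green` stand for the R (Cy5) and G (Cy3) intensities of one spot; the function name is an assumption.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red) - math.log2(green)          # log ratio ("Minus")
    a = 0.5 * (math.log2(red) + math.log2(green))  # mean log intensity ("Addition")
    return m, a
```

A spot with equal intensities in both channels, such as `ma_values(1024, 1024)`, lands at M = 0, so unchanged genes scatter around the horizontal axis of the MA plot.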


Diagnostic plots – MA plot

(Figure: MA plot with A on the x-axis and M on the y-axis.)


2-fold cut-off
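On the log2 scale a 2-fold change corresponds to M = 1 (up) or M = -1 (down), so a 2-fold cut-off keeps spots with |M| >= 1. A minimal sketch, assuming ratio = ch2 / ch1 as defined earlier; the function name and the example values are made up for illustration.

```python
import math

def two_fold_changes(spots):
    """Return spots whose log2 ratio M satisfies |M| >= 1 (at least 2-fold up or down)."""
    selected = {}
    for name, (ch1, ch2) in spots.items():
        m = math.log2(ch2 / ch1)
        if abs(m) >= 1.0:
            selected[name] = "up" if m > 0 else "down"
    return selected
```

For instance, `two_fold_changes({"a": (100, 250), "b": (100, 150), "c": (400, 100)})` keeps "a" (up) and "c" (down) and drops "b", whose ratio of 1.5 is below the cut-off.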


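Unequal labeling efficiency

(Figure: labelling with Cy3 and Cy5 gives Cy3-cDNA and Cy5-cDNA; MA plot with M = log(R/G) on the y-axis and A = ½ log(RG) on the x-axis.)

Strong bias towards Cy3!

Dye Swap (Cy5 and Cy3 exchanged)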


Dye Swap

(Figure: cDNA from uninfected and infected cells is hybridized twice with the dye assignment exchanged, Cy5 vs Cy3 and Cy3 vs Cy5, and both slides are combined into a merged data set.)
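One common way to merge a dye-swap pair is to average the forward log ratio with the sign-flipped log ratio of the swapped slide. The sketch below assumes the dye bias adds the same offset b to M on both slides: the forward slide then measures t + b and the swapped slide -t + b, so half their difference recovers the true log ratio t. The slides do not state which merging method was used, so this is an illustration only.

```python
def merge_dye_swap(m_forward, m_swapped):
    """Combine the M values of one gene from a forward and a dye-swapped slide.

    Assumes an additive dye bias that is identical on both slides, so it
    cancels when the swapped value is subtracted.
    """
    return 0.5 * (m_forward - m_swapped)
```

With a true log ratio of 1.5 and a bias of 0.4 the slides would read 1.9 and -1.1, and `merge_dye_swap(1.9, -1.1)` returns approximately 1.5.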


Unequal labeling efficiency, Dye Swap

(Figure: labelling with Cy3 and Cy5 gives Cy3-cDNA and Cy5-cDNA; MA plots with M = log(R/G) on the y-axis and A = ½ log(RG) on the x-axis.)


Diagnostic plots – Box plot

Elements of a box plot:
- median
- lower quartile
- upper quartile
- inter-quartile range
- whiskers (1.5 times inter-quartile range)
- outliers
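The elements listed above can be computed with Python's standard library. This sketch uses `statistics.quantiles` with the "inclusive" method; plotting packages may use slightly different quartile definitions, and the dictionary keys are names chosen for this example.

```python
import statistics

def box_plot_stats(values):
    """Return median, quartiles, IQR, whisker ends and outliers for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that still lie within the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "median": median,
        "lower_quartile": q1,
        "upper_quartile": q3,
        "iqr": iqr,
        "whiskers": (min(inside), max(inside)),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }
```

Applied to the log ratios of one slide, or of one print-tip group, the median should sit near 0 if most genes are unchanged.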


Diagnostic plots – Box plot

(Figure: box plots of an o.k. and a bad slide.)


Diagnostic plots – Box plot (printtip)


Diagnostic plots – Chip visualization

(Figure: chip visualizations of a good and a bad slide.)


Diagnostic plots: Summary

• histogram
  – data distribution (intensities, ratios)

• scatter plot
  – dye effect, print-tip effect

• box plot
  – equal average ratio and distribution, print-tip effect

• MA plot
  – dye effect and intensity-dependent ratio

• chip visualization
  – spatial bias, scratches, bubbles, smears


Microarrays – Preprocessing

• Flagging
• Background correction
• Normalization
• Flawed slides: Discard and repeat


Microarrays – Flagging

• Skip or keep (but warn)
• e.g. skip low intensities and saturated spots
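A flagging rule of this kind might look as follows. The saturation value comes from the 16-bit intensity range (0 to 65535) described earlier; the `low_margin` threshold and the function name are hypothetical and would have to be tuned per experiment.

```python
SATURATION = 2**16 - 1  # 65535, the top of the 16-bit intensity range

def flag_spot(fg, bg, low_margin=50):
    """Return 'saturated', 'low' or 'ok' for one channel of one spot.

    low_margin is a hypothetical threshold: spots whose foreground exceeds
    the background by less than this amount are treated as noise.
    """
    if fg >= SATURATION:
        return "saturated"
    if fg - bg < low_margin:
        return "low"
    return "ok"
```

Whether flagged spots are skipped or kept with a warning is a separate decision; the function only labels them.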


Microarrays – Background correction

• Subtract background measurements from foreground intensities

• Brings intensities closer to zero, increases ratios:
  – example spot with five-fold upregulation: 500 / 100 = 5
  – subtract background (50) from both channels: 450 / 50 = 9

• Additional source of variance!
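The arithmetic of the example can be checked with a short function. The parameter names and the choice to return `None` when a corrected denominator is not positive are assumptions of this sketch.

```python
def background_corrected_ratio(fg_ch2, fg_ch1, bg_ch2, bg_ch1):
    """Return (FG ch2 - BG ch2) / (FG ch1 - BG ch1), or None if the denominator is <= 0."""
    corrected_ch1 = fg_ch1 - bg_ch1
    corrected_ch2 = fg_ch2 - bg_ch2
    if corrected_ch1 <= 0:
        # Background at or above foreground: no meaningful ratio.
        return None
    return corrected_ch2 / corrected_ch1

raw_ratio = 500 / 100                                            # 5.0
corrected_ratio = background_corrected_ratio(500, 100, 50, 50)   # 9.0
```

The same subtraction applied to a weak spot can make the denominator tiny or negative, which is one way background correction adds variance.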


Microarrays – Normalization

• Remove effects of intensity, dye bias, spatial bias or print-tip variations:
  – Global mean, median
  – Loess, lowess
  – Print-tip loess
  – 2D loess
  – Variance stabilization (VSN)
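The simplest method in the list, global median normalization, can be sketched in a few lines. It rests on the assumption that most genes do not change, so the median log ratio of a slide should be 0. The loess-based variants fit an intensity-dependent curve instead of a single constant and are not shown here.

```python
import statistics

def global_median_normalize(m_values):
    """Shift all log ratios of one slide so that their median becomes 0."""
    centre = statistics.median(m_values)
    return [m - centre for m in m_values]
```

For example, `global_median_normalize([0.5, 1.5, 2.5])` returns `[-1.0, 0.0, 1.0]`. A single shift cannot remove a dye bias that varies with intensity, which is what the loess methods address.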


Microarrays – Normalization

(Figure: MA plots, A on the x-axis and M on the y-axis, of raw data and after global mean, LOESS and printTip LOESS normalization.)


Microarrays – Normalization

(Figure: raw data compared with global mean, LOESS and printTip LOESS normalization.)


Microarrays – Discard and repeat

• Some slides turn out to be uncorrectable and need to be repeated (unless a sufficient number of replicates remains).

• Remember: bad data in = bad data out!


Microarrays – Statistical Analysis

• Replicates
• Variation
• t-tests
• multiple-testing correction
• gene lists


Statistical Analysis – Replicates

• Two types of repeats

• Technical:
  – multiple copies of probes on array
  – multiple repeats of hybridization (same RNA)

• Biological:
  – multiple hybridizations with RNA from multiple extractions

Need replicates to measure variation!


Statistical Analysis – Variation

• Biological variation different from technical
• Statistically incorrect to mix
• Important consideration for repeats: High confidence in results for
  a) one sample/patient/colony
  b) group of samples/patients/colonies

Prioritise biological repeats!


Statistical Analysis – t-tests

Different classes of samples:
- find genes that are affected by a treatment
- p-value = degree of evidence
- H0: expression does not change
- t-test requires at least 2 replicates, provides p-value for each gene
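For one gene the test statistic can be computed as below. The slide does not say which variant of the t-test is meant; this sketch uses Welch's unequal-variance form as an assumption. It returns the statistic and the degrees of freedom only; the p-value would be looked up in the t distribution, for example with a statistics package.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return Welch's t statistic and degrees of freedom for two groups of replicates.

    Each group needs at least 2 values, because the sample variance is
    undefined for a single replicate.
    """
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    se2_a = statistics.variance(group_a) / len(group_a)
    se2_b = statistics.variance(group_b) / len(group_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df
```

Here `group_a` and `group_b` would hold the log intensities (or log ratios) of one gene in the control and treated replicates. If all replicates in both groups are identical, the denominator is zero and the statistic cannot be computed.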


Statistical Analysis – multiple-testing correction

Carrying out t-tests on 10,000 genes: an average of 500 will have a p-value <= 0.05 by chance alone, even if no gene changes (10,000 × 0.05 = 500).

Methods for multiple-testing correction:
- Bonferroni (very strict)
- Benjamini-Hochberg false-discovery rate (FDR)
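Both corrections are short enough to write out. This is a plain-Python sketch of the standard adjusted p-value procedures; the function names are chosen for this example.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted
```

Genes whose adjusted p-value is at or below the chosen level (e.g. 0.05) are kept. Bonferroni controls the chance of any false positive at all, which is why it is so strict; Benjamini-Hochberg controls the expected proportion of false positives in the list.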


Statistical Analysis – Gene lists

• List of good candidate genes to follow up
• False positives (FP) vs false negatives (FN)
• Fold-change vs p-value

Choice depends on downstream analysis.

Input for downstream analysis: clustering, pathway analysis, enrichment, etc.


Analysis tools

• Stand-alone tools:
  – R
  – BioConductor
  – ArrayNorm
  – TM4
  – GeneSpring (commercial)

• Web-based tools:
  – ArrayPipe
  – ExpressYourself
  – GenePublisher
  – GEPAS
  – GeneTraffic (commercial)


Public Repositories

• ArrayExpress
  – EBI, MIAME-compliant

• Gene Expression Omnibus (GEO)
  – NCBI
  – "world's first write-only database"


Summary

• Many sources of variance
• Large numbers of replicates required for reliable results
• Data: be aware of flaws/bias
• Flagging/discarding results in data loss
• Correction often possible but can introduce artifacts

• However: Microarrays can still help to make great discoveries!


END
