ChIP seq - Departments

ChIP‐Seq

Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

ChIP‐Seq Analysis

Alignment

Peak Detection

Annotation Visualization

Sequence Analysis

Motif Analysis

Alignment

• ELAND

• Bowtie

• SOAP

• SeqMap

• …

Peak detection

• FindPeaks

• CHiPSeq

• BS‐Seq

• SISSRs

• QuEST

• MACS

• CisGenome

• …

Two common designs

• One sample experiment

contains only a ChIP’d sample

• Two sample experiment

contains a ChIP’d sample and a negative control sample

One sample analysis

A simple way is the sliding window method.

A Poisson background model is commonly used to estimate the error rate:

$k_i \sim \mathrm{Poisson}(\lambda_0)$

where $k_i$ is the read count in window $i$.

Alternatively, people use Monte Carlo simulations.

Both are based on the assumption that the read sampling rate is constant across the genome.
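The sliding window idea can be sketched as follows, under the constant-rate assumption stated above. This is only an illustration: the function names, the window width and step, the toy read positions, and the choice to estimate λ0 as the genome-wide average count per window are all assumptions, not CisGenome's implementation.

```python
import bisect
import math

def poisson_upper_tail(k, lam):
    """Return P(K >= k) for K ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(K = 0)
    cdf = term
    for j in range(1, k):      # add P(K = 1) ... P(K = k - 1)
        term *= lam / j
        cdf += term
    return max(0.0, 1.0 - cdf)

def sliding_window_counts(read_starts, genome_length, width, step):
    """Count reads whose start falls in [left, left + width) for each window."""
    starts = sorted(read_starts)
    counts = []
    for left in range(0, genome_length - width + 1, step):
        lo = bisect.bisect_left(starts, left)
        hi = bisect.bisect_left(starts, left + width)
        counts.append((left, hi - lo))
    return counts

def score_windows(read_starts, genome_length, width=100, step=50):
    """Attach a Poisson upper-tail probability to every window."""
    # Assumed estimate of lambda_0: average number of reads per window.
    lam0 = len(read_starts) * width / genome_length
    counts = sliding_window_counts(read_starts, genome_length, width, step)
    return [(left, k, poisson_upper_tail(k, lam0)) for left, k in counts]

# Toy data: evenly spread background reads plus one cluster near position 5000.
reads = list(range(0, 10000, 500)) + list(range(5010, 5020))
scored = score_windows(reads, genome_length=10000)
best = min(scored, key=lambda row: row[2])
print(best)  # (window start, read count, tail probability)
```

Windows with a small tail probability are candidate peaks; turning these probabilities into an error rate still requires a multiple-testing step such as the FDR estimate mentioned later in the slides.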

Ji et al. Nat Biotechnol, 26: 1293-1300. 2008

The constant rate assumption does not hold!

Negative binomial model fits the data better!

$k_i \mid \lambda_i \sim \mathrm{Poisson}(\lambda_i)$

$\lambda_i \sim \mathrm{Gamma}(\alpha, \beta)$

Marginally, $k_i \sim \mathrm{NegBinom}(\alpha, \beta)$
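To see why the gamma mixture matters, the sketch below compares the upper tail of a Poisson model with that of the marginal negative binomial at the same mean. It assumes the Gamma distribution is parameterized by shape α and rate β (the slides do not say which parameterization is used), so the mean is α/β; the parameter values are arbitrary.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam), lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def negbinom_pmf(k, alpha, beta):
    """Marginal P(K = k) when K | lam ~ Poisson(lam) and
    lam ~ Gamma(shape=alpha, rate=beta) (assumed parameterization)."""
    log_p = (math.lgamma(k + alpha) - math.lgamma(alpha) - math.lgamma(k + 1)
             + alpha * math.log(beta / (1.0 + beta))
             - k * math.log(1.0 + beta))
    return math.exp(log_p)

def upper_tail(pmf, k):
    """P(K >= k) computed as 1 minus the sum of pmf(0..k-1)."""
    return max(0.0, 1.0 - sum(pmf(j) for j in range(k)))

alpha, beta = 0.5, 1.0           # arbitrary illustrative values
mean = alpha / beta              # both models share this mean
pois_tail = upper_tail(lambda j: poisson_pmf(j, mean), 5)
nb_tail = upper_tail(lambda j: negbinom_pmf(j, alpha, beta), 5)
print(pois_tail, nb_tail)
```

With the same mean, the negative binomial puts more probability on high counts, so a Poisson model would call more background windows significant.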


FDR estimation based on Poisson and negative binomial model


Read direction provides extra information

CisGenome procedure

Alignment

Exploration

FDR computation

Negative binomial model

Peak Detection

Post Processing: use read direction to refine peak boundaries and filter low quality peaks

Two sample analysis

Reason: read sampling rates at the same genomic locus are correlated across different samples.


CisGenome two sample analysis

Alignment

$k_{1i}$: ChIP read count in window $i$

$k_{2i}$: control read count in window $i$

Exploration

$n_i = k_{1i} + k_{2i}$

$k_{1i} \mid n_i \sim \mathrm{Binom}(n_i, p_0)$

FDR computation

Peak Detection

Post Processing
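The conditional binomial model in the two sample procedure can be sketched as a per-window test. The slides do not say how p0 is estimated, so the default below (the overall ChIP fraction of reads) is a crude stand-in and an assumption; the function names and toy counts are also illustrative.

```python
import math

def binom_upper_tail(k, n, p):
    """Return P(K >= k) for K ~ Binom(n, p)."""
    return sum(math.comb(n, j) * p**j * (1.0 - p)**(n - j)
               for j in range(max(k, 0), n + 1))

def two_sample_pvalues(chip_counts, control_counts, p0=None):
    """Test each window for ChIP enrichment given its total read count."""
    if p0 is None:
        # Assumed placeholder: fraction of all reads that come from the ChIP sample.
        total_chip = sum(chip_counts)
        p0 = total_chip / (total_chip + sum(control_counts))
    pvalues = []
    for k1, k2 in zip(chip_counts, control_counts):
        n = k1 + k2
        # Windows with no reads carry no evidence either way.
        pvalues.append(binom_upper_tail(k1, n, p0) if n > 0 else 1.0)
    return p0, pvalues

chip = [2, 3, 20, 1]      # k_1i per window (toy values)
control = [2, 3, 2, 1]    # k_2i per window (toy values)
p0, pvalues = two_sample_pvalues(chip, control, p0=0.5)
print(pvalues)
```

Only the third window, where ChIP reads strongly outnumber control reads, gets a small probability; windows where both samples are equally enriched do not.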

A comparative study of ChIP‐chip and ChIP‐seq

• NRSF ChIP‐chip

2 ChIP + 2 Mock IP in Jurkat cells, profiled using Affymetrix Human Tiling 2.0R arrays.

• NRSF ChIP‐seq

ChIP + Negative Control in Jurkat cells, sequenced with the next generation sequencer made by Illumina/Solexa.

Intersection

Before post‐processing After post‐processing


Signal correlation


Visual comparison


Comparison of peak detection results


Are array specific peaks noise or signal?


Effects of read number in ChIP‐seq


Motif Analysis

Sequence motif – a pattern of nucleotide or amino acid sequences

DNA motif:

Each sequence below contains one site bound by the transcription factor (TF):

GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA

TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA

CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA

TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG

AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC

ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG

Transcription Factor Binding Sites (TFBS), one per sequence, aligned by position:

123456789

TGGGTGGTC

TGGGTGGTA

TGGGAGGTC

TGGGTGGTG

TGAGTGGTC

TGGGTGGTC
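The six aligned 9-mer sites on this slide are enough to build a position frequency matrix by counting bases column by column. The sketch below does that; the function name is illustrative, and no pseudocounts are added.

```python
from collections import Counter

# The six aligned binding sites shown on the slide.
SITES = [
    "TGGGTGGTC",
    "TGGGTGGTA",
    "TGGGAGGTC",
    "TGGGTGGTG",
    "TGAGTGGTC",
    "TGGGTGGTC",
]

def position_frequency_matrix(sites, alphabet="ACGT"):
    """Return {base: [frequency at position 1, 2, ...]} for aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [] for base in alphabet}
    for pos in range(width):
        column = Counter(site[pos] for site in sites)
        for base in alphabet:
            matrix[base].append(column[base] / len(sites))
    return matrix

pfm = position_frequency_matrix(SITES)
for base in "ACGT":
    print(base, " ".join(f"{freq:.2f}" for freq in pfm[base]))
```

Up to rounding, the result agrees with the motif matrix used later in the matrix mapping slides (for example, C has frequency 4/6 at position 9).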

Protein motif:

Motif representation

Consensus sequence

Example: CACSTG

Sequence Logo

Schneider & Stephens, Nucleic Acids Res. 18:6097‐6100 (1990)

Entropy (Shannon) – a measurement of uncertainty

The amount of uncertainty reduced by observing sequences is the amount of information (or information content) we obtained:

This is the height of each position in the logo plot.

Height of each nucleotide is proportional to its frequency
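A minimal sketch of the logo heights, assuming the usual definitions: entropy H = −Σ p_b log2 p_b over the four bases, and information content 2 − H bits relative to a uniform prior. It omits the small-sample correction used by Schneider & Stephens, and the function names are illustrative.

```python
import math

def column_information(freqs):
    """Information content (bits) of one column: log2(#symbols) - entropy."""
    entropy = -sum(p * math.log2(p) for p in freqs if p > 0)
    return math.log2(len(freqs)) - entropy

def logo_heights(column):
    """Height of each base in the logo: frequency times column information."""
    info = column_information(list(column.values()))
    return {base: p * info for base, p in column.items()}

# A perfectly conserved column carries 2 bits; a uniform column carries 0 bits.
print(column_information([1.0, 0.0, 0.0, 0.0]))
print(column_information([0.25, 0.25, 0.25, 0.25]))
print(logo_heights({"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}))
```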

Two questions in motif analysis

• Known motif mapping

Finding occurrences of a motif in nucleotide or amino acid sequences

• De novo motif discovery

Finding motifs that are previously unknown

Known motif mapping

• Consensus mapping

STEP 1: provide a motif (e.g. CACSTG = CAC[C,G]TG)

STEP 2: specify number of mismatches allowed (e.g. <=1)

STEP 3: scan the sequence

CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCTCGCCGGG

m=3, no; m=1, yes (m = number of mismatches between a scanned window and the consensus)
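The three steps can be sketched as a scan that counts mismatches against an IUPAC consensus. This is an illustration rather than CisGenome's code: the function names are made up, and only the forward strand is scanned.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_mismatches(consensus, window):
    """Number of positions where the window base is not allowed by the code."""
    return sum(base not in IUPAC[code] for code, base in zip(consensus, window))

def scan_consensus(sequence, consensus, max_mismatches=1):
    """Return (start, window, mismatches) for every window within the limit."""
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        m = count_mismatches(consensus, window)
        if m <= max_mismatches:
            hits.append((start, window, m))
    return hits

SEQ = "CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCTCGCCGGG"
hits = scan_consensus(SEQ, "CACSTG", max_mismatches=1)
for start, window, m in hits:
    print(start, window, f"m={m}")
```

In the slide's sequence, the window CACATG differs from CACSTG only at the S position, so it is reported with m=1.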

A useful tool: CisGenome (http://www.biostat.jhsph.edu/~hji/cisgenome)

Known motif mapping

• Motif matrix mapping (CisGenome)

STEP 1: provide a motif and background model

STEP 2: specify a likelihood ratio cutoff (e.g. LR>=500)

STEP 3: scan the sequence

Background: θ0

     A    C    G    T
A   .3   .2   .2   .3
C   .2   .3   .3   .2
G   .2   .3   .3   .2
T   .3   .2   .2   .3

Motif: Θ

     1    2    3    4    5    6    7    8    9
A 0.00 0.00 0.17 0.00 0.17 0.00 0.00 0.00 0.17
C 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.66
G 0.00 1.00 0.83 1.00 0.00 1.00 1.00 0.00 0.17
T 1.00 0.00 0.00 0.00 0.83 0.00 0.00 1.00 0.00

GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA

LR>500, yes; LR<500, no
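A sketch of the likelihood ratio scan using the two matrices from this slide. Two points are assumptions rather than facts from the slides: the 4×4 background table is read as a first-order Markov transition matrix (row = previous base, column = current base), and the first base of each window is scored with probability 0.25. The function names are illustrative and only the forward strand is scanned.

```python
# Motif matrix from the slide: MOTIF[base][position].
MOTIF = {
    "A": [0.00, 0.00, 0.17, 0.00, 0.17, 0.00, 0.00, 0.00, 0.17],
    "C": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.66],
    "G": [0.00, 1.00, 0.83, 1.00, 0.00, 1.00, 1.00, 0.00, 0.17],
    "T": [1.00, 0.00, 0.00, 0.00, 0.83, 0.00, 0.00, 1.00, 0.00],
}
# Background table from the slide, read as BACKGROUND[previous][current] (assumption).
BACKGROUND = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def likelihood_ratio(window):
    """P(window | motif) / P(window | background)."""
    motif_prob = 1.0
    for pos, base in enumerate(window):
        motif_prob *= MOTIF[base][pos]
    background_prob = 0.25            # assumed probability of the first base
    for prev, cur in zip(window, window[1:]):
        background_prob *= BACKGROUND[prev][cur]
    return motif_prob / background_prob

def scan_matrix(sequence, cutoff=500.0):
    """Return (start, window, LR) for windows whose LR reaches the cutoff."""
    width = len(MOTIF["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        lr = likelihood_ratio(window)
        if lr >= cutoff:
            hits.append((start, window, lr))
    return hits

SEQ = "GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA"
for start, window, lr in scan_matrix(SEQ):
    print(start, window, round(lr))
```

Because the matrix contains exact zeros, any window with a disallowed base gets LR = 0; in practice pseudocounts are often added to soften this.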

• Another tool for matrix mapping

MAST (http://meme.sdsc.edu/meme/mast‐intro.html)

De novo motif discovery

• Two major classes of methods:

1. Word enumeration

2. Matrix updating

Word enumeration

STEP 1: enumerate possible words;

STEP 2: count word occurrences;

STEP 3: compare observed word count with random expectation.

Example: Sinha & Tompa, Nucleic Acids Res. 30: 5549‐5560 (2002)
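The three word enumeration steps can be sketched as below. This is a deliberately simplified illustration, not the method of Sinha & Tompa: the random expectation assumes independent bases with frequencies estimated from the input, the z-score uses a binomial variance that ignores overlap between words, and sequences are assumed to contain only A, C, G and T.

```python
import math
from collections import Counter
from itertools import product

def word_enrichment(sequences, k):
    """Return (word, observed, expected, z) rows sorted by decreasing z."""
    observed = Counter()
    base_counts = Counter()
    n_windows = 0
    for seq in sequences:
        base_counts.update(seq)
        for i in range(len(seq) - k + 1):      # STEP 2: count word occurrences
            observed[seq[i:i + k]] += 1
            n_windows += 1
    total_bases = sum(base_counts[b] for b in "ACGT")
    freq = {b: base_counts[b] / total_bases for b in "ACGT"}
    rows = []
    for word in map("".join, product("ACGT", repeat=k)):   # STEP 1: enumerate words
        p = math.prod(freq[b] for b in word)
        expected = n_windows * p                            # STEP 3: random expectation
        sd = math.sqrt(expected * (1.0 - p))
        z = (observed[word] - expected) / sd if sd > 0 else 0.0
        rows.append((word, observed[word], expected, z))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows

rows = word_enrichment(["ACGTACGT", "TTTTACGT"], k=4)
for word, obs, exp, z in rows[:3]:
    print(word, obs, round(exp, 3), round(z, 1))
```

Enumerating all 4^k words is only practical for short words; longer or degenerate motifs are usually handled with the matrix-based methods that follow.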

Matrix updating

• CONSENSUS (Stormo & Hartzell, PNAS, 86: 1183‐1187, 1990)

STEP 1: use all k‐mers in the first sequence as seeds;

STEP 2: find matches (often use best matches) of each seed in the second sequence;

STEP 3: update seed matrices, exclude matrices with low information content;

STEP 4: repeat steps 2 and 3 for all remaining sequences.
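A greatly simplified greedy sketch of the four steps is shown below. It is not the CONSENSUS program: it keeps only the single best match per seed in each sequence, scores alignments by total information content without a small-sample correction, and the `keep` limit and function names are assumptions made for illustration. Every sequence must be at least k bases long.

```python
import math
from collections import Counter

def information_content(sites):
    """Sum over columns of (2 - entropy) bits for a list of aligned sites."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def greedy_consensus(sequences, k, keep=50):
    """Return candidate alignments (lists of sites), best first."""
    # STEP 1: every k-mer of the first sequence seeds one alignment.
    alignments = [[word] for word in kmers(sequences[0], k)]
    for seq in sequences[1:]:                     # STEP 4: loop over sequences
        extended = []
        for sites in alignments:
            # STEP 2: add the k-mer that gives the most informative alignment.
            best = max(kmers(seq, k),
                       key=lambda word: information_content(sites + [word]))
            extended.append(sites + [best])
        # STEP 3: keep only the alignments with the highest information content.
        extended.sort(key=information_content, reverse=True)
        alignments = extended[:keep]
    return alignments

toy = ["CCGATTACAGT", "TGGATTACACC", "AAGATTACATG"]
top = greedy_consensus(toy, k=7)[0]
print(top, information_content(top))
```

The result depends on the order of the sequences, which is one reason practical implementations track several matches per seed.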

Motif discovery – a mixture model method

Background: θ0

     A    C    G    T
A   .3   .2   .2   .3
C   .2   .3   .3   .2
G   .2   .3   .3   .2
T   .3   .2   .2   .3

Motif: Θ, W

     1    2    3    4    5    6    7    8    9
A 0.00 0.00 0.17 0.00 0.17 0.00 0.00 0.00 0.17
C 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.66
G 0.00 1.00 0.83 1.00 0.00 1.00 1.00 0.00 0.17
T 1.00 0.00 0.00 0.00 0.83 0.00 0.00 1.00 0.00

q = [q0, q1]

S: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA

A: 000000000000001000000000000000000000000001000000000000000000000000000000

$f(A, \Theta, W, q \mid S, \theta_0) \propto f(S, A \mid \theta_0, \Theta, W, q)\, \pi(\Theta, W, q)$

Inference by iterative estimation/sampling, alternating between (Θ, W, q) and A

EM:

Lawrence and Reilly (1990)

Bailey and Elkan (1994), etc.

Gibbs Sampler:

Lawrence et al. (1993)

Liu (1994), Liu et al. (1995), etc.

Cis-regulatory module discovery (Zhou and Wong, PNAS 2004)

• Module structure: consider co-localization of motif sites.

Hierarchical mixture modeling

K: # of motifs

[Figure: model diagram with background θ0 and motif matrices Θ1, …, ΘK (Motif 1, Motif 2, Motif 3), weights q0, q1, …, qK, states B and M with labels 1−r and r, and sequence S]

Phylogenetic Footprinting

For example, exons are conserved due to the selection pressure. Introns and intergenic regions are less likely to be conserved.

Phylogenetic footprinting & motif discovery

• Evolutionary model based approach

EMnEM (Moses et al. 2004)

PhyME (Sinha et al. 2004)

PhyloGibbs (Siddharthan et al. 2005)

Tree Sampler (Li and Wong, 2005)

…
