ChIP‐seq
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
ChIP‐Seq Analysis
Alignment
Peak Detection
Annotation Visualization
Sequence Analysis
Motif Analysis
Alignment
• ELAND
• Bowtie
• SOAP
• SeqMap
• …
Peak detection
• FindPeaks
• CHiPSeq
• BS‐Seq
• SISSRs
• QuEST
• MACS
• CisGenome
• …
Two common designs
• One sample experiment
contains only a ChIP’d sample
• Two sample experiment
contains a ChIP’d sample and a negative control sample
One sample analysis
A simple approach is the sliding window method.
A Poisson background model is commonly used to estimate the error rate ($k_i$ is the read count in window $i$):
$k_i \sim \mathrm{Poisson}(\lambda_0)$
Alternatively, people use Monte Carlo simulations.
Both are based on the assumption that the read sampling rate is constant across the genome.
Ji et al. Nat Biotechnol, 26: 1293-1300. 2008
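The sliding window test under a constant-rate Poisson background can be sketched as follows. This is a minimal illustration, not the CisGenome implementation: the function names, the window width, the step and the p-value cutoff `alpha` are all hypothetical choices, and λ0 is simply taken as the average number of reads per window.

```python
import math

def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam)."""
    cdf = 0.0
    term = math.exp(-lam)          # P(X = 0)
    for j in range(k):
        cdf += term
        term *= lam / (j + 1)      # P(X = j + 1) from P(X = j)
    return max(0.0, 1.0 - cdf)

def window_counts(read_starts, genome_len, width, step):
    """Count reads whose start falls in each window [s, s + width)."""
    counts = []
    for s in range(0, genome_len - width + 1, step):
        k = sum(1 for r in read_starts if s <= r < s + width)
        counts.append((s, k))
    return counts

def call_windows(read_starts, genome_len, width=100, step=25, alpha=1e-3):
    """Return (start, count, p-value) for windows enriched under Poisson(lam0)."""
    # Constant-rate assumption: the same expected count lam0 for every window.
    lam0 = len(read_starts) * width / genome_len
    calls = []
    for s, k in window_counts(read_starts, genome_len, width, step):
        p = poisson_sf(k, lam0)
        if p < alpha:
            calls.append((s, k, p))
    return calls
```

Because λ0 is shared by every window, any region with a locally elevated background rate will look significant under this test.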
The constant rate assumption does not hold!
Negative binomial model fits the data better!
$k_i \mid \lambda_i \sim \mathrm{Poisson}(\lambda_i)$
$\lambda_i \sim \mathrm{Gamma}(\alpha, \beta)$
Marginally, $k_i \sim \mathrm{NegBinom}(\alpha, \beta)$
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
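To see why the Gamma mixture gives heavier tails than a single Poisson, the marginal distribution can be written out directly. The sketch below assumes that β is the rate parameter of the Gamma distribution (the slide does not say whether β is a rate or a scale); under that assumption the marginal is P(k) = Γ(k+α) / (Γ(α) k!) · (β/(1+β))^α · (1/(1+β))^k with mean α/β. The function names are illustrative.

```python
import math

def negbinom_pmf(k, alpha, beta):
    """P(K = k) when K | lam ~ Poisson(lam) and lam ~ Gamma(shape=alpha, rate=beta)."""
    log_p = (math.lgamma(k + alpha) - math.lgamma(alpha) - math.lgamma(k + 1)
             + alpha * math.log(beta / (1.0 + beta))
             - k * math.log(1.0 + beta))
    return math.exp(log_p)

def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def upper_tail(pmf, k, *params, k_max=2000):
    """Approximate P(K >= k) by summing the pmf up to k_max - 1."""
    return sum(pmf(j, *params) for j in range(k, k_max))
```

With the same mean (for example α = 2, β = 1 against Poisson(2)), `upper_tail(negbinom_pmf, 10, 2.0, 1.0)` is much larger than `upper_tail(poisson_pmf, 10, 2.0)`, so a Poisson model would understate how often high counts arise by chance.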
FDR estimation based on Poisson and negative binomial model
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Read direction provides extra information
CisGenome procedure
Alignment
Exploration
FDR computation
Negative binomial model
Peak Detection
Post Processing: use read direction to refine peak boundary and filter low quality peaks
Two sample analysis
Reason: read sampling rates at the same genomic locus are correlated across different samples.
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
CisGenome two sample analysis
Alignment
$k_{1i}$, $k_{2i}$ (read counts in window $i$ for samples 1 and 2)
Exploration
$n_i = k_{1i} + k_{2i}$
$k_{1i} \mid n_i \sim \mathrm{Binom}(n_i, p_0)$
FDR computation
Peak Detection
Post Processing
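The conditional binomial model above can be turned into a per-window test along these lines. This is a sketch under stated assumptions: sample 1 is taken to be the ChIP sample and sample 2 the control, and when `p0` is not supplied it is estimated crudely as the overall fraction of reads from sample 1, which is a simplification and not necessarily how CisGenome estimates it. The function names are hypothetical.

```python
import math

def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def two_sample_pvalues(k1, k2, p0=None):
    """One-sided p-value per window for enrichment of sample 1 over sample 2.

    k1, k2: lists of read counts per window in sample 1 and sample 2.
    """
    if p0 is None:
        # Crude background estimate (assumption): share of all reads in sample 1.
        p0 = sum(k1) / (sum(k1) + sum(k2))
    # Conditional on n_i = k1_i + k2_i, test k1_i against Binomial(n_i, p0).
    return [binom_sf(a, a + b, p0) for a, b in zip(k1, k2)]
```

For example, `two_sample_pvalues([5, 5, 20, 4], [5, 6, 2, 5], p0=0.5)` gives a small p-value only for the third window, where 20 of 22 reads come from sample 1.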
A comparative study of ChIP‐chip and ChIP‐seq
• NRSF ChIP‐chip
2 ChIP + 2 Mock IP in Jurkat cells, profiled using Affymetrix Human Tiling 2.0R arrays.
• NRSF ChIP‐seq
ChIP + Negative Control in Jurkat cells, sequenced with the next generation sequencer made by Illumina/Solexa.
Intersection
Before post‐processing After post‐processing
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Signal correlation
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Visual comparison
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Comparison of peak detection results
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Are array specific peaks noise or signal?
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Effects of read number in ChIP‐seq
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Motif Analysis
Sequence motif – a pattern of nucleotide or amino acid sequences
DNA motif:
Sequences bound by the TF:
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG
Aligned TF sites:
123456789
TGGGTGGTC
TGGGTGGTA
TGGGAGGTC
TGGGTGGTG
TGAGTGGTC
TGGGTGGTC
Transcription Factor Binding Sites (TFBS)
Protein motif:
Motif representation
Consensus sequence
Example: CACSTG
Sequence Logo
Schneider & Stephens, Nucleic Acids Res. 18:6097‐6100 (1990)
Entropy (Shannon) – a measurement of uncertainty
The amount of uncertainty reduced by observing sequences is the amount of information (or information content) we obtained:
This is the height of each position in the logo plot.
Height of each nucleotide is proportional to its frequency
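As a rough illustration of the logo heights described above, the sketch below uses the standard Shannon entropy H = −Σ p_b log2 p_b for one motif position and takes the information content as 2 − H bits (2 bits being the maximum uncertainty over four nucleotides). It ignores the small-sample correction used in some logo tools, and the function names are illustrative.

```python
import math

def column_information(freqs):
    """Information content (bits) of one motif position.

    freqs maps each base to its frequency at that position (summing to 1).
    """
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(4) - entropy   # uncertainty before minus uncertainty after

def letter_heights(freqs):
    """Height of each base in the logo: frequency times information content."""
    ic = column_information(freqs)
    return {base: p * ic for base, p in freqs.items()}
```

A fully conserved position has 2 bits of information, a uniform position has 0 bits, and a position split evenly between two bases has 1 bit, with each of the two letters drawn 0.5 bits tall.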
Two questions in motif analysis
• Known motif mapping
Finding occurrences of a motif in nucleotide or amino acid sequences
• De novo motif discovery
Finding motifs that are previously unknown
Known motif mapping
• Consensus mapping
STEP 1: provide a motif (e.g. CACSTG = CAC[C,G]TG)
STEP 2: specify number of mismatches allowed (e.g. <=1)
STEP 3: scan the sequence
CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCT
m=3, no    m=1, yes
A useful tool: CisGenome (http://www.biostat.jhsph.edu/~hji/cisgenome)
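The three consensus mapping steps can be sketched as a simple scan. This is an illustration only, not CisGenome's code: it searches the forward strand only, the function name `find_consensus` is hypothetical, and the degenerate symbols follow the standard IUPAC nucleotide codes (for example S = C or G).

```python
# Standard IUPAC nucleotide codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_consensus(seq, consensus, max_mismatch):
    """Return (start, site, mismatches) for every window within max_mismatch.

    start is 0-based; only the forward strand is scanned.
    """
    width = len(consensus)
    hits = []
    for i in range(len(seq) - width + 1):
        site = seq[i:i + width]
        # A position mismatches when the base is not in the symbol's allowed set.
        m = sum(1 for base, sym in zip(site, consensus) if base not in IUPAC[sym])
        if m <= max_mismatch:
            hits.append((i, site, m))
    return hits
```

On the example sequence, `find_consensus("CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCT", "CACSTG", 1)` reports the single site CACATG, which differs from CAC[C,G]TG at one position.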
Known motif mapping
• Motif matrix mapping (CisGenome)
STEP 1: provide a motif and background model
STEP 2: specify a likelihood ratio cutoff (e.g. LR>=500)
STEP 3: scan the sequence
Background: θ0
     A    C    G    T
A   .3   .2   .2   .3
C   .2   .3   .3   .2
G   .2   .3   .3   .2
T   .3   .2   .2   .3
Motif: Θ
      1     2     3     4     5     6     7     8     9
A  0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G  0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T  1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
LR>500, yes    LR<500, no
• Another tool for matrix mapping
MAST (http://meme.sdsc.edu/meme/mast‐intro.html)
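A likelihood ratio scan with the motif and background matrices from the slide might look like the sketch below. Two points are assumptions rather than facts from the slide: the 4×4 background table is read as a first-order Markov transition matrix (row = previous base, column = current base), and the first base of each window is given probability 0.25. The function names are illustrative, and this is not the CisGenome or MAST implementation.

```python
# Motif matrix from the slide: MOTIF[base][j] is P(base) at motif position j + 1.
MOTIF = {
    "A": [0.00, 0.00, 0.17, 0.00, 0.17, 0.00, 0.00, 0.00, 0.17],
    "C": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.66],
    "G": [0.00, 1.00, 0.83, 1.00, 0.00, 1.00, 1.00, 0.00, 0.17],
    "T": [1.00, 0.00, 0.00, 0.00, 0.83, 0.00, 0.00, 1.00, 0.00],
}

# Background table from the slide, interpreted (assumption) as transition
# probabilities BACKGROUND[previous_base][current_base].
BACKGROUND = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def motif_likelihood(word):
    """P(word | motif): product of position-specific base probabilities."""
    p = 1.0
    for j, base in enumerate(word):
        p *= MOTIF[base][j]
    return p

def background_likelihood(word, start_prob=0.25):
    """P(word | background) under the first-order Markov reading of the table."""
    p = start_prob                      # assumption: uniform first base
    for prev, cur in zip(word, word[1:]):
        p *= BACKGROUND[prev][cur]
    return p

def scan(seq, cutoff=500.0):
    """Return (0-based start, site, LR) for windows with LR >= cutoff."""
    width = len(MOTIF["A"])
    hits = []
    for i in range(len(seq) - width + 1):
        word = seq[i:i + width]
        lr = motif_likelihood(word) / background_likelihood(word)
        if lr >= cutoff:
            hits.append((i, word, lr))
    return hits
```

Scanning the example sequence with the default cutoff returns the two sites TGGGTGGTC and TGGGAGGTC. Because the matrix contains exact zeros, any window that violates a fully conserved position gets LR = 0; in practice pseudocounts are often added to avoid this.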
De novo motif discovery
• Two major classes of methods:
1. Word enumeration
2. Matrix updating
Word enumeration
STEP 1: enumerate possible words;
STEP 2: count word occurrences;
STEP 3: compare observed word count with random expectation.
Example: Sinha & Tompa, Nucleic Acids Res. 30: 5549‐5560 (2002)
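The three word enumeration steps can be illustrated with a deliberately simplified scoring scheme. The sketch assumes an independent-base background estimated from the input sequences and a binomial approximation for the variance; it is not the statistic used by Sinha & Tompa, which is more careful about the background model and overlapping words. Only words that actually occur are scored, and the function name is hypothetical.

```python
import math
from collections import Counter

def word_zscores(seqs, k):
    """Return {word: (observed, expected, z)} for every k-mer seen in seqs."""
    # Background base frequencies estimated from the sequences themselves.
    base_counts = Counter("".join(seqs))
    total = sum(base_counts.values())
    freq = {b: c / total for b, c in base_counts.items()}

    # STEP 1 and 2: enumerate words and count occurrences.
    observed = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    n_windows = sum(max(0, len(s) - k + 1) for s in seqs)

    # STEP 3: compare observed counts with the random expectation.
    scores = {}
    for word, obs in observed.items():
        p = math.prod(freq[b] for b in word)   # chance a random window equals word
        expected = n_windows * p
        sd = math.sqrt(n_windows * p * (1 - p))
        z = (obs - expected) / sd if sd > 0 else 0.0
        scores[word] = (obs, expected, z)
    return scores
```

Sorting the result by z-score, for example `sorted(scores.items(), key=lambda kv: kv[1][2], reverse=True)`, lists the most over-represented words first.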
Matrix updating
• CONSENSUS (Stormo & Hartzell, PNAS, 86: 1183‐1187, 1990)
STEP 1: use all k‐mers in the first sequence as seeds;
STEP 2: find matches (often use best matches) of each seed in the second sequence;
STEP 3: update seed matrices, exclude matrices with low information content;
STEP 4: repeat steps 2 and 3 for all sequences.
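The four steps above can be sketched as a small greedy search. This is a simplified illustration in the spirit of the procedure, not the CONSENSUS program itself: "best match" is taken to mean the k-mer that maximizes the information content of the updated alignment, information content is computed without pseudocounts or background correction, and the `keep` parameter (how many alignments survive each round) is a hypothetical choice.

```python
import math
from collections import Counter

def information_content(sites):
    """Total information (bits) of a list of aligned, equal-length sites."""
    total = 0.0
    for column in zip(*sites):
        n = len(column)
        total += 2.0 + sum((c / n) * math.log2(c / n)
                           for c in Counter(column).values())
    return total

def kmers(seq, k):
    """All k-mers of seq in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def greedy_consensus(seqs, k, keep=10):
    """Return the most informative alignment found, one site per sequence."""
    # STEP 1: every k-mer in the first sequence seeds an alignment.
    alignments = [[w] for w in kmers(seqs[0], k)]
    # STEP 4: process the remaining sequences one at a time.
    for seq in seqs[1:]:
        extended = []
        for sites in alignments:
            # STEP 2: add the k-mer that gives the most informative alignment.
            best = max(kmers(seq, k),
                       key=lambda w: information_content(sites + [w]))
            extended.append(sites + [best])
        # STEP 3: keep only the alignments with the highest information content.
        extended.sort(key=information_content, reverse=True)
        alignments = extended[:keep]
    return alignments[0]
```

For instance, `greedy_consensus(["GGACGTACTT", "CACGTACAAG", "TTTTACGTAC"], 6)` recovers the shared word ACGTAC from all three sequences. The result depends on the order of the sequences, which is a known property of this kind of greedy search.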
Motif discovery – a mixture model method
Background: θ0
     A    C    G    T
A   .3   .2   .2   .3
C   .2   .3   .3   .2
G   .2   .3   .3   .2
T   .3   .2   .2   .3
Motif: Θ, W
      1     2     3     4     5     6     7     8     9
A  0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G  0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T  1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00
q = [q0, q1]
S: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
A: 000000000000001000000000000000000000000001000000000000000000000000000000
$f(A, \Theta, W, q \mid S, \theta_0) \propto f(S, A \mid \Theta, W, q, \theta_0)\, \pi(\Theta, W, q)$
Inference by iterative estimation/sampling, alternating between (Θ, W, q) and A
EM:
Lawrence and Reilly (1990)
Bailey and Elkan (1994), etc.
Gibbs Sampler:
Lawrence et al. (1993)
Liu (1994), Liu et al. (1995), etc.
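One half of the iteration, inferring the site indicators A when Θ, W and q are held fixed, can be sketched as follows. The sketch makes two simplifying assumptions that are not on the slide: each window is treated independently (overlaps between candidate sites are ignored), and the background is an independent uniform model with probability 0.25 per base, not the 4×4 table. The prior site probability `q1` and the function name are illustrative.

```python
# Motif matrix from the slide: MOTIF[base][j] is P(base) at motif position j + 1.
MOTIF = {
    "A": [0.00, 0.00, 0.17, 0.00, 0.17, 0.00, 0.00, 0.00, 0.17],
    "C": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.66],
    "G": [0.00, 1.00, 0.83, 1.00, 0.00, 1.00, 1.00, 0.00, 0.17],
    "T": [1.00, 0.00, 0.00, 0.00, 0.83, 0.00, 0.00, 1.00, 0.00],
}

def site_posteriors(seq, motif, q1, bg_prob=0.25):
    """P(A_i = 1 | S) for each window start i, with windows treated independently."""
    width = len(motif["A"])
    q0 = 1.0 - q1
    posteriors = []
    for i in range(len(seq) - width + 1):
        word = seq[i:i + width]
        like_motif = 1.0
        like_bg = 1.0
        for j, base in enumerate(word):
            like_motif *= motif[base][j]
            like_bg *= bg_prob          # assumption: uniform independent background
        # Bayes rule for a two-component mixture (motif vs. background).
        posteriors.append(q1 * like_motif / (q1 * like_motif + q0 * like_bg))
    return posteriors
```

On the sequence S with `q1 = 0.02`, only positions 14 and 41 (0-based) receive a posterior above 0.5, matching the two 1s in A. An EM procedure would plug these posteriors into re-estimating Θ and q, and a Gibbs sampler would draw A from them before resampling the parameters; both of those steps are omitted here.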
Cis-regulatory module discovery
(Zhou and Wong, PNAS 2004)
• Module structure: consider co-localization of motif sites.
[Figure: hierarchical mixture modeling of a module containing Motif 1, Motif 2 and Motif 3. Labels: background θ0, motif matrices Θ1, …, ΘK, mixture weights q0, q1, …, qK, states B and M, sequence S.]
K: # of motifs
Phylogenetic Footprinting
For example, exons are conserved due to selection pressure. Introns and intergenic regions are less likely to be conserved.
Phylogenetic footprinting & motif discovery
• Evolutionary model based approach
EMnEM (Moses et al. 2004)
PhyME (Sinha et al. 2004)
PhyloGibbs (Siddharthan et al. 2005)
Tree Sampler (Li and Wong, 2005)
…