Tiling Array and ChIP-chip
Gene Regulation
[Figure: expression vs. no expression of genes X, Y and Z, regulated spatially and temporally; panels A, B and C]
Transcription Factors and Their Binding Sites
Transcription factors (TF): TF1, TF2
Transcription factor binding sites (TFBS): CCACCCAC, TAATAAAAT
[Figure: TF1 and TF2 bound to their sites in the sequence below]
TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACCACTGAATGAAATACAAAATCTATGTATGA...
Transcription factor binding motif
Sequences bound by the TF:
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG

Aligned binding sites (one per sequence, in the same order):
123456789
TGGGTGGTC
TGGGTGGTA
TGGGAGGTC
TGGGTGGTG
TGAGTGGTC
TGGGTGGTC

Count matrix:
  1 2 3 4 5 6 7 8 9
A 0 0 1 0 1 0 0 0 1
C 0 0 0 0 0 0 0 0 4
G 0 6 5 6 0 6 6 0 1
T 6 0 0 0 5 0 0 6 0
Transcription Factor Binding Sites (TFBS)
Motif (frequency matrix):
  1    2    3    4    5    6    7    8    9
A 0.00 0.00 0.17 0.00 0.17 0.00 0.00 0.00 0.17
C 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.66
G 0.00 1.00 0.83 1.00 0.00 1.00 1.00 0.00 0.17
T 1.00 0.00 0.00 0.00 0.83 0.00 0.00 1.00 0.00
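The step from aligned binding sites to a motif matrix can be sketched in a few lines of Python. This is only an illustration: the six 9-mers are the sites shown on the slides, while the function names `count_matrix` and `frequency_matrix` are made up for this example.

```python
# Aligned 9-mer binding sites taken from the slides.
sites = [
    "TGGGTGGTC",
    "TGGGTGGTA",
    "TGGGAGGTC",
    "TGGGTGGTG",
    "TGAGTGGTC",
    "TGGGTGGTC",
]

def count_matrix(aligned_sites):
    """Count each base at each position of equal-length aligned sites."""
    length = len(aligned_sites[0])
    counts = {base: [0] * length for base in "ACGT"}
    for site in aligned_sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts

def frequency_matrix(counts, n_sites):
    """Turn counts into per-position base frequencies."""
    return {base: [c / n_sites for c in row] for base, row in counts.items()}

counts = count_matrix(sites)
freqs = frequency_matrix(counts, len(sites))
for base in "ACGT":
    print(base, counts[base], [round(f, 2) for f in freqs[base]])
```

Note that 4/6 rounds to 0.67, whereas the slide prints 0.66 (presumably so that the last column sums to exactly 1.00).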
Finding motifs from co-regulated genes
(Roth et al., 1998; Hughes et al., 2000; etc.)
[Figure: expression of Gene 1, Gene 2, Gene 3, …, Gene N under Condition1 and Condition2; the upstream sequences of co-regulated genes Gene1, Gene2 and Gene3 are searched for a shared motif]
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
Motif discovery is difficult in mammalian genomes due to a low signal-to-noise ratio
[Figure: yeast: Gene1, Gene2, Gene3, each with a 100~1000 bp region; human: Gene1, Gene2, Gene3, each with a 10k~1000k bp region]
ChIP-chip
Genome Tiling Arrays
• Affymetrix genome tiling microarrays
– Tile the genome non-repeat regions
– Chr21/22 tiling (earlier version): 1 million probe pairs (PM & MM) at 35 bp resolution on 3 arrays
– Whole genome: 42 million PM probes on 7 arrays
PM CGACATTGATTCAAGACTACATACA
MM CGACATTGATTCTAGACTACATACA
[Figure: probes tiled along the chromosome]
By Xiaole Shirley Liu at Harvard
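The PM/MM pair on the slide differs only at the middle (13th) base of the 25-mer, where the mismatch probe carries the complementary base. A minimal sketch of that relationship, with `mismatch_probe` as a hypothetical helper name:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def mismatch_probe(pm):
    """Return the MM probe: the PM sequence with its middle base complemented."""
    mid = len(pm) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

pm = "CGACATTGATTCAAGACTACATACA"  # PM probe from the slide
mm = mismatch_probe(pm)
print(mm)
```

For the slide's PM probe this reproduces the MM sequence shown next to it.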
Genome Tiling Arrays (human genome)
Platform    # Arrays  # Probes / Array  # Total Probes  Probe Length  Probe Resolution                       Price
Affymetrix  7         6M                42.0M           25mer         35 bp                                  $2,000
Nimblegen   38        390K              14.8M           50mer         110 bp                                 $30,000
Agilent     21        244K              5.1M            60mer         300 bp in genes; 500 bp in intergenic  $11,000
ChIP-chip Array Hybridization
• Map high intensity probes back to the genome
• Locate TF binding location
[Figure: probe intensities along the chromosome, showing ChIP-DNA signal and noise]
Identify ChIP-enriched Region
• Controls: sonicated genomic Input DNA
• Often 3 ChIP and 3 Ctrl replicates are needed
[Figure: probe signal tracks for ChIP and Ctrl]
Other Applications
• Transcription factor binding (ChIP-chip)
• Chromatin modifications
• DNA methylation
• Transcriptome
• Nucleosome positioning
• Copy number variations
Back to ChIP-chip
Data Analysis
Raw data (ChIP, Control)
Preprocessing & Normalization
Peak Detection
Downstream Analyses
Mann-Whitney U-test for ChIP-region Detection
• Affy TAS, Cawley et al (Cell 2004):
– Each probe: rank probes (either PM-MM or PM) within [-500bp, +500bp] window
– Check whether sum of ChIP ranks is much smaller
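A windowed rank-sum test of this kind can be sketched as follows. This is not the TAS implementation: the function names, the data layout, the descending ranking (rank 1 = brightest probe, so that enrichment gives a small ChIP rank sum) and the normal approximation without tie correction are all assumptions made for illustration.

```python
import math

def rank_descending(values):
    """Average ranks with rank 1 assigned to the largest value."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # shared average rank for ties
        i = j + 1
    return ranks

def window_rank_sum_test(positions, chip, ctrl, center, half_width=500):
    """Pool probes within +/- half_width bp of center and test ChIP vs control.

    positions: probe coordinates; chip, ctrl: per-probe lists of replicate values.
    Returns (ChIP rank sum, one-sided p-value from a normal approximation).
    """
    chip_vals, ctrl_vals = [], []
    for pos, c, k in zip(positions, chip, ctrl):
        if abs(pos - center) <= half_width:
            chip_vals.extend(c)
            ctrl_vals.extend(k)
    n1, n2 = len(chip_vals), len(ctrl_vals)
    ranks = rank_descending(chip_vals + ctrl_vals)
    chip_rank_sum = sum(ranks[:n1])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (chip_rank_sum - mean) / sd
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z): small rank sum, small p
    return chip_rank_sum, p_value

positions = [0, 35, 70, 105]
chip = [[9, 10], [11, 12], [13, 14], [15, 16]]
ctrl = [[1, 2], [3, 4], [5, 6], [7, 8]]
rank_sum, p = window_rank_sum_test(positions, chip, ctrl, center=50)
print(rank_sum, p)
```

In practice a library routine such as `scipy.stats.mannwhitneyu` would handle ties and exact p-values; the pure-Python version is used here only to keep the sketch self-contained.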
TileMap (Ji and Wong, Bioinformatics 2005)
STEP 1: Compute a test statistic for each probe to summarize probe level information
STEP 2: Combine probe level test statistics of neighboring probes to help infer binding regions
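Probe level test statistic: empirical Bayes approach
Probes: $1, 2, 3, \ldots, I$
Sample variances (df): $s_1^2, s_2^2, s_3^2, \ldots, s_I^2$
Mean: $\bar{s^2} = \frac{1}{I} \sum_i s_i^2$
Sum of squares: $S = \sum_i \left[ s_i^2 - \bar{s^2} \right]^2$
Shrinkage factor: $\hat{B}$, computed from $I$, $df$, $\bar{s^2}$ and $S$
Variance shrinkage estimator: $\hat{\sigma}_i^2 = (1 - \hat{B})\, s_i^2 + \hat{B}\, \bar{s^2}$
Variance estimates: $\hat{\sigma}_1^2, \hat{\sigma}_2^2, \hat{\sigma}_3^2, \ldots, \hat{\sigma}_I^2$
A modified t-statistic: $\tilde{t}_i = \dfrac{\bar{x}_{i1} - \bar{x}_{i2}}{\hat{\sigma}_i \sqrt{\frac{1}{K_1} + \frac{1}{K_2}}}$
Probe level test statistics: $\tilde{t}_1, \tilde{t}_2, \tilde{t}_3, \ldots, \tilde{t}_I$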
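A minimal sketch of the shrinkage step and the modified t-statistic, under stated assumptions: the shrinkage factor B is passed in as a plain number rather than estimated (its estimator is not reproduced here), the per-probe sample variance is taken to be a pooled two-sample variance, and all function names are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_variance(x1, x2):
    """Pooled sample variance of two groups (assumed definition of s_i^2)."""
    k1, k2 = len(x1), len(x2)
    return ((k1 - 1) * variance(x1) + (k2 - 1) * variance(x2)) / (k1 + k2 - 2)

def shrink_variances(sample_vars, B):
    """sigma_i^2 = (1 - B) * s_i^2 + B * mean(s^2), applied to every probe."""
    s2_bar = mean(sample_vars)
    return [(1 - B) * s2 + B * s2_bar for s2 in sample_vars]

def modified_t(x1, x2, sigma2_hat):
    """Two-sample t-statistic that uses the shrunken variance estimate."""
    k1, k2 = len(x1), len(x2)
    return (mean(x1) - mean(x2)) / math.sqrt(sigma2_hat * (1 / k1 + 1 / k2))

# Toy data: three probes, two IP and two control replicates each.
ip = [[8.1, 8.3], [6.0, 6.1], [7.0, 7.6]]
ct = [[6.0, 6.2], [6.1, 5.9], [6.9, 7.1]]
s2 = [pooled_variance(a, b) for a, b in zip(ip, ct)]
sigma2 = shrink_variances(s2, B=0.5)  # B = 0.5 is an arbitrary illustrative value
t_tilde = [modified_t(a, b, v) for a, b, v in zip(ip, ct, sigma2)]
print(t_tilde)
```

With B = 0 this reduces to the canonical t-statistic; with B = 1 every probe shares the same variance estimate.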
Combining neighboring probes
TileMap (MA)
1. Compute the probe level test statistic t for each probe;
2. Compute a moving average statistic to measure enrichment;
3. Estimate FDR.
TileMap (HMM)
1. Compute the probe level test statistic t for each probe;
2. Estimate the distribution of t under H0 and H1;
3. Model t by a Hidden Markov Model, and decode the HMM.
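Step 2 of the MA variant can be sketched as a sliding mean over neighboring probes. The window here is defined by probe count (`half_window` probes on each side, truncated at the ends), which is an assumption for illustration; the FDR step and the HMM variant are not covered.

```python
def moving_average(t_stats, half_window):
    """Average each probe's statistic with up to half_window neighbors per side."""
    smoothed = []
    for i in range(len(t_stats)):
        lo = max(0, i - half_window)
        hi = min(len(t_stats), i + half_window + 1)
        smoothed.append(sum(t_stats[lo:hi]) / (hi - lo))
    return smoothed

t_stats = [0, 0, 3, 3, 3, 0, 0]  # toy probe-level statistics along a chromosome
ma = moving_average(t_stats, half_window=1)
print(ma)
```

Isolated high values are damped, while runs of consecutive high values keep a high average, which is why combining neighbors helps to separate binding regions from single-probe noise.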
Shrinking variance increases statistical power
[Figure: probe-level statistics compared: moving average of the t-statistic with variance shrinking, moving average of the canonical t-statistic, and Mean(X1)-Mean(X2). Panel labels: Peak 2 (180bp) transgenics; neural tube expression]
Comparisons between TileMap and previous methods
cMyc ChIP-chip Data: 6 IP + 6 CT1 + 6 CT2
Gold Standard: Using GTRANS and Keles’ method to analyze all 18 arrays
Test data: 4 arrays, 2 IP vs 2 CT1 (s2r2)
TileMap-HMM (Ji & Wong, 2005)
GTRANS or TAS (Kampa et al., 2004)
1. Set a window;
2. Perform a Wilcoxon signed rank test for each window.
Keles et al. (2004)
1. Compute a t-statistic t for each probe (no shrinking, two sample only);
2. Rank probes by a moving average.
Shrinking variance saves money
Using non-shrinking method (Keles’ method) to analyze all probes
Using shrinking method to analyze half of the probes, i.e., reduce information by half
MAT (Johnson W.E. et al. PNAS, 2006)
• Model-based Analysis of Tiling arrays for ChIP-chip
• Goal:
– Find ChIP-regions without replicates
– Find ChIP-regions without controls
– Find ChIP-regions without MM probes
– Can analyze data array by array
MAT
• Estimate probe behavior by checking other probes with similar sequence on the same array
• Probe sequence plays a big role in signal value
• Most of the probes in ChIP-chip measure non-specific hybridization
Probe Behavior Model
Baseline on number of Ts
A,C,G at each position of the 25mer
A,C,G,T Count Square
25mer Copy Number along the Genome
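The four groups of terms listed above can be turned into one row of a regression design matrix per probe. The sketch below is an assumption-laden illustration, not MAT's code: the function name `probe_features`, the ordering of the columns and the use of the logarithm of the copy number are choices made here. The coefficients would then be fitted by ordinary least squares against log(PM).

```python
import math

def probe_features(seq, copy_number):
    """Build the covariates of one 25-mer probe (intercept not included)."""
    row = [seq.count("T")]  # baseline on the number of Ts
    for base_at_pos in seq:  # A, C, G indicators at each of the 25 positions
        row.extend(1 if base_at_pos == b else 0 for b in "ACG")
    row.extend(seq.count(b) ** 2 for b in "ACGT")  # squared nucleotide counts
    row.append(math.log(copy_number))  # 25-mer copy number along the genome
    return row

row = probe_features("CGACATTGATTCAAGACTACATACA", copy_number=1)
print(len(row))
```

For a 25-mer this gives 1 + 75 + 4 + 1 = 81 covariates.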
Probe Standardization
• Fit the probe model array by array
• Divide array probes into bins (3k probes/bin)
• Background-subtraction and standardization (normalization) on a single array:
$t_i = \dfrac{\log(PM_i) - \hat{m}_i}{s_{i,\,\text{affinity bin}}}$
– $\log(PM_i)$: observed probe intensity
– $\hat{m}_i$: model predicted probe intensity
– $s_{i,\,\text{affinity bin}}$: observed probe variance within each bin
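The standardization t_i = (log(PM_i) − m̂_i) / s_bin can be sketched as below. Assumptions for this illustration: probes are grouped into bins after sorting by the model-predicted intensity, the scale s_bin is the sample standard deviation of the observed log(PM) values in the bin, every bin holds at least two probes, and the name `standardize` is hypothetical.

```python
from statistics import stdev

def standardize(log_pm, predicted, bin_size=3000):
    """Return t-values: (observed - predicted) scaled by the spread of each bin."""
    # Sort probe indices by predicted intensity so each bin holds similar probes.
    order = sorted(range(len(log_pm)), key=lambda i: predicted[i])
    t_values = [0.0] * len(log_pm)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        s_bin = stdev([log_pm[i] for i in members])  # needs >= 2 probes per bin
        for i in members:
            t_values[i] = (log_pm[i] - predicted[i]) / s_bin
    return t_values

# Toy example with a bin size of 2 instead of 3000.
log_pm = [1.0, 3.0, 10.0, 14.0]
predicted = [2.0, 2.0, 12.0, 12.0]
t_values = standardize(log_pm, predicted, bin_size=2)
print(t_values)
```

Because both the baseline and the scale come from probes on the same array, each array is standardized on its own.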
Eliminate Normalization
• Probe log(PM) values before and after standardization
• If normalize before model fitting
– Predicted same ChIP-regions, although less confident
ChIP-region Detection
• Window-based MATscore
– ChIP without Ctrl:
$\mathrm{MAT}(\text{region}) = \sqrt{n_{\text{ChIP}}}\; TM(t\text{'s in region})$
– TM: trimmed mean
– Multiple ChIP with multiple Ctrl:
$\mathrm{MAT}(\text{region}) = \sqrt{n_{\text{ChIP}}}\; \dfrac{TM(t\text{'s in ChIP}) - TM(t\text{'s in Input})}{\sigma_{\text{Input}}}$
– More probes, higher t values in ChIP, less variance (fluctuation): more confident
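For the ChIP-without-control case, the window score can be sketched as the trimmed mean of the t-values in the window multiplied by the square root of the number of probes. The 10% trimming on each side and the function names are assumptions for illustration only.

```python
import math

def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def mat_score(t_values_in_window, trim=0.1):
    """sqrt(n) * trimmed mean of the probe t-values in one window."""
    n = len(t_values_in_window)
    return math.sqrt(n) * trimmed_mean(t_values_in_window, trim)

window = [-50, 1, 2, 2, 2, 2, 2, 2, 3, 100]  # toy t-values with two outliers
score = mat_score(window)
print(score)
```

The trimming keeps single outlier probes from dominating the window, and the square-root factor rewards windows supported by more probes.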
[Figure: raw probe values at two spike-in regions with concentration 2X. Tracks: ChIP_1 Log(PM); Input_1 Log(PM)]
[Figure: sequence-based probe behavior standardization. Tracks: ChIP_1 t-value; Input_1 t-value]
[Figure: window-based neighboring probe combination for ChIP-region detection. Tracks: ChIP_1 MATscore; ChIP_1/Input_1 MATscore; 3 Reps ChIP/Input MATscore]
Statistical Significance of Hits
[Figure: MATscore distribution, with a background component and <1% enriched DNA]
• P-value and FDR cutoff:
– P-value from MATscore distribution
– Estimate negative peaks under the same P-value cutoff
– Regional FDR = #negative_peaks / #positive_peaks
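The regional FDR ratio can be sketched as below. The sketch assumes that a "negative peak" is a region whose score falls below the mirror image of the positive cutoff, and that the list holds one score per candidate region; `regional_fdr` is a hypothetical name.

```python
def regional_fdr(region_scores, cutoff):
    """Estimate FDR as (#regions <= -cutoff) / (#regions >= cutoff)."""
    positive = sum(1 for s in region_scores if s >= cutoff)
    negative = sum(1 for s in region_scores if s <= -cutoff)
    if positive == 0:
        return float("nan")  # no positive peaks, so the ratio is undefined
    return negative / positive

scores = [5, 6, 7, 8, -5.5, 1, -1, 0.5]  # toy region scores
fdr = regional_fdr(scores, cutoff=5)
print(fdr)
```

The idea is that true enrichment only produces positive peaks, so peaks of the same magnitude on the negative side indicate how many positive peaks could arise from background alone.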
MAT summary
• Open source Python: http://chip.dfci.harvard.edu/~wli/MAT/
• Runs faster than array scanner
• Can work with single ChIP, multiple ChIP, and multiple ChIP with controls, with increasing accuracy
– Use single ChIP on promoter arrays to test antibody and protocol before going whole genome
• Can identify individual failed samples
Benchmark for ChIP-chip Target Detection (Johnson D.S. et al. Genome Research, 2008)
• ENCODE Spike-in experiment: both amplified and un-amplified
– ChIP: 96 ENCODE clones, 2,4,8,...,256X enrichment + total chromatin DNA
– Input: total genomic DNA
• Blind test: Samples hybridized to different tiling arrays, predictions made before the key was released
Comparison of platforms
Comparison of algorithms
Combined Johnson D.S. et al. Genome Research 2008 with Ji H. et al. Nature Biotechnology 2008
Residual Probe Effects after MAT
TileProbe (Judy & Ji, Bioinformatics, 2009)
TileProbe vs. MAT (GLI3)
1IP 0CT 3IP 0CT
TileProbe vs. MAT (Oct4)
1IP 0CT 3IP 0CT
TileProbe vs. MAT (NRSF)
1IP 0CT 2IP 0CT
Motif enrichment
MBR: Microarray Blob Remover
xMAN: eXtreme MApping of oligoNucleotides
• http://chip.dfci.harvard.edu/~wli/xMAN
• xMAN maps ~42 M Affymetrix tiling probes to the newest human genome assembly in less than 6 CPU hours
– BLAST needs 20 CPU years; BLAT needs 55 CPU days
– Probe TCCCAGCACTTTGGGAGGCTGAGGC maps 50,660 times in the genome
• Can map long oligos, and paired tag high throughput sequencing fragments
• Store the copy number information of every probe
• xMAN filters tiling array probes to ensure one unique probe measurement per 1 kb, improves peak detection
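The notion of a probe's copy number can be illustrated with a naive k-mer table. This is not xMAN's algorithm, which is built for genome-scale speed; it is a toy sketch with hypothetical function names that counts exact matches of a probe and of its reverse complement in a short sequence.

```python
from collections import Counter

def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def kmer_counts(genome, k):
    """Count every k-mer on the forward strand of the genome."""
    counts = Counter()
    for i in range(len(genome) - k + 1):
        counts[genome[i:i + k]] += 1
    return counts

def probe_copy_number(probe, counts):
    """Exact-match copy number of a probe, counting both strands."""
    rc = reverse_complement(probe)
    if rc == probe:  # a palindromic probe would otherwise be counted twice
        return counts[probe]
    return counts[probe] + counts[rc]

genome = "ACGTACGTTT"  # toy sequence; a real run would use k = 25
counts = kmer_counts(genome, k=4)
print(probe_copy_number("ACGT", counts), probe_copy_number("AAAC", counts))
```

A table like this needs memory proportional to the number of distinct k-mers, which is one reason dedicated mapping tools are used at whole-genome scale.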
CisGenome (Ji H. et al. Nature Biotechnol., 2008)
Graphic User Interface
CisGenome Browser
Core Data Analysis Programs
CEAS: Cis-regulatory Element Annotation System
• Data Analysis Button for Biologists
http://ceas.cbi.pku.edu.cn