Bioinformatics 2 - Lecture 4
Gabriele Schweikert
University of Edinburgh
February 8, 2013
Gabriele Schweikert Bioinformatics 2 - Lecture 4 1
http://www.arthursclipart.org/medical/humanbody/page 01.html
XX -Seq
Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)
Gene regulation by transcription factor binding
Hobert, Science, 2008
Epigenomics
Marks, Nature Reviews Cancer, 2001
Introduction: ChIP-Seq
- Cross-linking
- DNA fragmentation
- Enrichment with specific antibody (ChIP)
- Profiling of enriched DNA (Seq)
Figure labels: DNA-binding protein, DNA, individual sequencing read (tag), read (tag) density
adapted from Kim and Park, 2011
ChIP-Seq analysis pipeline
Park, Nature Reviews Genetics, 2009
Differential profile analysis
compare binding profiles in different conditions/tissues
find regions which are significantly different between conditions A and B.
Two fundamentally different questions:
1 Is the level of enrichment at a given position different in two samples?
2 Can this difference be attributed to the difference in experimental conditions? I.e., are we confident that it is due to the experimental treatment and not due to fluctuations ("biological variation")?
→ We are more interested in answering the second question
→ Requires 'biological replicates'
→ We also need an input control ('non-ChIP genomic DNA') to account for sequencing bias
Pipeline: Differential profile analysis
1 quality control
2 alignment (BWA)
3 filtering (duplicates)
4 define regions of interest (peak calling)
5 strand shift correction
6 normalization
7 differential profile analysis
Strand shift
Park, Nature Reviews Genetics, 2009
Peak Calling
in general only a small fraction of the genome shows significant enrichment (binding)
discriminate true peaks in sequence coverage (protein binding sites) from the background
> 31 open source methods ('peak callers')
1 find overlapping extended reads
2 sliding window approaches
3 Gaussian kernel density estimators
4 look for bimodal peaks
Peak Callers
Wilbanks and Facciotti, 2010
Peak Calling / sliding window
great differences in results between peak callers
potentially use several peak callers
performance depends on type of peak
1 punctate peaks for most transcription factor binding sites
2 potentially, large extended peaks for histone modifications (e.g. H3K27me3)
alternatively use sliding windows for very extended regions
use fixed windows around annotated sites (e.g. +/- 2000 bp windows around transcription start sites for H3K4me3)
→ Output: a set of genomic regions
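The fixed-window option can be sketched in a few lines of Python. This is only an illustration, not a method from the lecture: it assumes a single chromosome, each read reduced to one integer position, and the function name `count_reads_in_windows` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_reads_in_windows(read_positions, sites, half_width=2000):
    """Count reads falling within +/- half_width bp of each annotated site.

    Assumes all positions lie on the same chromosome (an assumption of
    this sketch; real data needs one sorted list per chromosome).
    """
    positions = sorted(read_positions)
    counts = []
    for site in sites:
        # Binary search for the first and last read inside the window.
        lo = bisect_left(positions, site - half_width)
        hi = bisect_right(positions, site + half_width)
        counts.append(hi - lo)
    return counts


# Toy data: six read positions and two hypothetical transcription start sites.
reads = [100, 1500, 2500, 4100, 9000, 9500]
tss = [2000, 9000]
window_counts = count_reads_in_windows(reads, tss)
```

Each site then contributes one count per sample, which is the input the later normalization and testing steps work on.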
Strand shift correction
1 use cross correlation profiles to estimate fragment length
2 shift / extend reads on forward / reverse strand
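A minimal sketch of these two steps, under simplifying assumptions: reads are given as 5' start positions on one chromosome, the cross-correlation is left unnormalised (a raw overlap count rather than a Pearson correlation), and both function names are hypothetical.

```python
from collections import Counter


def estimate_fragment_length(forward_starts, reverse_starts, max_shift=300):
    """Return the shift (bp) at which forward and reverse tags overlap most."""
    forward = Counter(forward_starts)
    reverse = Counter(reverse_starts)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Unnormalised cross-correlation: forward tags at x are paired
        # with reverse tags at x + shift.
        score = sum(n * reverse[pos + shift] for pos, n in forward.items())
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


def shift_reads(forward_starts, reverse_starts, fragment_length):
    """Move each tag by half the fragment length towards the fragment centre."""
    half = fragment_length // 2
    shifted = [p + half for p in forward_starts]
    shifted += [p - half for p in reverse_starts]
    return sorted(shifted)


# Toy data: every reverse tag sits 150 bp downstream of a forward tag.
forward_starts = [100, 105, 110, 500]
reverse_starts = [250, 255, 260, 650]
fragment_length = estimate_fragment_length(forward_starts, reverse_starts)
centres = shift_reads(forward_starts, reverse_starts, fragment_length)
```

After shifting, tags from both strands pile up at the same positions, so the read density peaks at the presumed binding site instead of on either side of it.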
Normalization
sequencing depth (number of clusters) varies between samples → normalization
if sample A has been sampled deeper than sample B, counts are expected to be higher in A
can we use total number of reads per sample (library size)?
only works if we assume that the total number of molecules in the sample is the same
differential regions with high counts distort the ratio of total reads.
Normalization
Robinson and Oshlack, Genome Biology, 2010
Normalization (simple example)
Condition 1
                                  Blue  Yellow  Green   Red  Total
molecules in sample                400     600    400   600   2000
fraction                           0.2     0.3    0.2   0.3
nb of reads (exp 1)                200     300    200   300   1000
nb of reads (exp 2)                100     150    100   150    500
after lib normalization (exp 2)    200     300    200   300   1000
Condition 2
                                  Blue  Yellow  Green   Red  Total
molecules in sample                400     600    400   400   1800
fraction                          0.22    0.33   0.22  0.22
nb of reads (exp 3)                222     333    222   222   1000
after lib normalization (exp 3)    222     333    222   222   1000
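The table can be reproduced with a short Python sketch. It idealises sequencing as sampling reads exactly in proportion to the molecule fractions (no noise), and `scale_to_depth` is a hypothetical helper name.

```python
def scale_to_depth(counts, depth):
    """Rescale counts so that they sum to the given library size."""
    total = sum(counts.values())
    return {region: n * depth / total for region, n in counts.items()}


molecules_cond1 = {"Blue": 400, "Yellow": 600, "Green": 400, "Red": 600}
molecules_cond2 = {"Blue": 400, "Yellow": 600, "Green": 400, "Red": 400}

# Idealised read counts at a depth of 1000 reads (exp 1 and exp 3).
exp1 = scale_to_depth(molecules_cond1, 1000)
exp3 = scale_to_depth(molecules_cond2, 1000)

# exp 2 is the same sample as exp 1, sequenced at half the depth.
exp2 = {"Blue": 100, "Yellow": 150, "Green": 100, "Red": 150}
exp2_normalized = scale_to_depth(exp2, 1000)

# Apparent change between conditions after library-size normalization.
apparent_fold_change = {region: exp3[region] / exp1[region] for region in exp1}
```

Library-size scaling makes exp 2 match exp 1, as it should for the same sample. Between conditions, however, Blue, Yellow and Green appear to increase by about 11% although their molecule numbers are unchanged, and the drop in Red looks smaller (about 0.74) than the true ratio of 400/600.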
Normalization (Anders and Huber, 2010)
for each gene, divide the counts from sample A by the counts for sample B
this gives a per-gene estimate of the size ratio of sample A to sample B
use the median of all these ratios
what is the assumption we make about samples A and B?
the majority of events do not change between sample A and sample B
Normalization
                       Blue  Yellow  Green   Red
nb of reads (exp 1)     200     300    200   300
nb of reads (exp 3)     222     333    222   222
geometric mean          210     316    210   258
Determine normalization factor (count divided by geometric mean, then median per sample):
                                   Blue  Yellow  Green   Red  median
count / geometric mean (exp 1)     0.95    0.95   0.95  1.16    0.95
count / geometric mean (exp 3)     1.05    1.05   1.05  0.86    1.05
Counts after normalization:
                       Blue  Yellow  Green   Red
nb of reads (exp 1)     211     316    211   316
nb of reads (exp 3)     211     316    211   211
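The median-of-ratios calculation in these tables can be sketched as follows. The function name `size_factors` is hypothetical, and the sketch assumes every count is strictly positive (regions with a zero count would need to be excluded before taking logarithms).

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factor per sample.

    counts maps a sample name to a list of counts, one per region,
    in the same region order for every sample.
    """
    samples = list(counts)
    n_regions = len(counts[samples[0]])
    # Per-region geometric mean across samples.
    geo_means = [
        math.exp(sum(math.log(counts[s][i]) for s in samples) / len(samples))
        for i in range(n_regions)
    ]
    return {
        s: median([counts[s][i] / geo_means[i] for i in range(n_regions)])
        for s in samples
    }


# Regions in the order Blue, Yellow, Green, Red.
counts = {"exp1": [200, 300, 200, 300], "exp3": [222, 333, 222, 222]}
factors = size_factors(counts)
normalized = {s: [k / factors[s] for k in counts[s]] for s in counts}
```

Because the median ignores the one changing region (Red), the three unchanged regions end up with equal normalized counts in both experiments, which library-size scaling did not achieve.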
Simulation: Biological Replicates
Simulation: add big changes (-) (at promoters)
Normalization check
Total counts: 1 : 0.76 : 1.12 : 0.88
Differential peak calling
Clouaire et al., 2012
Differential peak calling
Which peaks are statistically significantly different?
→ Problem is related to the detection of differentially expressed genes in RNA-Seq
→ Current approaches are mostly adopted from RNA-Seq-based methods, e.g.
DESeq (Anders and Huber, 2010)
DBChIP (Liang and Keles, 2012)
DiffBind (Ross-Innes et al., 2012)
→ Peaks are represented by a single value: total counts
Differential Peak Calling
Challenges with count data from NGS
small number of replicates (mind you: these are large experiments, we look at hundreds of thousands of binding sites; however, each binding site is only tested a few times, usually <3!)
→ no rank-based or permutation methods
large dynamic range (0...10^6) between binding sites
distribution is discrete, positive, skewed
→ no (log-)normal model
Differential Peak Calling
to decide whether a peak is significantly different under one condition vs another we need to estimate the variance
→ variance estimated from comparing two replicates
→ variance depends strongly on the mean
→ share information between peaks
Anders and Huber, 2010
Differential Peak Calling
whenever things are counted, which distribution comes to mind?
Poisson distribution
for Poisson-distributed data, the variance is equal to the mean
→ in NGS data we observe overdispersion (greater variability than expected from this simple model!)
Types of noise
1 Shot noise
- unavoidable
- dominant for small peaks
- can be computed
2 Technical noise
- from sample preparation and sequencing
3 Biological noise
- differences between samples of the same condition
- dominant for high-count peaks
- can't be computed, needs to be estimated
The negative binomial distribution
Two-stage hierarchical process: Gamma distribution + Poisson
from Anders, BioC 2010
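The two-stage process can be simulated with the Python standard library. In this sketch the Gamma stage stands in for biological variation of the underlying rate and the Poisson stage for shot noise; the parameter values (mean 20, dispersion 0.25), the seed and the function names are arbitrary choices for illustration.

```python
import math
import random
from statistics import mean, variance


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_negative_binomial(mu, dispersion, rng):
    """Draw a rate from a Gamma distribution, then a count from Poisson(rate)."""
    # Gamma with shape 1/dispersion and scale dispersion*mu has mean mu
    # and variance dispersion * mu**2.
    rate = rng.gammavariate(1.0 / dispersion, dispersion * mu)
    return sample_poisson(rate, rng)


rng = random.Random(0)
mu, dispersion = 20.0, 0.25
nb_counts = [sample_negative_binomial(mu, dispersion, rng) for _ in range(20000)]
poisson_counts = [sample_poisson(mu, rng) for _ in range(20000)]

nb_mean, nb_var = mean(nb_counts), variance(nb_counts)
pois_mean, pois_var = mean(poisson_counts), variance(poisson_counts)
```

For the pure Poisson counts the variance stays close to the mean, while the Gamma-Poisson mixture has the theoretical variance mu + dispersion * mu^2 (here 20 + 0.25 * 400 = 120), which is the overdispersion seen in replicate data.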
Testing
Model:
The binding intensity (counts) for a given site in sample j stems from a negative binomial distribution with mean $s_j \mu_\rho$ and variance $s_j^2 v(\mu_\rho)$
$s_j$: relative size of library j (normalization factor)
$\mu_\rho$: mean value for condition ρ
$v(\mu_\rho)$: fitted variance for mean $\mu_\rho$
Null hypothesis:
The intensity of binding is not influenced by the experimental condition ρ: $\mu_{\rho_1} = \mu_{\rho_2}$
Model fitting
Estimate variance from replicates
Fit a line to get the variance-mean dependence v(µ) (local regression for a gamma-family generalized linear model)
Anders and Huber, 2010
Model fitting
For conditions A and B, add counts from all replicates: $K_{iA}$, $K_{iB}$
Consider $K_{iA}$, $K_{iB}$ as NB-distributed with moments as estimated and fitted
Calculate the probability of observing the actual sums or more extreme ones, conditioned on A = B.
DESeq, Anders and Huber, 2010
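A sketch of such a test in the spirit of DESeq, not its actual implementation: among all pairs (a, b) with the same total as the observed pair, sum the probabilities of those pairs that are no more likely than the observed one, and divide by the total probability of all pairs. The function names are hypothetical, and the null means and variances are taken as given inputs (in practice they come from the size factors and the fitted variance-mean dependence); each variance must exceed its mean.

```python
import math


def nb_pmf(k, mean, var):
    """Negative binomial probability of count k, parameterised by mean and variance."""
    p = mean / var                      # requires var > mean (overdispersion)
    r = mean * mean / (var - mean)
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def nb_exact_test(k_a, k_b, mean_a, var_a, mean_b, var_b):
    """P-value for the observed pair of summed counts, given their total."""
    total = k_a + k_b
    # Joint probability of every split (a, total - a) under the null.
    joint = [
        nb_pmf(a, mean_a, var_a) * nb_pmf(total - a, mean_b, var_b)
        for a in range(total + 1)
    ]
    observed = joint[k_a]
    # "As or more extreme": splits no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    extreme = sum(q for q in joint if q <= observed * (1.0 + 1e-9))
    return extreme / sum(joint)


# Two toy sites with identical null moments in both conditions.
p_balanced = nb_exact_test(50, 50, 50.0, 200.0, 50.0, 200.0)
p_skewed = nb_exact_test(100, 10, 55.0, 357.5, 55.0, 357.5)
```

A perfectly balanced site gives a p-value of 1 because no split is more probable than the observed one, whereas the strongly unbalanced site gives a small p-value.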
Correction for multiple testing
The false discovery rate (see lecture 1)
Defined as the expectation of the ratio of false positives (type I errors) to total positives (number of times the null is rejected)
Assume we are testing m hypotheses; the Benjamini-Hochberg procedure for a given FDR α works as follows:
1 Rank p-values in increasing order;
2 Find the largest k s.t. $p_k \le \frac{k}{m}\alpha$;
3 Reject all null hypotheses 1, ..., k
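The three steps translate directly into Python. The function name `benjamini_hochberg` and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Step 1: rank p-values in increasing order (keep original indices).
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step 2: find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Step 3: reject the hypotheses with the k_max smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(p_values, alpha=0.05)
```

Note that step 2 looks for the largest qualifying k, so a p-value that misses its own threshold is still rejected if a larger p-value further down the ranking meets its threshold.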
DESeq results: MA plot
Anders and Huber, 2010
Example: Differential oestrogen receptor binding
Example: The colors of Chromatin
Filion et al, 2010
Example: ENCODE