
Bioinformatics 2 - Lecture 4

Gabriele Schweikert

University of Edinburgh

February 8, 2013


http://www.arthursclipart.org/medical/humanbody/page 01.html


XX-Seq

Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)


Gene regulation by transcription factor binding

Hobert, Science, 2008


Epigenomics

Marks, Nature Reviews Cancer, 2001


Introduction: ChIP-Seq

- Cross-linking

- DNA fragmentation

- Enrichment with specific antibody (ChIP)

- Profiling of enriched DNA (Seq)

Figure labels: DNA-binding protein, DNA, individual sequencing read (tag), read (tag) density

adapted from Kim and Park, 2011


ChIP-Seq analysis pipeline

Park, Nature Reviews Genetics, 2009


Differential profile analysis

compare binding profiles in different conditions/tissues

find regions which are significantly different between conditions A and B.


Two fundamentally different questions:

1 Is the level of enrichment at a given position different in two samples?

2 May this difference be attributed to the difference in experimental conditions? I.e., are we confident that it is due to the experimental treatment and not due to fluctuations ('biological variation')?

→ We are more interested in answering the second question

→ Requires 'biological replicates'

→ We also need input control: 'non-ChIP genomic DNA', to account for sequencing bias

















Pipeline: Differential profile analysis

1 quality control

2 alignment (BWA)

3 filtering (duplicates)

4 define regions of interest (peak calling)

5 strand shift correction

6 normalization

7 differential profile analysis




Strand shift

Park, Nature Reviews Genetics, 2009


Peak Calling

in general only a small fraction of the genome shows significant enrichment (binding)

discriminate true peaks in sequence coverage (protein binding sites) from the background

> 31 open source methods ('peak callers')

1 find overlapping extended reads
2 sliding window approaches
3 Gaussian kernel density estimators
4 look for bimodal peaks
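As an illustration of the sliding window approach listed above, here is a minimal sketch. It assumes a uniform Poisson background estimated from the total read count, which real peak callers refine with input controls and local background estimates; the function names, window size, step and significance threshold are all hypothetical choices.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def sliding_window_peaks(read_positions, genome_length, window=200, step=50, alpha=1e-3):
    """Return (start, end, count) for windows with more reads than expected by chance."""
    # Expected reads per window if reads were spread uniformly (an assumption).
    lam = len(read_positions) * window / genome_length
    peaks = []
    for start in range(0, genome_length - window + 1, step):
        count = sum(1 for p in read_positions if start <= p < start + window)
        if poisson_sf(count, lam) < alpha:
            peaks.append((start, start + window, count))
    return peaks
```

Overlapping significant windows would normally be merged into one region; that step is left out here to keep the sketch short.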




Peak Callers

Wilbanks and Facciotti, 2010


Peak Calling / sliding window

great differences in results

potentially use several peak callers

performance depends on type of peak

1 punctate peaks for most transcription factor binding sites
2 potentially, large extended peaks for histone modifications (e.g. H3K27me3)

alternatively use sliding windows for very extended regions

use fixed windows around annotated sites (e.g. +/- 2000 bp windows around transcription start sites for H3K4me3)

→ Output: a set of genomic regions











Strand shift correction

1 use cross correlation profiles to estimate fragment length

2 shift / extend reads on forward / reverse strand
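The two steps above can be sketched as follows. This is a simplified stand-in for a cross correlation profile: it scores each candidate shift by the raw number of coinciding forward/reverse tag positions rather than a normalized correlation, and the function names and the `max_shift` bound are assumptions made for illustration.

```python
from collections import Counter

def estimate_fragment_length(fwd_starts, rev_ends, max_shift=300):
    """Return the shift (bp) at which forward and reverse tag positions overlap most."""
    fwd = Counter(fwd_starts)   # 5' ends of forward-strand reads
    rev = Counter(rev_ends)     # 5' ends of reverse-strand reads
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Unnormalized cross-product of the two strand profiles at this shift.
        score = sum(n * rev.get(pos + shift, 0) for pos, n in fwd.items())
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift

def shift_reads(fwd_starts, rev_ends, fragment_length):
    """Move tags on both strands by half the fragment length towards the binding site."""
    half = fragment_length // 2
    return sorted([p + half for p in fwd_starts] + [p - half for p in rev_ends])
```

After shifting, tags from both strands pile up around the presumed binding position instead of forming two offset peaks.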


Normalization

sequencing depth (number of clusters) varies between samples → normalization

if sample A has been sampled deeper than sample B, counts are expected to be higher in A

can we use total number of reads per sample (library size)?

only works if we assume that the total number of molecules in the sample is the same

differential regions with high counts distort the ratio of total reads.


Normalization

Robinson and Oshlack, Genome Biology, 2010


Normalization (simple example)

Condition 1

                                  Blue  Yellow  Green  Red   Total
molecules in sample               400   600     400    600   2000
fraction                          0.2   0.3     0.2    0.3
nb of reads (exp 1)               200   300     200    300   1000
nb of reads (exp 2)               100   150     100    150   500
after lib normalization (exp 2)   200   300     200    300   1000

Condition 2

                                  Blue  Yellow  Green  Red   Total
fraction                          0.22  0.33    0.22   0.22
nb of reads (exp 3)               222   333     222    222   1000
after lib normalization (exp 3)   222   333     222    222   1000
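The table can be reproduced with a few lines of Python. This is a minimal sketch of library-size normalization using the read counts from the example; the function name and the target total of 1000 are illustrative choices. The fractions suggest, although the slide text does not state it, that only Red differs between the two conditions.

```python
def library_size_normalize(counts, target_total=1000):
    """Scale counts so that they sum to target_total."""
    total = sum(counts.values())
    return {name: value * target_total / total for name, value in counts.items()}

exp1 = {"Blue": 200, "Yellow": 300, "Green": 200, "Red": 300}   # condition 1
exp2 = {"Blue": 100, "Yellow": 150, "Green": 100, "Red": 150}   # condition 1, half the depth
exp3 = {"Blue": 222, "Yellow": 333, "Green": 222, "Red": 222}   # condition 2

norm1 = library_size_normalize(exp1)
norm2 = library_size_normalize(exp2)   # matches exp 1: the replicate is corrected properly
norm3 = library_size_normalize(exp3)

# Ratio condition 2 / condition 1 after library-size normalization.
ratio = {name: norm3[name] / norm1[name] for name in exp1}
```

Here `ratio` is about 1.11 for Blue, Yellow and Green and about 0.74 for Red: the drop in Red inflates every other region, which is the distortion described above.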


Normalization (Anders and Huber. 2010)

for each gene divide counts from sample A by the counts for sample B

per gene estimate for the size ratio of sample A to sample B

use median of all these ratios

what is the assumption we make about sample A and B?

the majority of events do not change between sample A and sample B
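A minimal sketch of this median-of-ratios idea, written for any number of samples by taking ratios against the per-region geometric mean instead of against one reference sample. The function name and the rule of skipping regions with a zero count are assumptions for illustration, not taken from the lecture.

```python
import math
from statistics import median

def size_factors(samples):
    """samples: dict mapping sample name to a list of counts (same region order).
    Returns one size factor per sample: the median ratio to the geometric mean."""
    names = list(samples)
    n_regions = len(samples[names[0]])
    geo_means = []
    for i in range(n_regions):
        values = [samples[name][i] for name in names]
        if all(v > 0 for v in values):
            geo_means.append(math.exp(sum(math.log(v) for v in values) / len(values)))
        else:
            geo_means.append(None)   # regions with a zero count are skipped (assumption)
    return {
        name: median(samples[name][i] / g for i, g in enumerate(geo_means) if g is not None)
        for name in names
    }

counts = {"exp1": [200, 300, 200, 300], "exp3": [222, 333, 222, 222]}  # Blue, Yellow, Green, Red
factors = size_factors(counts)
normalized = {name: [c / factors[name] for c in values] for name, values in counts.items()}
```

With these counts the factors come out near 0.95 and 1.05, and after division Blue, Yellow and Green agree between the two experiments while Red still differs.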
















Normalization

                                  Blue  Yellow  Green  Red
nb of reads (exp 1)               200   300     200    300
nb of reads (exp 3)               222   333     222    222
geometric mean                    210   316     210    258

Determine normalization factor (reads divided by the geometric mean; the factor is the median across regions):

                                  Blue  Yellow  Green  Red   median
exp 1 / geometric mean            0.95  0.95    0.95   1.16  0.95
exp 3 / geometric mean            1.05  1.05    1.05   0.86  1.05

Counts after normalization:




Simulation: Biological Replicates






Simulation: add big changes (-) (at promoters)






Normalization check

Total counts: 1 : 0.76 : 1.12 : 0.88


Normalization check


Differential peak calling

Clouaire et al., 2012



Which peaks are statistically significantly different?

→ Problem is related to detection of differentially expressed genes in RNA-Seq

→ Current approaches mostly adopted from RNA-Seq based methods, e.g. DESeq (Anders and Huber, 2010)

DBChIP (Liang and Keles, 2012)
DiffBind (Ross-Innes et al., 2012)

→ Peaks are represented by a single value: total counts























Differential Peak Calling

Challenges with count data from NGS

small number of replicates (mind you: these are large experiments, we look at hundreds of thousands of binding sites, however each binding site is only tested a few times, usually < 3!)
→ no rank based or permutation methods

large dynamic range ($0 \ldots 10^6$) between binding sites

distribution is discrete, positive, skewed
→ no (log-)normal model















to decide whether a peak is significantly different under one condition vs another we need to estimate the variance

→ variance estimated from comparing two replicates

→ variance depends strongly on the mean

→ share information between peaks

Anders and Huber, 2010







whenever things are counted, which distribution comes into mind?

Poisson distribution

for Poisson-distributed data, the variance is equal to the mean

→ in NGS data we observe overdispersion (greater variability than expected from this simple model!)
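One quick way to see overdispersion is the variance-to-mean ratio of replicate counts for a single peak, which should be close to 1 under a Poisson model. The sketch below uses made-up counts (both lists are hypothetical, not data from the lecture) and a hypothetical helper name.

```python
from statistics import mean, variance

def dispersion_index(replicate_counts):
    """Sample variance divided by the mean; about 1 for Poisson-distributed counts."""
    m = mean(replicate_counts)
    return variance(replicate_counts) / m if m > 0 else float("nan")

poisson_like = [90, 112, 95, 108, 95]     # hypothetical counts, spread close to sqrt(mean)
overdispersed = [40, 180, 95, 60, 125]    # hypothetical counts with extra variability

index_poisson_like = dispersion_index(poisson_like)
index_overdispersed = dispersion_index(overdispersed)
```

Both lists have mean 100, but the second has a ratio far above 1. With fewer than three replicates per site such an estimate is very noisy, which is why information has to be shared between peaks.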




















Types of noise

1 Shot noise

- unavoidable
- dominant for small peaks
- can be computed

2 Technical noise

- from sample preparation and sequencing

3 Biological noise

- differences between samples of the same condition
- dominant for high-count peaks
- can't be computed, needs to be estimated


The negative binomial distribution

Two-stage hierarchical process: Gamma distribution + Poisson

from Anders, BioC 2010
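The two-stage process can be simulated directly with the standard library. In this sketch the Gamma stage stands for biological variation in the underlying rate and the Poisson stage for shot noise; the dispersion parameter `alpha`, the chosen mean and the sampler names are assumptions for illustration. Under this parameterization the expected variance is mu + alpha * mu^2.

```python
import math
import random
from statistics import mean, variance

def poisson_sample(lam, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def negative_binomial_sample(mu, alpha, rng):
    """Stage 1: rate ~ Gamma with mean mu and variance alpha * mu**2.
    Stage 2: count ~ Poisson(rate)."""
    rate = rng.gammavariate(1.0 / alpha, alpha * mu)   # shape, scale
    return poisson_sample(rate, rng)

rng = random.Random(0)
mu, alpha = 20.0, 0.25
draws = [negative_binomial_sample(mu, alpha, rng) for _ in range(20000)]
sample_mean, sample_var = mean(draws), variance(draws)
```

For these settings the sample mean should be near 20 and the sample variance near 20 + 0.25 * 400 = 120, well above the Poisson value of 20.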


Testing

Model: The binding intensity (counts) for a given site in sample $j$ stems from a negative binomial distribution with mean $s_j \mu_\rho$ and variance $s_j^2 v(\mu_\rho)$

$s_j$: relative size of library $j$ (normalization factor)
$\mu_\rho$: mean value for condition $\rho$
$v(\mu_\rho)$: fitted variance for mean $\mu_\rho$

Null hypothesis: The intensity of binding is not influenced by the experimental condition $\rho$: $\mu_{\rho_1} = \mu_{\rho_2}$




Model fitting

Estimate variance from replicates

Fit a line to get the variance-mean dependence $v(\mu)$ (local regression for a gamma-family generalized linear model)



Model fitting

For condition A and B, add counts from all replicates: $K_{iA}$, $K_{iB}$

Consider $K_{iA}$, $K_{iB}$ as NB-distributed with moments as estimated and fitted

calculate the probability of observing the actual sums or more extreme ones, under the null hypothesis that conditions A and B do not differ.

DESeq, Anders and Huber, 2010
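A minimal sketch of such a test for a single site, in the spirit of the procedure above but not the DESeq implementation itself. It assumes the null means and variances of the two summed counts have already been estimated (and that each variance exceeds its mean), conditions on the observed total, and treats every split that is no more probable than the observed one as "more extreme". All function and parameter names are hypothetical.

```python
import math

def nb_pmf(k, mu, var):
    """Negative binomial pmf parameterised by mean and variance (requires var > mu)."""
    p = mu / var
    r = mu * mu / (var - mu)
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

def nb_exact_test(k_a, k_b, mu_a, var_a, mu_b, var_b):
    """P-value: given the total k_a + k_b, the probability of any split
    that is at most as likely as the observed one under the null."""
    total = k_a + k_b
    probs = [nb_pmf(a, mu_a, var_a) * nb_pmf(total - a, mu_b, var_b)
             for a in range(total + 1)]
    observed = probs[k_a]
    # Small relative tolerance so that ties with the observed split are included.
    extreme = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return extreme / sum(probs)

p_balanced = nb_exact_test(50, 50, 50.0, 150.0, 50.0, 150.0)
p_skewed = nb_exact_test(10, 90, 50.0, 150.0, 50.0, 150.0)
```

A balanced split gives a p-value of 1, while a strongly skewed split with the same total gives a very small one.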


Correction for multiple testing

The false discovery rate (see lecture 1)

Defined as the expectation of the ratio of false positives (type I errors) to total positives (number of times the null is rejected)

Assume we are testing $m$ hypotheses; the Benjamini-Hochberg procedure for a given FDR $\alpha$ works as follows:

1 Rank p-values in increasing order;
2 Find the largest $k$ s.t. $p_k \le \frac{k}{m}\alpha$;
3 Reject all null hypotheses $1, \ldots, k$
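The three steps translate almost line by line into code. This is a small sketch with a hypothetical function name; it returns a rejection flag for each input p-value in the original order.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # step 1: rank p-values
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:              # step 2: largest k meeting the bound
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:                              # step 3: reject hypotheses 1..k
        rejected[idx] = True
    return rejected
```

Note that the procedure is step-up: a hypothesis whose own p-value misses its threshold is still rejected if a larger p-value further down the ranking meets its bound, as with `[0.03, 0.04, 0.045, 0.049]` at alpha = 0.05.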


DESeq results: MA plot



Example: Differential oestrogen receptor binding


Example: The colors of Chromatin

Filion et al, 2010


Example: ENCODE

