ChIP-Seq: a method for studying epigenetic mechanisms
Oleg Shpynov
28.07.2018
Agenda
● Regulation
● TFs and histone modifications
● ChIP-Seq protocol
● Ultra-Low-Input ChIP-Seq
● MACS2 and SICER peak callers
● Semi-supervised approach to Peak Calling
● Human monocytes aging project results
Why study regulation?
● Evolution mainly works on regulatory, not protein-coding, DNA
● If we know regulation, we can find key spots in large pathways
● Next-generation sequencing!
http://massgenomics.org/2012/01/the-current-state-of-dbsnp.html
Primate to human: it’s in the regulation
● Gene structure and expression are
well conserved
● Gene coexpression is not
● The difference lies in gene regulation
[Figure: genes A, B, C, D compared between human and chimpanzee]
Proc Natl Acad Sci U S A. 2006 Nov 21;103(47):17973-8.
Chromosomes and chromatin
● Chromosomes are dense complexes of DNA and proteins
● Each human chromosome contains on average 5 cm of DNA
● This is about 2 m of DNA overall – too much!
● Chromatin = euchromatin + heterochromatin
https://www.shmoop.com/dna/dna-packaging.html
http://www.mun.ca/biology/scarr/FISH_chromosome_painting.html
ChromEMT
● In 2017, a new chromatin staining method made it possible to obtain
high-contrast electron tomography images of mitotic chromosomes
● Disordered structures 5 to 24 nm in diameter were observed
http://science.sciencemag.org/content/357/6349/eaag0025.long
Regulation of transcription
● Basal transcription: general transcription factors bind the
promoter and RNA polymerase II
● Activator proteins bind DNA regions called enhancers
● Enhancers are often located far from the promoter, so the DNA
has to loop to bring them into contact
https://courses.lumenlearning.com/suny-wmopen-biology1/chapter/eukaryotic-gene-regulation/
Transcription factors
● ~1,500 transcription factors in humans
● Binding motif represented by consensus sequence
● Master regulators exist but are not always known
● Functions:
○ stabilize/block RNAP II binding to DNA
○ catalyze histone acetylation or deacetylation
○ recruit coactivator or corepressor
http://www.broadinstitute.org/education/glossary/transcription-factor
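A consensus sequence can be illustrated with a minimal sketch: take the most frequent base at each position of a set of aligned binding sites. The sites below are invented for illustration and the function name `consensus` is hypothetical; real motifs are usually summarized as position weight matrices rather than a single string.

```python
from collections import Counter


def consensus(sites):
    """Return the most frequent base at each position of equal-length aligned sites."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must be non-empty in number and equal in length")
    result = []
    for column in zip(*sites):
        counts = Counter(column)
        # Ties are broken alphabetically so the output is deterministic.
        best = min(counts, key=lambda base: (-counts[base], base))
        result.append(best)
    return "".join(result)


# Hypothetical aligned binding sites, invented for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
print(consensus(sites))
```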
Chromatin Immunoprecipitation (ChIP)
http://www.bio.brandeis.edu/haberlab/jehsite/chIP.html
DNA-binding proteins are crosslinked to DNA with formaldehyde in vivo.
Isolate the chromatin. Shear DNA along with bound proteins into small fragments.
Bind antibodies specific to the DNA-binding protein to isolate the complex by precipitation. Reverse the cross-linking to release the DNA and digest the proteins.
Use PCR to amplify specific DNA sequences to see if they were precipitated with the antibody.
Who was first?
http://www.snarkyscientist.com/2013/06/19/the-history-of-the-biggest-technique-of-2009-who-invented-chip-seq/
ChIP-Seq
● The DNA library obtained after ChIP can be amplified and sequenced
● ChIP-Seq can be used for both transcription factors and histone
modifications, such as H3K4me3
Epigenetic regulation
http://en.wikipedia.org/wiki/Epigenetics
● Histone modifications
● DNA methylation
● Noncoding RNA
Histone tail modification
https://www.irbbarcelona.org/en/news/understanding-the-molecular-origin-of-epigenetic-markers
● Histone tails protrude from the nucleosome and can be recognized
● Chemical modifications of histones influence DNA accessibility
● Histone modifications can be written, read, and erased
Promoter histone marks
The EMBO Journal, 31, pp 3130–3146 (2012)
C. Xu et al, Nature Communications, 2(227), pp 1-8 (2011)
Mouse
heart
● Narrow peaks of H3K4me3 mark promoters
● The enzyme that methylates H3K4 binds only to promoters whose CpGs are unmethylated!
Enhancer histone marks
Bauer, D.E.; Kamran, S.C. et al, Science, 342(6155), pp 253-7 (2013)
Human
erythroblasts
● Enhancers are associated with H3K4me1 and H3K27ac
● H3K27ac is thought to distinguish active enhancers from poised ones
Transcription elongation marks
Mouse
proB cells
Proc Natl Acad Sci U S A, 107(50), pp 21931-6 (2010)
Inactive Active Active
● Elongation is marked by H3K36me3 and H3K79me2
Gene repression by chromatin marks
[Figure: H3K4me3 and H3K27me3 tracks in ES, NPC, and MEF cells]
Mikkelsen, T.S.; Ku, M. et al, Nature 448, pp 553-560 (2007)
Vastenhouw N.L.; Zhang Y. et al, Nature 464, pp 922-6 (2010)
● H3K27me3 marks suppressed genes poised to be activated
● Stem cells can have both H3K4me3 and H3K27me3 – unique!
Heterochromatin marks
http://medcell.med.yale.edu/histology/cell_lab/euchromatin_and_heterochromatin.php
● The marks H3K9me2 and H3K9me3 are strongly
associated with heterochromatin
● Both marks are bound by the protein HP1
Chromatin marks regulation (simplified)
Lee, T.I.; Young, R.A., Cell, 152(6), pp 1237-51 (2013)
Let’s study how regulation by histone modifications
changes with aging
Multiomics dissection of healthy human aging
5 marks x 40 donors
http://artyomovlab.wustl.edu/aging
Conventional vs Ultra-Low-Input ChIP-Seq
Conventional ChIP-Seq
+ Robust, well-adopted protocol
+ Good signal-to-noise ratio
+ Lots of high-quality data available for human
+ Guidelines and pipelines by ENCODE, Blueprint, etc.
– 2-5 million cells required per run
Ultra-Low-Input (ULI) ChIP-Seq
+ 100k cells required per run
– Difficult to process in the wet lab
– Worse signal-to-noise ratio than conventional ChIP-Seq
– Original protocol is for mice
– No high-quality data available for human
Conventional vs ULI ChIP-Seq
[Figure: signal tracks labelled Conventional and ULI]
ULI ChIP-Seq is always noisy
H3K4me3 - big variance in signal-to-noise ratio
ULI ChIP-Seq challenges
● High noise in the data due to ULI protocol
● High variance in signal-to-noise ratio
Peak calling - easy signal extraction problem
So many tools for an easy problem?
86 existing tools on the list*
* https://omictools.com/peak-calling-category
How to choose?
ENCODE ChIP-Seq pipeline
Problems
● MACS2 performs poorly on broad modifications
● Different signal-to-noise ratio in replicates
● Replicate concordance step fails
● IDR method works only for 2 replicates
https://www.encodeproject.org/pipelines/ENCPL272XAE/
Different tools = Different Data models
MACS2 - not good for broad marks
● Estimate fragment size to shift tags
● Estimate local λ for Poisson from control track (non-specific binding)
● Compute Poisson p-values and q-values,
merging close enriched locations into peaks
SICER - fails for TFs and narrow marks
● Uses coverage to estimate λ for two Poisson distributions
● Uses blacklist regions to overcome mappability issues
● Complicated procedure for scoring islands and detecting significance
Different Data models = Different cases
Modification                     Tool
TFs or NARROW histone marks      MACS2 or SPP or PeakSeg
BROAD or MIXED histone marks     SICER or PeakSeg or RSEG
Different cases = Different tools
https://github.com/olegs/bioinformatics/blob/master/chipseq/chipseq.pdf
Application for ULI ChIP-Seq data
MACS2, SICER peaks number
MACS2, SICER peaks length
Pros & Cons
Jaccard*
● Widely used
● Bad for shifts and enclosed regions
● Symmetric
Overlap
● Works with enclosed regions (A < B)
● Tolerant of shifts
● Non-symmetric
*Jaccard(A, B) = length(A intersect B) / length(A union B)
How to estimate consistency?
How to estimate consistency?
Overlap(A, B) = ⅓, Overlap(B, A) = 1
[Figure: regions A and B, one enclosed in the other]
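The two consistency measures can be sketched on interval lists. The slides give the Jaccard formula but do not define Overlap, so the definition below (the fraction of A's length covered by B) is an assumption chosen to reproduce the ⅓ and 1 values above. Peaks are assumed to be sorted, non-overlapping, half-open (start, end) pairs on one chromosome, and all function names are hypothetical.

```python
def total_length(intervals):
    """Sum of lengths of half-open (start, end) intervals."""
    return sum(end - start for start, end in intervals)


def intersection_length(a, b):
    """Length shared by two sorted lists of non-overlapping half-open intervals."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared += end - start
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def jaccard(a, b):
    """length(A intersect B) / length(A union B), as in the footnote formula."""
    inter = intersection_length(a, b)
    union = total_length(a) + total_length(b) - inter
    return inter / union if union else 0.0


def overlap(a, b):
    """ASSUMED definition: fraction of A's length that is covered by B."""
    length_a = total_length(a)
    return intersection_length(a, b) / length_a if length_a else 0.0


a = [(0, 300)]    # one long peak
b = [(100, 200)]  # a shorter peak enclosed in A
print(overlap(a, b), overlap(b, a), jaccard(a, b))
```

Under this assumption Overlap is non-symmetric, while Jaccard gives the same value for (A, B) and (B, A).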
MACS2, SICER pairwise peaks overlap
~400 points shown,
pairwise 20 vs 20
MACS2, SICER pairwise peaks overlap
These are the tracks with
low signal-to-noise ratio
~400 points shown,
pairwise 20 vs 20
Are standard tools not applicable,
or did we fail to choose the correct parameters?
Classical vs supervised approach
Peak callers optimization performance
Peak callers used in our study:
● PeakSeg - didn’t work out-of-the-box
● MACS2 --broad
● RSEG - was too slow
● SICER
https://academic.oup.com/bioinformatics/article/33/4/491/2608653
Semi-supervised approach
● Manually labelled dataset
● Parameter grid for each peak caller (MACS2, SICER, SPAN)
● Select the parameter combination that gives the lowest error rate on the labels
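The selection step can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: labels are simplified to (start, end, expect_peak) triples, peak sets for each parameter combination are assumed to have been produced beforehand by running the peak caller, and the names `label_errors` and `best_parameters` as well as all coordinates are hypothetical.

```python
def label_errors(peaks, labels):
    """Count labels whose expectation (peak / no peak) the peak set violates."""
    errors = 0
    for start, end, expect_peak in labels:
        # A label "has a peak" if any called peak overlaps the labelled region.
        has_peak = any(s < end and start < e for s, e in peaks)
        errors += has_peak != expect_peak
    return errors


def best_parameters(candidates, labels):
    """Return the parameter tuple whose peak set makes the fewest label errors."""
    return min(candidates, key=lambda params: label_errors(candidates[params], labels))


# Hypothetical manual labels: one region that must contain a peak, two that must not.
labels = [(300, 800, True), (0, 300, False), (800, 1200, False)]

# Hypothetical peak sets keyed by (FDR, gap); in practice each would come
# from one run of the peak caller with those parameters.
candidates = {
    (0.05, 0): [(100, 200), (300, 800), (900, 1000)],
    (0.01, 0): [(300, 500), (600, 800), (900, 1000)],
    (0.001, 0): [(300, 500), (600, 800)],
}
print(best_parameters(candidates, labels))
```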
● Preprocess input data
● Create 3 state HMM
● Train model by Baum-Welch
EM algorithm
● Compute posterior
probabilities
SPAN Peak Analyzer
Parameters
● Use q-values to control FDR
at level alpha
● Use gap to merge close
enriched positions
SPAN Peak Analyzer
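One standard way to turn posterior probabilities into q-values is sketched below. The slides do not say how SPAN derives its q-values, so this construction is an assumption: bins are ranked by their posterior probability of the null (unenriched) state, and the q-value of a bin is the mean null posterior over all bins ranked at least as confident. Gap merging is not shown. The function name and the input values are hypothetical.

```python
def qvalues_from_null_posteriors(null_posteriors):
    """q-value of a bin = mean null posterior over all bins at least as confident."""
    order = sorted(range(len(null_posteriors)), key=lambda i: null_posteriors[i])
    qvalues = [0.0] * len(null_posteriors)
    running_sum = 0.0
    for rank, index in enumerate(order, start=1):
        running_sum += null_posteriors[index]
        # Expected fraction of false discoveries among the top `rank` bins.
        qvalues[index] = running_sum / rank
    return qvalues


# Hypothetical per-bin posterior probabilities of the null state.
null_posteriors = [0.9, 0.01, 0.03, 0.5, 0.02]
alpha = 0.05
qvalues = qvalues_from_null_posteriors(null_posteriors)
enriched_bins = [i for i, q in enumerate(qvalues) if q <= alpha]
print(enriched_bins)
```

Keeping every bin with a q-value at or below alpha bounds the expected fraction of false discoveries among the kept bins by alpha, under the model.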
● Train models
● Visualize tracks in
genome browser
● Create visual labels:
500+ labels x 5 ChIP-seq
targets
● Optimize parameters in
a single click
● Consistent peak calling!
Semi-supervised scheme
Peaks with default parameters
http://artyomovlab.wustl.edu/aging/howto.html
Peaks with optimized parameters
Overall ChIP-Seq processing scheme
Peaks number consistency improved
* https://genome.cshlp.org/content/11/12/1975.full.html
Peaks length consistency improved
* http://bionumbers.hms.harvard.edu/bionumber.aspx?id=105336
Consistency between samples improved
Criticism
Q: Consistency is used as the quality function,
while the parameters are learned on the same markup
A: Only a small fraction of the genome is used;
labels are created only where we see consistency
visually
Validation: consistency with ENCODE improved
Validation: expected overlap between all experiments
No difference in core 5 histone marks found
No difference? Talk about variation!
Is it about regulation?
● No differences in 5 histone marks
● Difference in DNA methylation
DMRs are overrepresented in histone marks
[email protected]
research.jetbrains.org/groups/biolabs
Summary
● Histone modifications are a mechanism of regulation
● ChIP-Seq makes it possible to profile histone modifications
● ULI ChIP-Seq makes it possible to profile many modifications for the same donor
● MACS2 and SICER are not applicable to data with varying signal-to-noise ratio
● The semi-supervised approach produces high-quality results
● No changes in 5 core histone marks during HEALTHY aging of human monocytes
● Regulation? Potentially interesting changes in DNA methylation in enhancers
Thank you!
ENCODE project
● ENCODE = ENCyclopedia Of DNA Elements
● Pilot cost (2007): $55M; total to date: ~$300M
● RNA-Seq, ChIP-seq of major TFs and histone modifications, DNA methylation
● Series of publications in the Fall of 2012 (6 Nature papers, 30 papers overall)
http://www.sciencemag.org/content/337/6099/1159/F2.expansion.html
ENCODE project discoveries
● 400,000 enhancers and 70,000
promoters
● More than 90% of genomic variation
is in noncoding regions
● DNase I footprint is not that big
● mRNAs are more abundant in cytosol,
other RNAs – in the nucleus
● “More than 80% of human genome is
functionally active”
http://www.evolutionnews.org/2012/09/the_demise_of_j_1064061.html
ENCODE project criticism
● 80% of DNA cannot be truly functional, since
only about 10% (5-15%) is conserved
● This means ~70% of genome is either
○ impervious to deleterious mutations, or
○ does not mutate, or
○ does not have deleterious mutations
http://blogs.scientificamerican.com/guest-blog/2012/09/17/junk-dna-junky-pr/
Histone code hypothesis
Strahl, Allis, Nature 403(6), 2000, 41-45
● Concept similar to
genetic code
● Implies existence of
histone mark
combinations that
have specific
function
Main tools for genome segmentation
Jason Ernst lab - ChromHMM
Nat Methods 2012 Feb 28;9(3):215-6. doi: 10.1038/nmeth.1906
William Noble lab - Segway
Nat Methods 2012 Mar 18;9(5):473-6. doi: 10.1038/nmeth.1937
ChromHMM
● BED files are binarized using the selected chromatin marks
(present: 1, absent: 0)
● Marks are then grouped in a number of states – biologically meaningful
combinations of marks
● Transition probabilities describe moving between states; emission probabilities
describe how likely each state is to produce the observed marks
Nature 2011 May 5;473(7345):43-9. doi: 10.1038/nature09906
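The binarization step can be sketched as below, assuming each mark is given as a list of peak intervals on one chromosome. This is a simplified illustration, not ChromHMM's own code: the function name `binarize`, the coordinates, and the choice to mark a bin as present when any peak touches it are assumptions, while 200 bp matches ChromHMM's default bin size.

```python
def binarize(peaks, chrom_length, bin_size=200):
    """Return one 0/1 flag per bin: 1 if any peak touches the bin, else 0."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    flags = [0] * n_bins
    for start, end in peaks:
        first = start // bin_size
        # Clamp to the last bin in case a peak runs past the chromosome end.
        last = min((end - 1) // bin_size, n_bins - 1)
        for b in range(first, last + 1):
            flags[b] = 1
    return flags


# Hypothetical peak coordinates for two marks on a 1,000 bp toy chromosome.
h3k4me3 = binarize([(150, 450)], chrom_length=1000)
h3k27ac = binarize([(400, 900)], chrom_length=1000)

# One observation vector per bin: the presence/absence pattern across marks.
observations = list(zip(h3k4me3, h3k27ac))
print(observations)
```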
Genome annotation
● Segmentation
allows discovery
of novel elements,
alternative
promoters
● Here we find a
new non-coding
RNA
Nucleic Acids Res 2013 Jan;41(2):827-41. doi: 10.1093/nar/gks1284
Discovery of lncRNAs
Nature 2009 Mar 12;458(7235):223-7. doi: 10.1038/nature07672
● In 2008 few long noncoding RNAs were known, and they were often considered artifacts
● ChIP-Seq of H3K4me3/H3K36me3 revealed thousands of lincRNAs
Superenhancers
● There are an estimated 400,000 enhancers in the human genome
● Not all are active in every cell: an estimated 5,000 - 100,000 per cell type
● There are special types of enhancer elements called superenhancers
● They are enriched for Med1, H3K27ac, H3K4me1, and master TFs
Cell 2013 Apr 11;153(2):307-19. doi: 10.1016/j.cell.2013.03.035
MACS2
Step 1: estimating fragment length d
● Slide a window of size BANDWIDTH
● Find top regions with MFOLD enrichment of treatment vs input
● Use +/- strand cross correlation to estimate d
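The cross-correlation idea can be sketched in a few lines: shift the + strand tag positions by each candidate distance and count how many land on a − strand tag; the best shift is taken as d. This is a simplified illustration, not MACS2's implementation; the function name, the `max_shift` parameter, and the tag positions are hypothetical.

```python
def estimate_fragment_length(plus_starts, minus_starts, max_shift=500):
    """Return the shift d that maximises the match of + and - strand tag positions."""
    minus = set(minus_starts)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Count + strand tags that land on a - strand tag after shifting.
        score = sum((pos + shift) in minus for pos in plus_starts)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


# Hypothetical 5' tag positions around two binding sites, 150 bp apart per strand.
plus = [100, 105, 110, 1000, 1004]
minus = [250, 255, 260, 1150, 1154]
print(estimate_fragment_length(plus, minus))
```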
Step 2: identification of local noise parameter
● Slide a window of size 2*d across treatment and input
● Estimate λ for Poisson distribution
Step 3: identification of enriched regions
● Find regions with P-values < PVALUE
● Determine summit position inside enriched regions as max density
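Steps 2 and 3 can be sketched as a Poisson tail test on one window. The sketch assumes that λ is the larger of a genome-wide background rate and the control count in the same window, which simplifies the local-λ idea; the function names and the example counts are hypothetical. Very small tail probabilities lose precision in this form, so it is for illustration only.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_pvalue(treatment_count, control_count, background_rate):
    """p-value of seeing at least `treatment_count` tags under the noise model."""
    # ASSUMPTION: local noise is the larger of the two rate estimates.
    lam = max(background_rate, control_count)
    return poisson_sf(treatment_count, lam)


# Hypothetical windows: one strongly enriched, one indistinguishable from control.
print(window_pvalue(20, 3, 2.0))
print(window_pvalue(3, 3, 2.0))
```

A window would then be kept as enriched when its p-value falls below the PVALUE threshold.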
Step 4: Significance testing
● Swap treatment and control, call peaks using same PVALUE
Step 5: Broad peak calling
● Use PVALUE or BROAD-CUTOFF option to filter enriched peaks
● Compose broad regions of nearby enriched peaks
● Max length of region is 4*d
SICER
Step 1: detection of Islands
● Use coverage to estimate global λ values for the
Poisson distributions (treatment and control)
● Classify enriched windows
● Enriched windows are separated by gaps
● An island is a cluster of enriched windows
separated by gaps of at most GAP windows
Example: GAP = 2
Step 2+: scoring
● The scoring function is based on the probability of observing the tag count in a
random background
● Score of an enriched window = -ln P(m, lambda)
● The score of an island is the aggregated score of all enriched windows in the
island; it corresponds to the background probability of finding the observed
pattern
Score(I) = F* (Score(I1), Gap, Score(I2))
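Island detection and scoring can be sketched together. The sketch makes two assumptions: a window is enriched when its tag count reaches a fixed threshold, and the island score is the plain sum of the window scores -ln P(m, λ) with P the Poisson probability mass. Function names and counts are hypothetical, and this is not SICER's actual code.

```python
import math


def window_score(count, lam):
    """-ln P(count | Poisson(lam)); requires lam > 0."""
    log_p = -lam + count * math.log(lam) - math.lgamma(count + 1)
    return -log_p


def score_islands(counts, lam, threshold, gap):
    """Cluster windows with count >= threshold and sum their scores per island."""
    islands = []  # each entry: [first_window, last_window, score]
    for index, count in enumerate(counts):
        if count < threshold:
            continue
        score = window_score(count, lam)
        # Bridge the gap if at most `gap` unenriched windows separate the two.
        if islands and index - islands[-1][1] - 1 <= gap:
            islands[-1][1] = index
            islands[-1][2] += score
        else:
            islands.append([index, index, score])
    return [tuple(island) for island in islands]


# Hypothetical tag counts per window; with GAP = 2 windows 0, 1 and 4 form one island.
counts = [9, 8, 0, 1, 7, 0, 0, 0, 10, 12]
islands = score_islands(counts, lam=2.0, threshold=5, gap=2)
for first, last, score in islands:
    print(first, last, round(score, 2))
```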
Step N: significance testing
● Use control library as background to calculate p-value for islands
● Or use random background model to calculate p-values for islands
● Compute q-values from p-values
● Filter by p-value or by q-value (FDR)
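Converting p-values to q-values is commonly done with the Benjamini-Hochberg procedure. The slides do not say which correction SICER applies, so using it here is an assumption; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the original order of `pvalues`."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        qvalues[index] = running_min
    return qvalues


# Hypothetical island p-values.
island_pvalues = [0.001, 0.2, 0.01, 0.04, 0.5]
island_qvalues = benjamini_hochberg(island_pvalues)
kept = [i for i, q in enumerate(island_qvalues) if q <= 0.05]
print(kept)
```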