Lecture 5. Topics in Gene Regulation and Epigenomics (Prediction of Enhancers and Enhancer Targets)

The Chinese University of Hong Kong
CSCI5050 Bioinformatics and Computational Biology

CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015

Lecture outline

1. Cis-regulatory modules and enhancers
2. Prediction of enhancers
   – General
   – Context-specific
3. Prediction of enhancer targets
4. Experimental validations

Last update: 3-Oct-2015


CIS-REGULATORY MODULES AND ENHANCERS

Part 1

Regulatory sequence elements

• DNA sequence elements playing regulatory roles in transcription by interacting with DNA-binding proteins
  – Promoters: Initiating transcription
  – Enhancers: Enhancing transcription
    • Locus control regions (LCRs): Enhancing a set of linked genes
  – Silencers: Repressing transcription
  – Insulators: Setting gene boundaries, blocking promoter-enhancer interactions

The binding proteins

• There are different types of DNA-binding proteins
  – By specificity: sequence-specific vs. non-specific
  – By function: transcriptional regulation, DNA cleavage, DNA modification, etc.
• The regulatory elements are bound by proteins generally called transcription factors (TFs)
  – The binding sites of the TFs are called transcription factor binding sites (TFBSs)

The binding proteins

• More specific (and a bit confusing) names of particular subtypes of TFs:
  – Enhancers are bound by activators
  – Silencers are bound by repressors
  – Insulators are commonly bound by a protein called CTCF (CCCTC-Binding Factor)
  – (Promoters are bound by transcription factors and RNA polymerase, but the polymerase itself is not counted as a transcription factor)
• In many cases, a TF is a protein complex
  – At least one subunit of the complex contains a DNA-binding domain
  – Other subunits do not directly bind DNA (and are called co-activators, for example)

Recognition of sequence elements

• How does a TF decide where to bind?
  – Where DNA is accessible
  – Where there are special signals on the DNA (e.g., lack of methylation) and the surrounding proteins (e.g., histone modifications)
  – Where the DNA structure is suitable
    • Minor groove shape, propeller twist, etc.
  – Where the DNA sequence is suitable
    • Motifs that are usually short (e.g., 6-10bp)
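The sequence criterion above (short motifs of roughly 6-10bp) is commonly evaluated with a position weight matrix. The following is a minimal sketch, not a method taken from these slides: the count matrix `COUNTS`, the uniform background of 0.25, the pseudocount, and the score threshold are all illustrative assumptions.

```python
import math

# Hypothetical count matrix for a 6-bp motif with consensus TGACGT
# (made-up values: 10 aligned sites, all matching the consensus).
COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
]


def build_pwm(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, site, score) for every window whose summed log-odds score reaches the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits
```

A call such as `scan("AATGACGTCC", build_pwm(COUNTS), threshold=8.0)` reports the single exact match starting at position 2. Accessibility, methylation, and DNA shape, the other criteria on this slide, are not modelled here.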

Recognition of sequence elements

Image credit: Papavassiliou, Molecular medicine Today 4(8):358-366, (1998)

Effects of regulatory elements

• No TF binding: only basal expression of gene A
• TF1 binding enhancer (at limbs): elevated expression of gene A
• TF2 binding enhancer (at brain): elevated expression of gene A
• TF3 binding silencer: expression of gene A inhibited
• CTCF not binding insulator: binding of TF1 at enhancer can affect both gene A and gene B

Image credit: Sholtis and Noonan, Trends in Genetics 26(3):110-118, (2010)

Locations of regulatory elements

Image credit: Maston et al., Annual Review of Genomics and Human Genetics 7:29-59, (2006)

Enhancers

• Enhancers have been a major research focus in the past few years for several reasons:
  – Very incomplete catalog, difficult to locate them
    • Can be upstream or downstream of target gene
    • Can be relatively close (within kilobases) or far away (megabases) from target gene, estimated average: ~100kb
  – Context-specific
  – Availability of large-scale experimental methods and data
    • Identification of enhancers
    • Validation of enhancers
  – Annotation of disease-associated non-coding variants
  – Discovery of super enhancers

Cis-regulatory modules

• A cis-regulatory module (CRM) is a module of multiple sequence elements that regulate the expression of genes nearby (“in cis”)
• It is sometimes used as a synonym for enhancer
• However, based on the precise definitions:
  – CRMs can also include other cis-acting regulatory elements
  – Enhancers may not always function in cis (some enhancers regulate very distal genes in trans)
  – An enhancer does not necessarily constitute a module of multiple TF binding sites

Function of enhancers

• Enhancer-promoter looping
  1. DNA
  2. Enhancer
  3. Promoter
  4. Gene
  5. Transcriptional activator/co-activator
  6. Mediator
  7. RNA polymerase

Image credit: Jon Cheff (Wikipedia)

More about DNA looping

(a) Intragenic loops joining the 5′ and 3′ ends of genes may allow recycling of RNA Pol II and facilitate maintenance of transcriptional directionality.
(b) Enhancer-promoter loops, mediated by sequence-specific transcription factors and possibly assisted by noncoding RNAs or by general DNA binding factors such as CTCF and cohesin, lead to transcriptional activation.
(c) Loops between Polycomb-bound regions (PREs) and promoters prevent RNA Pol II recruitment and/or impair transcriptional elongation of promoter-bound RNA polymerases.
(d) Insulator-mediated loops may segregate individual loci containing the coding part of the gene and its regulatory regions from the surrounding genome landscape with other regulatory elements.

Image credit: Cavalli and Misteli, Nature Structural and Molecular Biology 20(3):290-299, (2013)

More about DNA looping

• “Loop gene” vs. “anchor gene”: Up-regulation of anchor genes > loop genes > non-interacting genes

Image credit: Fullwood et al. Nature 462(7269):57-64, (2009)

Signatures of enhancers

• General for active/functional DNA:
  – Evolutionary conservation
  – Open chromatin
• General for regulatory elements:
  – Containing TFBS
    • Cluster of TFBS
• Specific to enhancers:
  – P300 binding
  – H3K4me1, H3K4me2, H3K27ac
    • Signals and signal patterns
  – Enhancer RNA
• Lack of inactive marks and marks of other types of regulatory elements

Context specificity

• An enhancer can be active in some contexts (cell type, tissue type, etc.) and inactive/poised in some other contexts

Image credit: Shlyueva et al. Nature Reviews Genetics 15(4):272-286, (2014)

Computational problems

1. Prediction of enhancers
   a) “General”, i.e., genomic regions that are enhancers in some contexts
   b) Context-specific, i.e., enhancers that are active in a given context
2. Prediction of target genes of enhancers
   a) “General”
   b) Context-specific


PREDICTION OF GENERAL ENHANCERS

Part 2a

Static features

• In the past, static features, i.e., features that remain the same across different contexts, were used to predict general enhancers
  – Presence/density of TFBS
    • Based on sequence motifs
  – Evolutionary conservation
    • Based on multiple sequence alignment

Learning approach

• Supervised: Learn features based on known examples
  – Few good examples (more on this later)
• Unsupervised: Threshold features
  – Mostly arbitrarily
• Semi-supervised: Combine information from known examples and distribution of genomic regions in the feature space

Overview of strategies

Image credit: Su et al. PLOS Computational Biology 6(12):e1001020, (2010)

Classification of methods

Image credit: Su et al. PLOS Computational Biology 6(12):e1001020, (2010)

TFBS cluster methods

• Choice of TFs:
  – All TFs with known motifs
  – TFs known to bind enhancers (incomplete knowledge)
• Definition of clusters:
  – High density of TFBS, as compared to background
  – Occurrence of binding sites of the same set of TFs
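As a rough illustration of the "high density of TFBS" definition, the sketch below reports regions where at least `min_sites` predicted binding sites fall within `window` bp of each other. The function name, the window size, and the site-count cutoff are assumptions made for this example; in practice the cutoff would be chosen relative to the background site density, which is not modelled here.

```python
def find_clusters(site_positions, window=200, min_sites=3):
    """Return merged (start, end) regions containing at least min_sites sites within window bp.

    site_positions: predicted TFBS coordinates on a single chromosome.
    """
    positions = sorted(site_positions)
    clusters = []
    left = 0
    for right in range(len(positions)):
        # Shrink the window from the left until it spans at most `window` bp.
        while positions[right] - positions[left] > window:
            left += 1
        if right - left + 1 >= min_sites:
            start, end = positions[left], positions[right]
            if clusters and start <= clusters[-1][1]:
                # Overlaps the previous dense region: extend it instead of adding a new one.
                clusters[-1] = (clusters[-1][0], max(end, clusters[-1][1]))
            else:
                clusters.append((start, end))
    return clusters
```

For example, `find_clusters([100, 150, 180, 1000, 5000, 5050, 5120, 5190])` yields two dense regions, (100, 180) and (5000, 5190), while the isolated site at 1000 is ignored. The second cluster definition on the slide (co-occurrence of sites of the same set of TFs) would additionally require tracking which TF each site belongs to.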

Finding TF binding motifs

• Alignment of promoter sequences

Image credit: D'haeseleer et al. Nature Biotechnology 24(8):959-961, (2006)

Finding TF binding motifs

• High-throughput SELEX (systematic evolution of ligands by exponential enrichment): Testing the binding of TF protein/binding domain with random nucleotide sequences

Image credit: Jolma et al. Cell 152(1-2):327-339, (2013)

Sequence conservation methods

• Non-coding elements with “extreme” conservation: Genomic regions with
  – High human–pufferfish (Takifugu (Fugu) rubripes) conservation, or ultra-high human–mouse–rat conservation
  – High sequence match score
  – No sign of transcription or protein-coding potential
• Results:
  – Among 167 predictions tested in a mouse assay, 45% reproducibly show enhancer activities

Sequence conservation methods

Image credit: Pennacchio et al., Nature 444(7118):499-502, (2006)

TFBS cluster conservation methods

• Conservation of TFBS clustering (distance), affinity and conservation

(A) EEL scoring function. Top: schematic representation of two TFs (blue and red ovals) bound to DNA of unequal length from two different species. Side view (top left) indicates mean distance (x̄) and difference in distance (Δx), and front view (top right) indicates difference in angle (Δϕ) of the two factors bound to DNA (open circle). Position weight matrix scores for TFs were used as a proxy for binding affinity in calculation of ΔGT, the sum of TF affinities to sites in both species. Bottom: the score function. See Supplemental Data for details.

(B) EEL analysis (left) using the five known TFs that regulate eve (Hunchback, Caudal, Knirps, Bicoid, and Kruppel) identifies all four enhancers driving striped expression of Drosophila eve (right). Blue diagonal lines indicate aligned regions, and black lines on the x and y axes represent the conserved TF binding sites that constitute the cis-modules (CM). Number after the CM indicates its rank based on its EEL score.

(C) Text display of EEL alignment of part of the eve Stripe 3/7 enhancer (CM1 from [B]). D. pseudoobscura and D. melanogaster sequences are on top and bottom lines, respectively. EEL aligns the DNA sequences between the conserved TF sites for clarity; the DNA alignment does not contribute to the EEL score. Yellow boxes indicate conserved binding sites of Hunchback (Hb) or Knirps (Kni), which regulate this cis-module (Small et al., 1996).

(D) A distal −20 kb enhancer element in the mouse and human MyoD genes is identified by EEL analysis.

Image credit: Hallikas et al., Cell 124(1):47-59, (2006)

Comparison of methods

• REDfly validated regulatory modules against short exons and introns

Image credit: Su et al. PLOS Computational Biology 6(12):e1001020, (2010)

Combination of methods

• Performance change of method pairs

Image credit: Su et al. PLOS Computational Biology 6(12):e1001020, (2010)


PREDICTION OF CONTEXT-SPECIFIC ENHANCERS

Part 2b

Context-specific enhancer activities

• The above prediction methods can only predict whether a genomic region is an enhancer in some context, but not its actual activity in a given context
• High-throughput sequencing made it possible to obtain a lot of useful context-specific data
  – Protein binding
  – Chromatin accessibility
  – Histone modifications
  – Enhancer RNA

Chromatin accessibility

• Analysis of a known enhancer (HS2 of the β-globin LCR) and other DNase I hypersensitive sites (DHSs) with similar patterns across cell types
  – 14 of the 20 displayed enhancer activity

Image credit: Thurman et al. Nature 489(7414):75-82, (2012)

Histone modifications

• Typical signatures of predicted enhancers

Image credit: Heintzman et al. Nature 459(7423):108-112, (2009)

Enhancer RNA

• Bi-directional, non-coding transcripts around active enhancers
  – May play a functional role in gene regulation

Image credit: Andersson et al., Nature 507(7493):455-461, (2014)

Unsupervised predictions

• Whole-genome segmentation (ChromHMM) using hidden Markov models based on histone marks
  – Manual interpretation of the resulting states

Image credit: Ernst and Kellis, Nature Methods 9(3):215-216, (2012)

Unsupervised predictions

• Whole-genome segmentation using Segway
  – E: enhancer; GM: gene middle

Image credit: Hoffman et al., Nature Methods 9(5):473-476, (2012)

Unsupervised predictions

• Rule-based filtering, starting from the human genome (GRCh37)

Step  Operation                                                              Result
0     Divide into 100bp bins                                                 30,956,951 bins
1     Remove blacklisted regions                                             30,840,726 bins (116,225 bins filtered)
2     Remove bins with K562 BAR score <= 0.9                                 461,722 bins
3a    Remove bins with K562 promoter score > 0.8                             412,000 bins
3b    Remove bins within +/- 2000bp from Gencode TSS                         257,666 bins
4     Remove bins that intersect Gencode exons                               243,951 bins
5     Remove bins with phastCons primate score < 0.1                         97,193 bins
6     Merge adjacent bins into longer intervals                              55,857 intervals
7     Remove intervals with no binding motifs of expressed (IDR < 0.05) TFs  59,425 intervals
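The filtering rules on this slide can be sketched as a chain of per-bin tests followed by a merge of adjacent bins. This is only an illustration under simplifying assumptions: a single chromosome, bins represented as dictionaries with hypothetical keys `start`, `bar`, `promoter`, and `phastcons`, and the final motif-based filter left out.

```python
BIN_SIZE = 100  # bin width in bp, as on the slide


def overlaps(start, end, regions):
    """True if the half-open interval [start, end) intersects any (s, e) region."""
    return any(start < e and s < end for s, e in regions)


def filter_bins(bins, blacklist, tss_positions, exons):
    """Apply the per-bin rules in order; each bin is a dict with hypothetical score keys."""
    kept = []
    for b in bins:
        start, end = b["start"], b["start"] + BIN_SIZE
        if overlaps(start, end, blacklist):
            continue  # blacklisted region
        if b["bar"] <= 0.9:
            continue  # BAR score too low
        if b["promoter"] > 0.8:
            continue  # looks like a promoter
        if any(start <= t + 2000 and end > t - 2000 for t in tss_positions):
            continue  # within +/- 2000bp of a TSS
        if overlaps(start, end, exons):
            continue  # intersects an exon
        if b["phastcons"] < 0.1:
            continue  # not conserved enough
        kept.append(b)
    return kept


def merge_adjacent(bins):
    """Merge bins that touch end-to-start into longer (start, end) intervals."""
    intervals = []
    for b in sorted(bins, key=lambda x: x["start"]):
        start, end = b["start"], b["start"] + BIN_SIZE
        if intervals and intervals[-1][1] == start:
            intervals[-1] = (intervals[-1][0], end)
        else:
            intervals.append((start, end))
    return intervals
```

Because every rule is a hard threshold, the order of the per-bin tests does not change the surviving set; it only changes how many bins are reported as removed at each step.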

Supervised predictions

• If we want to train a supervised model, we need positive and negative examples
• Difficulties:
  – Even one of the biggest databases of validated enhancers, VISTA, contains only 1203 positive examples and 1076 negative examples as of Mar 2015
  – We do not have complete knowledge of the context-specific activities of these examples
    • Positive: We do not know in which contexts they are positive
    • Negative: We do not know whether they are always negative
  – They are biased: usually the most confident former predictions with high sequence conservation

Constructing positive examples

• Use some features to define positive examples
• These features should not be used in the prediction process (except in some special settings), otherwise
  – They will dominate the resulting models
  – Prediction accuracy cannot be evaluated

Constructing negative examples

• Sampling genomic regions likely to be negatives
  – Random regions
    • Too negative, decision boundary can be fuzzy
  – Other types of sequence element
    • Resulting model is for distinguishing between these element types rather than identifying enhancers
    • Hard to ensure mutual exclusiveness
  – Enhancers known to be active only in other contexts
    • Hard to obtain
  – Combination of the above
• In general, it is always a tough decision as to what properties of the positive examples the negative examples should match
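The "random regions" option can be sketched as rejection sampling: draw random intervals and discard those that overlap known positives or other excluded regions. The function name and parameters below are hypothetical, and only a single chromosome is considered; matching negatives to properties of the positives (the tough decision noted above) is not attempted.

```python
import random


def sample_negatives(chrom_length, excluded, n, length, seed=0, max_tries=10000):
    """Draw up to n non-overlapping random regions of a fixed length that avoid excluded intervals.

    excluded: list of (start, end) half-open intervals, e.g. known enhancers and promoters.
    """
    rng = random.Random(seed)  # seeded so that the sample is reproducible
    negatives = []
    tries = 0
    while len(negatives) < n and tries < max_tries:
        tries += 1
        start = rng.randrange(0, chrom_length - length + 1)
        end = start + length
        # Reject candidates overlapping an excluded region or an already chosen negative.
        if any(start < e and s < end for s, e in list(excluded) + negatives):
            continue
        negatives.append((start, end))
    return negatives
```

Fewer than `n` regions are returned if `max_tries` is exhausted, which can happen when the excluded regions cover most of the chromosome.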

Supervised predictions

• RFECS (Random Forest based Enhancer identification from Chromatin States)
  – Positives: gene-distal P300 binding sites overlapping DHSs
  – Negatives:
    • TSSs overlapping DHS
    • Random regions distal from P300 binding sites or TSSs
  – Features:
    • 24 histone marks
    • For each histone mark, average signal of 20 bins around the target region (to capture signal pattern)
  – Machine learning model:
    • Random Forest
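The feature construction described above (for each histone mark, the average signal in 20 bins around the target) can be sketched as follows. This is not the RFECS code: the bin width of 100bp, the per-base signal lists, and the function names are assumptions for illustration, and the Random Forest training step is omitted.

```python
def mark_profile(signal, center, n_bins=20, bin_width=100):
    """Average per-base signal in n_bins consecutive bins centred on `center`.

    signal: list of per-base values for one histone mark on one chromosome.
    Bins that fall entirely outside the chromosome get 0.0.
    """
    half = n_bins * bin_width // 2
    profile = []
    for i in range(n_bins):
        lo = center - half + i * bin_width
        hi = lo + bin_width
        lo_c, hi_c = max(lo, 0), min(hi, len(signal))
        values = signal[lo_c:hi_c] if lo_c < hi_c else []
        profile.append(sum(values) / len(values) if values else 0.0)
    return profile


def feature_vector(tracks, center, n_bins=20, bin_width=100):
    """Concatenate the binned profiles of all marks (in sorted name order) into one feature vector."""
    features = []
    for mark in sorted(tracks):
        features.extend(mark_profile(tracks[mark], center, n_bins, bin_width))
    return features
```

With 24 histone marks and 20 bins each, every candidate region would be described by 24 × 20 = 480 features, so the classifier sees the shape of each signal around the site rather than a single summary value.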

Supervised predictions

• RFECS: Patterns around P300 binding sites

Image credit: Rajagopal et al., PLOS Computational Biology 9(3):e1002968, (2013)

Supervised predictions

• RFECS: Prediction accuracy

Image credit: Rajagopal et al., PLOS Computational Biology 9(3):e1002968, (2013)

Supervised predictions

• RFECS: Feature importance and co-occurrence

Image credit: Rajagopal et al., PLOS Computational Biology 9(3):e1002968, (2013)

Semi-supervised predictions

• Several main approaches
  – Mainly unsupervised, but using known examples to bias the clustering process (e.g., requiring that certain regions receive the same state)
  – Mainly supervised, but using global distribution of regions in the feature space to discover sub-classes
  – Optimizing a function that includes:
    • Prediction accuracy of known examples
    • Likelihood/posterior probability of data
    • Model complexity
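One generic way to write the third approach is as a single objective with three terms. The form below is an illustrative assumption, not a formula from a specific method: \(\mathcal{L}\) denotes the set of known examples, \(\ell\) a loss function, \(f_\theta\) the model, \(X\) all genomic regions, \(\Omega\) a complexity penalty, and \(\lambda_1, \lambda_2\) trade-off weights.

```latex
\min_{\theta} \; J(\theta) =
  \underbrace{\sum_{i \in \mathcal{L}} \ell\bigl(y_i, f_\theta(x_i)\bigr)}_{\text{error on known examples}}
  \; - \; \lambda_1 \underbrace{\log p(X \mid \theta)}_{\text{likelihood of data}}
  \; + \; \lambda_2 \underbrace{\Omega(\theta)}_{\text{model complexity}}
```

Setting \(\lambda_1 = 0\) reduces this to a regularized supervised model, while dropping the first term leaves a penalized unsupervised model.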


PREDICTION OF ENHANCER TARGETS

Part 3

Enhancer targets

• In theory, enhancers can be upstream or downstream of, and either near or far away from their targets
• Features useful for identifying enhancer targets:
  – Distance
    • Closest gene(s)
  – Activity correlations
  – Co-conservation/co-evolution

Distance

• It is true that
  – Enhancers can be far away from their targets
  – The gene closest to an enhancer may not be its target
• However,
  – In general, the closer an enhancer is to a gene in the DNA sequence, the higher the chance that the gene is the target of the enhancer
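The simplest distance-based rule assigns each enhancer to the gene with the closest TSS. The sketch below assumes a single chromosome and a non-empty list of `(position, gene_name)` pairs sorted by position; the function name is hypothetical, and, as noted above, the closest gene is not guaranteed to be the true target.

```python
import bisect


def nearest_gene(enhancer_mid, tss_list):
    """Return (gene_name, distance) for the TSS closest to the enhancer midpoint.

    tss_list: non-empty list of (position, gene_name), sorted by position, one chromosome.
    """
    positions = [pos for pos, _ in tss_list]
    i = bisect.bisect_left(positions, enhancer_mid)
    # Only the TSSs immediately to the left and right of the enhancer can be the closest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - enhancer_mid))
    return gene, abs(pos - enhancer_mid)
```

A natural extension is to return all genes within a distance cutoff, ranked by distance, and to combine that ranking with the other features listed for target prediction.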

Activity correlation

• Main idea:
  – If an enhancer regulates a gene, the activity of the enhancer should be correlated with the expression of the gene
• Using this idea:
  1. Compute correlation for all enhancer-target pairs within a certain maximum distance
  2. Return the ones with significant correlations
• Issues:
  – Quantification of enhancer activity
  – Multiple hypothesis testing
  – An enhancer may only regulate a gene in some contexts
  – A gene may have more than one regulating enhancer
  – Cannot identify context-specific regulation
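The two steps above can be sketched with a Pearson correlation over contexts and a Benjamini-Hochberg correction for the multiple testing issue. The choice of Pearson correlation, the 1 Mb distance limit, and the data layout are assumptions for this example; how p-values are derived from the correlations (e.g., by permutation) is deliberately left open.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length activity vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def candidate_pairs(enhancers, genes, max_distance=1_000_000):
    """Step 1: correlate every enhancer-gene pair within max_distance (same chromosome assumed).

    enhancers / genes: dict name -> (position, activity or expression across the same contexts).
    """
    pairs = []
    for e_name, (e_pos, e_act) in enhancers.items():
        for g_name, (g_pos, g_expr) in genes.items():
            if abs(e_pos - g_pos) <= max_distance:
                pairs.append((e_name, g_name, pearson(e_act, g_expr)))
    return pairs


def benjamini_hochberg(p_values, alpha=0.05):
    """Step 2 helper: flag which p-values pass the Benjamini-Hochberg procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank  # largest rank whose p-value is under its threshold
    passed = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            passed[i] = True
    return passed
```

Because the correlation is computed across all contexts at once, this sketch inherits the last issue on the slide: it cannot say in which context a pair is actually regulated.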

Co-conservation/co-evolution

• Main idea: If an enhancer regulates a gene, they will
  – Co-occur in genomes
  – Mutate together

Identifying enhancer targets

Image credit: He et al., PNAS 111(21):E2191-E2199, (2014)

EXPERIMENTAL VALIDATIONS

Part 4

Validating enhancers

• Reporter assay
  – Put a construct with an enhancer candidate, a reporter gene, and a weak promoter
  – If the enhancer is active, the reporter gene will be transcribed
• Limitations: Not the natural context
  – Distance
  – Chromatin state
  – Presence of relevant TFs

Reporter assay

ID             Enhancer activity  Tissues with patterns
ENH_DISCR_2    Positive           Tectum, Fin
ENH_DISCR_38   Negative
ENH_DISCR_16   Negative           Not consistent
ENH_DISCR_18   Positive           Telencephalon
ENH_DISCR_37   Positive           Epidermis
ENH_DISCR_14   Negative           Not consistent
ENH_DISCR_19   Negative           Not consistent
ENH_DISCR_34   Positive           Epidermis
ENH_DISCR_41   Positive           Blood_heart
ENH_DISCR_44   Weak               Blood
ENH_DISCR_24   Negative           Not consistent
ENH_DISCR_1    Negative           Not consistent/heart
ENH_DISCR_17   Positive           Telencephalon
ENH_DISCR_32   Positive           Telencephalon
ENH_DISCR_47   Positive           Epidermis
ENH_DISCR_35   Positive           Blood, ear
ENH_DISCR_45   Weak               Epidermis
ENH_DISCR_21   Positive           Epidermis, late
ENH_DISCR_12   Negative
ENH_DISCR_13   Positive           Telencephalon
ENH_DISCR_22   Weak               Epidermis_blood
ENH_DISCR_26   Negative           Not consistent
ENH_DISCR_31   Weak               Blood
ENH_DISCR_40   Positive           Blood
ENH_DISCR_48   Positive           Epidermis, blood
ENH_DISCR_25   Positive           Tectum
ENH_DISCR_29   Positive           Telencephalon

Image credit: The ENCODE Project Consortium, Nature 489(7414):57-74, (2012)

Massively parallel reporter assay

• Investigating the impact of mutations to enhancers

Image credit: Melnikov et al., Nature Biotechnology 30(3):271-277, (2012)

STARR-seq

• The candidate enhancer is placed downstream of the reporter gene, so that it becomes part of the transcript and its activity can be determined by RNA-seq

Image credit: Arnold et al., Science 339(6123):1074-1077, (2013)

Enhancer knock-out

• Current technology (e.g., CRISPR) allows for precise deletion of an enhancer candidate. Effect on gene expression can then be determined

Image credit: Hsu et al., Cell 157(6):1262-1278, (2014)

DNA long-range interactions

• Hi-C/TCC: Not specific to transcription regulation
• ChIA-PET: Requires a relevant protein

Image credit: Zeng and Mortazavi, Nature Immunology 13(9):802-807, (2012)

Summary

• Enhancers are one important type of transcriptional regulatory element
• There are many imperfect signatures of enhancers
• Activities of some enhancers are context-specific
• The target gene of an enhancer can be far away from it, but not too far in general
• (Qin has written a review on enhancer and enhancer target predictions: http://www.cse.cuhk.edu.hk/~kevinyip/papers/EnhancerReview_CBBGR2015.pdf)