Evolutionary and genomic approaches to find gene regulatory sequences
Penn State University, Center for Comparative Genomics and Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton Nekrutenko, Kateryna Makova, Stephan Schuster, Ross Hardison
University of California at Santa Cruz: David Haussler, Jim Kent
Children’s Hospital of Philadelphia: Mitch Weiss
NimbleGen: Roland Green
University of Nebraska, Lincoln, February 14, 2007
Major goals of comparative genomics
• Identify all DNA sequences in a genome that are functional
– Selection to preserve function
– Adaptive selection
• Determine the biological role of each functional sequence
• Elucidate the evolutionary history of each type of sequence
• Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research
Known types of gene regulatory regions
G.A. Maston, S.K. Evans, M.R. Green (2006) Ann. Rev. Genomics & Human Genetics 7:29-59.
Regulatory regions tend to be clusters of transcription factor binding sites
Sequence-specific
SV40 promoters and enhancer
Properties of known regulatory regions
• Binding sites for transcription factors, many with sequence specificity
• Clusters of binding sites
• Conventional promoters encompass major start sites for transcription
• Conserved over evolutionary time???
Structures involved in transcription are probably more complex
Peter R. Cook, Oxford University, http://users.path.ox.ac.uk/~pcook/images/Images.html
Middle image (HeLa cell): green, active transcription (Br-UTP label); red, all nucleic acids
Sides: EM spreads of transcripts
Domain opening is associated with movement to non-heterochromatic regions
Schubeler, Francastel, Cimbora, Reik, Martin, Groudine (2000) Genes & Dev. 14: 940-950
Other possible activities for sequences involved in gene regulation
• Opening or closing a chromosomal domain
• Move a gene to or away from a transcription factory
• Control how long a gene is in a transcription factory
– Long association
    • High level expression
    • Really long gene
– Short association
    • Lower level expression
    • Rapid regulation
• Are these conserved over evolutionary time?
3 modes of evolution
Sequence matches at longer phylogenetic distances could reflect purifying selection.
Sequence differences at closer phylogenetic distances could reflect adaptive evolution.
Conservation vs. Constraint
• Conserved sequences are those that align between two species thought to be descended from a common ancestor
• Constrained sequences show evidence in their alignments of negative (purifying) selection
– E.g., they change at a rate significantly slower than “neutral” DNA
Ideal cases for interpretation
Figure: similarity, or P (not neutral), plotted against position along the chromosome, relative to the level for neutral DNA
Human vs mouse: negative (purifying) selection. DNA segments with a function common to divergent species.
Human vs rhesus: positive (adaptive) selection. DNA segments in which change is beneficial to at least one of the two species.
Messages about evolutionary approaches to predicting regulatory regions
• Regulatory regions are conserved, but not all to the same phylogenetic distance.
• Incorporation of pattern and composition information along with conservation can lead to effective discrimination of functional classes (regulatory potential).
• Regulatory potential in combination with conservation of a GATA-1 binding motif is an effective predictor of enhancer activity.
• In vivo occupancy by GATA-1 suggests other activities in addition to enhancers.
• Comparison of polymorphism and divergence from closely related species can reveal regulatory regions that are under recent selection.
Finding all gene regulatory regions is a challenge for comparative genomics
• Known regulatory regions for the HBB complex
• 23 total
• 19 conserved (align) between human and mouse
• Many others show no significant difference in a measure of constraint (phastCons) from the bulk or neutral DNA
Two extremes of constraint in TRRs
ENCODE projects
• ENCODE (ENCyclopedia Of DNA Elements): consortium aiming to find function for all human DNA sequences
– Phase I focused on 1% of human DNA
– 30 Mb, 44 regions
• About 10 regions had known genes of interest (CFTR, HOX)
• Others were chosen to get a sampling of regions varying in gene density and alignability with mouse
• Major areas
– Genes and transcripts
– Transcriptional regulation
– Chromatin structure
– Multiple sequence alignment
– Variation in human populations
Biochemical assays for protein-binding sites in DNA
Purified protein & naked DNA
Chromatin immunoprecipitation: DNA sites occupied by a protein inside cells.
ChIP-on-chip to examine many sites
Putative transcriptional regulatory regions = pTRRs
• Antibodies vs 10 sequence-specific factors:
– Sp1, Sp3, E2F1, E2F4, cMyc, STAT1, cJun, CEBPe, PU1, RA Receptor A
– High resolution ChIP-chip platforms: Affymetrix and NimbleGen
– Data from several different labs in ENCODE consortium
• High likelihood hits for ChIP-chip
– 5% false discovery rate
• Supported by chromatin modification data
– Modified histones in chromatin: H4Ac, H3Ac, H3K4me, H3K4me2, H3K4me3, etc.
– DNase hypersensitive sites (DHSs) or nucleosome depleted sites
• Result: set of 1369 pTRRs
A small fraction of cis-regulatory modules are conserved from human to chicken
Figure: divergence times in millions of years (labels: 310, 450, 91, 173)
• About 4% of pTRRs, 4% of DNase HSs, 4-7% of promoters active in multiple cell lines
• Tend to regulate genes whose products control transcription and development
David King
Most pTRRs are conserved in eutherian mammals
Figure: divergence times in millions of years (labels: 310, 450, 91, 173)
Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA.
Percentage of class that align no further than:
Clade         pTRRs   DNase HSs   Promoters
Primates      3%      11%         1-13%
Eutherians    71%     70%         63%
Marsupials    21%     14%         16-28%
Tetrapods     4%      4%          4-7%
Vertebrates   1%      1%          2-4%
Measures of conservation and constraint capture only a subset of pTRRs
• Fraction overlapping an MCS: stringent constraint
• phastCons (background rate corrected): allows a range of constraint
• Composite alignability (background rate corrected): aligns, but no inference about purifying selection
Different measures perform better on specific functional regions
Figure: sensitivity plotted against 1-specificity
Examples of clade-specific pTRRs
Regulatory potential (RP) to distinguish functional classes
Good performance of ESPERR for gene regulatory regions (RP)
James Taylor, Francesca Chiaromonte
Conservation of predicted binding sites for transcription factors
Binding site for GATA-1
Genes Co-expressed in Late Erythroid Maturation
G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1. Can be rescued by expressing an estrogen-responsive form of GATA-1.
Rylski et al., Mol Cell Biol. 2003
Predicted cis-Regulatory Modules (preCRMs) Around Erythroid Genes
B: Yong Cheng, Ross, Yuepin Zhou, David King
F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang
preCRMs with conserved consensus GATA-1 BS tend to be active on transfected plasmids
preCRMs with conserved consensus GATA-1 BS tend to be active after integration into a chromosome
Examples of validated preCRMs
Correlation of Enhancer Activity with RP Score
Validation status for 99 tested fragments
preCRMs with High RP and Conserved Consensus GATA-1 Tend To Be Validated
Compare the outputs
Consensus for EKLF binding site: CCNCMCCCW
All validated preCRMs vs. all nonvalidated preCRMs (same parameters)
CACC box helps distinguish validated from nonvalidated preCRMs
Ying Zhang
preCRMs with conserved consensus GATA-1 binding sites are usually occupied by that protein: ChIP assay
Design of ChIP-chip for occupancy by GATA-1
1. Non-overlapping tiling array with 50 bp probes and 100 bp resolution (NimbleGen)
2. Cover range Mouse chr7:57225996-123812258 (~70 Mbp)
3. Antibody against the ER portion of GATA-1-ER protein in rescued G1E-ER4 cells
Yong Cheng, with Mitch Weiss & Lou Dore (CHoP), Roland Green (NimbleGen)
Signals in known occupied sites in Hbb LCR
1) Cluster of high signals
2) “hill” shape of the signals
HS1 HS2 HS3
Peak Finding Programs
• TAMALPAIS
– Mark Bieda, from Peggy Farnham’s lab
– Focuses more on the cluster of the signals
– 4 thresholds based on number of consecutive probes with signals in the 98th or 95th percentiles
• MPEAK
– Bing Ren’s lab
– Focuses more on the “hill” shape of the signal
– 4 thresholds, for a series of probes with at least one that is 3, 2.5, 2 or 1 standard deviations above the mean
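The two thresholding ideas above can be illustrated with a minimal sketch. This is not the code of TAMALPAIS or MPEAK; the function names, the nearest-rank percentile, and the toy signal values are all assumptions made for illustration.

```python
import math
import statistics

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def runs_above(signals, cutoff, min_run):
    """Cluster-style rule: return (start, end) index pairs, end exclusive,
    for runs of at least min_run consecutive probes with signal >= cutoff."""
    runs, start = [], None
    for i, value in enumerate(signals):
        if value >= cutoff:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    if start is not None and len(signals) - start >= min_run:
        runs.append((start, len(signals)))
    return runs

def sd_cutoff(signals, n_sd):
    """Hill-style rule: signal level that is n_sd standard deviations
    above the mean of all probes (population standard deviation)."""
    return statistics.mean(signals) + n_sd * statistics.pstdev(signals)

# Toy probe signals (hypothetical values, not from the experiment).
toy = [0.1, 0.2, 2.0, 2.5, 2.2, 0.3, 2.1, 0.1, 2.4, 2.6]
clusters = runs_above(toy, cutoff=2.0, min_run=2)
```

In practice the cutoff passed to `runs_above` would come from `percentile(signals, 98)` or `percentile(signals, 95)`, and a hill-style caller would require at least one probe in a candidate region to exceed `sd_cutoff(signals, 3)`, 2.5, 2 or 1.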
ChIP-chip hits for GATA-1 occupancy
Mpeak: 275 hits in both
TAMALPAIS: 276 hits in both
Venn diagram counts: 216, 60, 59
321 total ChIP-chip hits
Technical replicates of ChIP-chip with antibody against GATA1-ER
ChIP-chip hits validate at a high rate
Validation determined by quantitative PCR.
19 of the 321 hits were tested.
13 (~70%) were validated.
9 regions were “hits” in only one of the two technical replicates.
None were validated.
Validation rate is similar at different thresholds
ChIP DNA
Association of WGATAR and conservation with ChIP-chip hits
1. 249 out of the 321 (78%) have WGATAR motifs, binding site for GATA-1
2. Of the GATA-1 binding motifs in those 249 hits, 112 (45%) are conserved between mouse and at least one non-rodent species.
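A motif scan of the kind summarized above can be sketched as follows. WGATAR uses the IUPAC codes W (A or T) and R (A or G), so the reverse strand reads YTATCW. The function names and the example sequences are hypothetical, and the conservation check is a simplification that assumes the caller supplies the six aligned bases from each non-rodent species.

```python
import re

# Lookahead patterns so that overlapping matches are also reported.
FORWARD = re.compile(r"(?=[AT]GATA[AG])")   # WGATAR on the given strand
REVERSE = re.compile(r"(?=[CT]TATC[AT])")   # reverse complement, YTATCW

def find_wgatar(seq):
    """Return sorted (position, strand) pairs for WGATAR matches."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in FORWARD.finditer(seq)]
    hits += [(m.start(), "-") for m in REVERSE.finditer(seq)]
    return sorted(hits)

def conserved_in_any(aligned_sites):
    """aligned_sites maps a non-rodent species name to the 6 bases aligned
    to a mouse WGATAR motif. True if at least one of them still matches."""
    return any(len(site) == 6 and find_wgatar(site)
               for site in aligned_sites.values())

# Hypothetical example sequences.
hits = find_wgatar("CCAGATAGCC")
kept = conserved_in_any({"human": "AGATAA", "dog": "AG-TAA"})
```

A gap character in an aligned site simply fails the match, which is the conservative choice for this sketch.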
Expected and unexpected ChIP-chip hits
Distribution of ChIP-chip hits on 70Mb of mouse chr7
Yong Cheng, Yuepin Zhou and Christine Dorman
Almost half the GATA-1 ChIP-chip hits increase expression of a transgene, K562 cells
Figure: bar chart of fold change over parent (axis 0 to 4) for the tested fragments (labels GHP…, GHN…, YC3), grouped as GATA-1 occupied sites by ChIP-chip vs. no GATA-1 (counts shown on the figure: 15, 6, 6)
24 of the 56 tested fragments with ChIP-chip hits were validated (43%)
Conserved and nonconserved ChIP-chip hits can be active as enhancers
Panels: conserved, active; conserved, not active; not conserved, active
Polymorphism as a transient phase of evolution
Slide from Dr. Hiroshi Akashi
Test of neutrality using polymorphism and divergence data
Test for recent selection in human noncoding DNA
• McDonald-Kreitman test
• Use ancestral repeats as neutral model (MKAR test)
• Count polymorphisms in human using dbSNP126
• Count divergence of human from
– Chimpanzee (great ape, diverged from human lineage 6 Myr ago)
– Rhesus macaque (Old World monkey, diverged from human lineage 23 Myr ago)
• Tiled windows, most analysis on 10 kb windows
• Compute p-value for neutrality by chi-square test
• The ratio of the polymorphism-to-divergence ratios (window vs. ancestral repeats) gives an indication of the direction of inferred selection
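The test described above amounts to a 2 x 2 contingency table: polymorphism and divergence counts in a window, against the same counts in ancestral repeats. A minimal sketch follows. The function name and the counts are hypothetical, and the chi-square is computed without a continuity correction, which is an assumption since the slide does not say which variant was used. For one degree of freedom the p-value equals erfc(sqrt(chi2 / 2)).

```python
import math

def mkar_test(poly_win, div_win, poly_ar, div_ar):
    """Chi-square test on [[poly_win, div_win], [poly_ar, div_ar]].
    Returns (chi2, p_value, ratio_of_ratios). All counts must be > 0."""
    table = [[poly_win, div_win], [poly_ar, div_ar]]
    total = poly_win + div_win + poly_ar + div_ar
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # (P/D in the window) divided by (P/D in ancestral repeats).
    ratio = (poly_win / div_win) / (poly_ar / div_ar)
    return chi2, p_value, ratio

# Hypothetical counts for one 10 kb window and its neutral reference.
chi2, p_value, ratio = mkar_test(10, 20, 20, 10)
```

In the usual reading of MK-type tests, a ratio below 1 (excess divergence) points toward positive selection and a ratio above 1 (excess polymorphism) toward purifying selection, but that interpretation is stated here as background rather than taken from the slide.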
Heather Lawson, Anthropology, PSU
pTRR apparently under positive selection
A promoter distal to the beta-like globin genes has a signal for recent purifying selection
Selection on a primate-specific promoter
The distal promoter is close to the locus control region for beta-globin genes
Many thanks …
B: Yong Cheng, Ross, Yuepin Zhou, David King
F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang
PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko
Alignments, chains, nets, browsers, ideas, …: Webb Miller, Jim Kent, David Haussler
RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU
Computing Regulatory Potential (RP)
Alignment
seq1                G T A C C T A C T A C G C A
seq2                G T G T C G - - A G C C C A
seq3                A T G T C A - - A A T G T A
Collapsed alphabet  1 2 1 3 4 5 7 7 6 8 3 6 3 9
• A 3-way alignment has 124 types of columns (5 × 5 × 5 combinations of A, C, G, T and gap, minus the all-gap column). Collapse these to a smaller alphabet with characters s (for example, 1-9).
• Train two order t Markov models for the probability that t alignment columns are followed by a particular column in training sets:
– positive (alignments in known regulatory regions)
– negative (alignments in ancestral repeats, a model for neutral DNA)
– E.g., frequency that 3 4 is followed by 5: 0.001 in regulatory regions, 0.0001 in ancestral repeats
• RP of any 3-way alignment is the sum of the log likelihood ratios of finding the strings of alignment characters in known regulatory regions vs. ancestral repeats.
$$
\mathrm{RP} = \sum_{a \,\in\, \text{segment}} \log\left( \frac{p_{\mathrm{REG}}(s_a \mid s_{a-1} \ldots s_{a-t})}{p_{\mathrm{AR}}(s_a \mid s_{a-1} \ldots s_{a-t})} \right)
$$
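The RP calculation can be sketched as two order-t Markov models over the collapsed alphabet, scored as a sum of log ratios. This is an illustration, not the ESPERR implementation: the function names, the add-one pseudocount, the natural logarithm, and the toy training strings are all assumptions.

```python
import math
from collections import Counter

def train_markov(sequences, order, alphabet, pseudocount=1.0):
    """Estimate p(symbol | previous `order` symbols) from training sequences
    of collapsed-alphabet symbols. Returns a function prob(symbol, context)."""
    context_counts = Counter()
    transition_counts = Counter()
    for seq in sequences:
        for a in range(order, len(seq)):
            context = tuple(seq[a - order:a])
            context_counts[context] += 1
            transition_counts[(context, seq[a])] += 1

    def prob(symbol, context):
        # The pseudocount keeps unseen transitions from giving zero probability.
        numerator = transition_counts[(context, symbol)] + pseudocount
        denominator = context_counts[context] + pseudocount * len(alphabet)
        return numerator / denominator

    return prob

def rp_score(segment, p_reg, p_ar, order):
    """Sum over positions a of log(p_REG / p_AR) for symbol s_a given
    the preceding `order` symbols."""
    score = 0.0
    for a in range(order, len(segment)):
        context = tuple(segment[a - order:a])
        score += math.log(p_reg(segment[a], context) / p_ar(segment[a], context))
    return score

# Toy training sets (hypothetical): "3 4" is followed by 5 in the positive
# set and by 6 in the negative set.
alphabet = list(range(1, 10))
p_reg = train_markov([[3, 4, 5, 3, 4, 5, 3, 4, 5]], 2, alphabet)
p_ar = train_markov([[3, 4, 6, 3, 4, 6, 3, 4, 6]], 2, alphabet)
score = rp_score([3, 4, 5], p_reg, p_ar, order=2)
```

With the frequencies quoted on the slide (0.001 vs. 0.0001), the column 5 following 3 4 would contribute log(10) to the sum under the same formula.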
ESPERR: Evolutionary Sequence and Pattern Extraction using Reduced Representations
Stage 1: Reduced representations
Figure: example alignment column (G, T, gap)
Stage 2: Improve encoding
Train models for classification
Note that many different columns are reduced to a single “encoding” (a number in the figure). E.g., four different columns are each called “3”.
6 6 2 may occur frequently in the positive training set and rarely in the negative training set, and thus contribute to discrimination. If the positive training set is known regulatory regions, this would contribute to a positive RP.
Categories of Tested DNA Segments
Example that suggests turnover
GATA-1 BSs
All validated preCRMs All nonvalidated preCRMs
Background:
Mouse chr 19 (42.8% C+G) - NCBI Build 30
CLOVER (Zlab)
EKLF PWM (Dr. Perkins)
ELPH (UMaryland)
Hexamer Counting
Output for validated preCRMs
Motif   P(mm_chr19.m)
EKLF    0.0008
Output for nonvalidated preCRMs
Motif   P(mm_chr19.m)
none    none
         validated    non-validated
6-mer    TTATYT       GGCAGR
7-mer    CCWCAGM      RGRCAGR
8-mer    CASCCWGC     CAGGGAWR
9-mer    CCWGGCWGM    CWGRGAWRA

counts     validated   nonvalidated
NCACCC     60          32
CACCCW     56          27
expected   validated   nonvalidated
NCACCC     16.31       5.81
CACCCW     11.74       4.36
Additional methods find CACC box as distinctive for validation
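Counting occurrences of a degenerate hexamer such as NCACCC or CACCCW in a set of sequences can be sketched as below. The function name and the example sequences are hypothetical; the sketch counts forward-strand matches only and does not reproduce how the expected counts on the slide were derived from the chr19 background.

```python
import re

# IUPAC nucleotide codes that appear in the motifs on these slides.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "[ACGT]", "W": "[AT]", "M": "[AC]", "R": "[AG]",
         "Y": "[CT]", "S": "[CG]", "K": "[GT]"}

def count_iupac(motif, sequences):
    """Count forward-strand matches of an IUPAC motif across sequences.
    A lookahead is used so that overlapping matches are all counted."""
    body = "".join(IUPAC[ch] for ch in motif.upper())
    pattern = re.compile("(?=" + body + ")")
    return sum(len(pattern.findall(seq.upper())) for seq in sequences)

# Hypothetical sequences standing in for a set of preCRMs.
example = ["GGCACCCTGG", "CACCCACACCCA"]
observed = count_iupac("CACCCW", example)
```

Dividing such an observed count by an expected count from a background model gives the kind of comparison tabulated above.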
Using Galaxy to find predicted CRMs