Functional Importance and Selective Constraints in Human
non-coding DNA
Shamil Sunyaev
Is it all evolutionary junkyard?… and probably is an evolutionary junkyard
Most of the Genome is Non-coding
acgtcttcccttaggatc
gcatcttcccttaggcgc
However, many genomic regions
are highly conserved!
Definition:
Conservation \Con`ser*va"tion\, n. [L. conservatio: cf. F. conservation.] The preservation of a genetic
sequence over time due to natural selection.
Natural selection
Phenotype
Effect on molecular
function
Conservation as a
functional genomics assay
Practical aspect of conservation
• Prediction of new non-coding functional elements
• Search for allelic variation underlying human phenotypes (even if we do not know function)
Theoretical aspect of conservation
• Estimation of genomic rate of deleterious mutations (U > 1 already implies synergistic epistasis;
estimates including non-coding sequence are much larger).
• Genetics of common phenotypes (Why do they exist
in spite of purifying selection?).
Individual human genome is a target for deleterious mutations
Common disease / Common variant
Trade off (antagonistic pleiotropy)
Balancing selection
Recent positive selection
Reverse in direction of selection
Examples
APOE Alzheimer’s disease
AGT Hypertension
CYP3A Hypertension
Multiple mostly rare variants
Many deleterious alleles in mutation-
selection balance
Examples
Plasma level of HDL-C
Plasma level of LDL-C
Colorectal adenomas
Questions:
• How many nucleotides in the human genome are selectively constrained?
• How are selectively constrained nucleotides distributed along the genome?
• How strong is selective pressure in non-coding regions?
• Do comparative and functional genomics data agree?
Selection coefficient
Wild type: N1 = 4
New mutation: N2 = 3
N2 / N1 = 1 – s
The selection coefficient is a measure of the reproductive disadvantage (or advantage) associated with a mutation.
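As a minimal sketch of the definition above, the selection coefficient can be computed directly from the two numbers on the slide. It assumes that N1 and N2 are the average offspring numbers of wild-type and mutant carriers; the function name is illustrative.

```python
def selection_coefficient(n_wild_type: float, n_mutant: float) -> float:
    """Return s such that n_mutant / n_wild_type = 1 - s."""
    if n_wild_type <= 0:
        raise ValueError("wild-type offspring number must be positive")
    return 1.0 - n_mutant / n_wild_type


# Numbers from the slide: N1 = 4 (wild type), N2 = 3 (new mutation).
print(selection_coefficient(4, 3))  # 0.25
```

A positive s corresponds to a reproductive disadvantage and a negative s to an advantage.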
Divergence (K)
Every new mutation will eventually be either fixed or lost
[Figure: K/K0 (0.0 to 1.0) versus selection coefficient s (10^-6 to 10^-3), from neutral behavior to complete conservation]
s – selection coefficient
Ne – effective population size
For humans Ne is estimated to be ~10,000
For mice Ne is estimated to be ~85,000
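The shape of the K/K0 curve can be sketched numerically. The formula is not given on the slide, so this example assumes Kimura's diffusion result for additive selection, K/K0 = S / (e^S - 1) with S = 4·Ne·s for a mutation with disadvantage s. Other scaling conventions (for example 2·Ne·s) shift the curve by a factor of two in s.

```python
import math


def relative_divergence(s: float, n_e: float) -> float:
    """K/K0 for mutations with selective disadvantage s >= 0.

    Assumption: K/K0 = S / (exp(S) - 1) with S = 4 * n_e * s.
    """
    big_s = 4.0 * n_e * s
    if big_s == 0.0:
        return 1.0          # neutral mutations fix at the mutation rate
    if big_s > 700.0:
        return 0.0          # exp would overflow; the ratio is effectively zero
    return big_s / math.expm1(big_s)


# Effective population sizes quoted on the slide.
for label, n_e in (("human", 10_000), ("mouse", 85_000)):
    for s in (1e-6, 1e-5, 1e-4, 1e-3):
        print(f"{label:5s} Ne={n_e:6d} s={s:.0e} K/K0={relative_divergence(s, n_e):.3g}")
```

Under these assumptions the same s gives a lower K/K0 in the larger mouse population, because the curve depends on s only through the product Ne·s.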
Divergence depends on…
• Mutation rate
• Phylogenetic distance (mostly unknown)
• Selection (negative or positive)
Divergence (conservation) is not informative about strength of selection.
Nucleotide diversity (π)
Database SNP density is proportional to π
[Figure: nucleotide diversity (0.0 to 1.0) versus selection coefficient s (10^-6 to 10^-3)]
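A comparable sketch for nucleotide diversity, relative to its neutral value, is given below. It assumes Kimura's diffusion result for the expected heterozygosity contributed by mutations with disadvantage s, again with the scaling S = 4·Ne·s. This scaling is an assumption and not a convention taken from the slides.

```python
import math


def relative_diversity(s: float, n_e: float) -> float:
    """pi(s) / pi(0) for mutations with selective disadvantage s >= 0.

    Assumption: pi(s)/pi(0) = 2 (exp(S) - 1 - S) / (S (exp(S) - 1)), S = 4 * n_e * s.
    """
    if s < 0:
        raise ValueError("this sketch only covers deleterious mutations (s >= 0)")
    big_s = 4.0 * n_e * s
    if big_s < 1e-6:
        return 1.0 - big_s / 6.0      # first-order series, avoids cancellation
    if big_s > 700.0:
        return 2.0 / big_s            # large-S limit, avoids overflow in exp
    growth = math.expm1(big_s)
    return 2.0 * (growth - big_s) / (big_s * growth)


for s in (1e-6, 1e-5, 1e-4, 1e-3):
    print(f"s={s:.0e} pi/pi0={relative_diversity(s, 10_000):.3g}")
```

Under these assumptions diversity falls off more slowly with s than divergence does, so a given s reduces π less than it reduces K.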
Nucleotide diversity depends on…
• Mutation rate
• Coalescent variance
• Background selection
• Hitchhiking
• Selection (negative or positive)
Allele frequency spectrum
The higher the selection coefficient s (the stronger the selection), the higher the proportion of low-frequency alleles
F0.05 – the proportion of new alleles with frequency below 5%
[Figure: proportion of new alleles with a particular frequency (0.00 to 0.12) versus new allele frequency (0.0 to 1.0), for s = 0, s = 0.0001 and s = 0.0003]
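The statistic F0.05 can be computed from a list of new-allele frequencies as sketched below. The function name and the two example frequency lists are hypothetical and serve only to show the calculation.

```python
def fraction_rare(new_allele_freqs, threshold=0.05):
    """Proportion of segregating sites whose new-allele frequency is below threshold."""
    freqs = [f for f in new_allele_freqs if 0.0 < f < 1.0]  # keep polymorphic sites only
    if not freqs:
        raise ValueError("no segregating sites")
    return sum(f < threshold for f in freqs) / len(freqs)


# Hypothetical new-allele frequencies for two classes of sites.
neutral_sites = [0.01, 0.03, 0.10, 0.25, 0.40, 0.60, 0.80, 0.02]
constrained_sites = [0.01, 0.01, 0.02, 0.03, 0.04, 0.08, 0.20, 0.02]
print(fraction_rare(neutral_sites))      # 0.375
print(fraction_rare(constrained_sites))  # 0.75
```

A class of sites under stronger negative selection is expected to show the higher value, as in the second list.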
Allele frequency spectrum depends on…
• Demographic history
• Biased gene conversion
• Background selection (if associated with inefficient negative selection)
• Hitchhiking
• Selection (negative or positive)
Relationships between K, π, F0.05 and s
• Conservation (a very low fixation rate of new mutations) does not necessarily mean high selection coefficients
• Conservation is not enough to estimate the level of selective pressure
[Figure: π(s) / π(s=0), Kh(s) / Kh(s=0), Km(s) / Km(s=0) and F0.05(s)]
Genome analysis
Km, Kh, F0.05
conserved protein coding
conserved intronic
conserved intergenic region
>90% in 50 nt window
Species: mouse, rat, chimpanzee, human
Divergence in the mouse lineage (Km) in conserved genomic regions
[Bar chart, 0 to 1: Neutral, Conserved protein-coding, Conserved intronic, Conserved intergenic]
Divergence in the human lineage (Kh) in conserved genomic regions
[Bar chart, 0 to 1: Neutral, Conserved protein-coding, Conserved intronic, Conserved intergenic]
Nucleotide diversity (π) in conserved genomic regions
[Bar chart, 0 to 1: Neutral, Conserved protein-coding, Conserved intronic, Conserved intergenic]
Selective pressure in conserved genomic regions: theory
Fraction of rare new alleles (F0.05) in conserved genomic regions
[Bar chart, 0% to 25%: 4-fold degenerate sites in protein coding (neutral standard), Nondegenerate sites in protein coding, Conserved intronic]
Selective constraints in conserved genomic regions
• A genome-wide relaxation of selective constraint in the primate lineage
• This relaxation most likely resulted from a smaller
human effective population size
• Relaxation is much more profound in conserved
non-coding regions than in protein-coding regions
• Mutations at a large proportion of sites in conserved
non-coding regions are associated with very small
fitness effect
Distribution of selective coefficients
[Panels: Protein coding, Introns, Intergenic]
Evolution of non-coding regions
What can explain this staggering enrichment in sites at the borderline of neutrality in conserved non-coding regions?
...synergistic epistasis!
Model of non-coding regions evolution
• Non-coding region “is trying” to maintain an overall similarity to some optimal sequence
• All nucleotides are of equal importance
• Mutations on average tend to “move away”
region from the optimal sequence
• With each new deleterious mutation the
remaining nucleotides become more
important
SPECULATIONS!
Non-coding regions evolution
New deleterious mutations are no longer fixed in the population
[Diagram labels: small selection coefficient (s), high selection coefficient (s), no fixation, neutral fixation rate]
Sequence evolution in the framework of birth-death processes
The recursive formula for an equilibrium distribution
Transition probabilities from state to state
fitness dependence on number of deleterious mutations
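The formulas from these slides were not preserved, so the following is only a sketch of how an equilibrium distribution can be obtained recursively for a birth-death process. It assumes that the state k is the number of deleterious mutations in a region of `length` two-state sites, that forward and back mutation rates per site are equal, that mutations fix one at a time, and that fixation rates follow the standard diffusion formula. The function names and the cost functions in the example are hypothetical.

```python
import math


def fixation_rate_ratio(big_s: float) -> float:
    """Fixation rate relative to neutral; big_s is scaled selection (negative = deleterious)."""
    if abs(big_s) < 1e-12:
        return 1.0
    if big_s < -700.0:
        return 0.0                        # exp would overflow; fixation is effectively impossible
    return big_s / (-math.expm1(-big_s))  # S / (1 - exp(-S))


def equilibrium_distribution(length, scaled_cost):
    """Stationary distribution over k = 0..length deleterious mutations.

    scaled_cost(k) >= 0 is the scaled selection against adding one more
    mutation to a sequence that already carries k of them.
    """
    weights = [1.0]
    for k in range(length):
        cost = scaled_cost(k)
        up = (length - k) * fixation_rate_ratio(-cost)   # transition k -> k + 1
        down = (k + 1) * fixation_rate_ratio(cost)       # transition k + 1 -> k
        weights.append(weights[-1] * up / down)          # detailed-balance recursion
    total = sum(weights)
    return [w / total for w in weights]


def mean_mutations(distribution):
    return sum(k * p for k, p in enumerate(distribution))


additive = equilibrium_distribution(20, lambda k: 0.5)               # constant cost
synergistic = equilibrium_distribution(20, lambda k: 0.5 + 0.5 * k)  # cost grows with k
print(mean_mutations(additive), mean_mutations(synergistic))
```

With a cost that grows with k, the equilibrium shifts toward fewer accumulated mutations even though the cost of the first mutation is the same in both cases.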
Model of non-coding regions evolution
This model predicts mutations of mostly small effect even in very conserved non-coding regions
Conserved non-coding region
Conserved coding
All nucleotides have small selection coefficient
Some nucleotides have very high selection coefficient
Questions:
• How many nucleotides in the human genome are selectively constrained?
• How are selectively constrained nucleotides distributed along the genome?
• How strong is selective pressure in non-coding regions?
• Do comparative and functional genomics data agree?
[Diagram labels: Chimp, Dog, Mouse, Rat, 4GCBs, Humans]
Model
• All non-4GCBs are neutral (this is the most conservative assumption)
• 4GCBs are a mixture of neutral and functional sites
• All functional 4GCBs are associated with the same selection coefficient (this is the most conservative assumption)
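Fraction of rare neutral alleles

\[
F_{1\%}^{neutral} = \frac{\int_0^1 \frac{1}{x}\left[ m x (1-x)^{m-1} + \frac{m(m-1)}{2} x^2 (1-x)^{m-2} \right] dx}{\int_0^1 \frac{1}{x}\left[ 1 - x^m - (1-x)^m \right] dx}
\]

\[
F_{1\%}^{neutral} = \frac{3}{2 \sum_{i=1}^{m-1} \frac{1}{i}}
\]

Mixture of neutral and functional sites

\[
F_{1\%}^{mixture} = \frac{n_{1\%}^{functional} + n_{1\%}^{neutral}}{n^{functional} + n^{neutral}}
\]

\[
n_{1\%}^{functional} = \int_0^1 \frac{e^{-2N_e s(1-x)} - 1}{x(1-x)\left(e^{-2N_e s} - 1\right)} \left[ m x (1-x)^{m-1} + \frac{m(m-1)}{2} x^2 (1-x)^{m-2} \right] dx
\]

\[
n^{functional} = \int_0^1 \frac{e^{-2N_e s(1-x)} - 1}{x(1-x)\left(e^{-2N_e s} - 1\right)} \left[ 1 - x^m - (1-x)^m \right] dx
\]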
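The rare-allele fraction for a single class of sites can be evaluated numerically, as in the sketch below. It assumes that the frequency density of a new allele is proportional to (e^{-S(1-x)} - 1) / (x(1-x)(e^{-S} - 1)) with S = 2·Ne·s (negative S for deleterious mutations, 1/x in the neutral limit), and that "rare" means the new allele is seen once or twice among m sampled chromosomes. The sample size m = 100, the value S = -5 and the midpoint integration are illustrative choices.

```python
import math


def spectrum_density(x, big_s):
    """Unnormalised frequency density of a new allele; big_s = 2*Ne*s, 0 = neutral."""
    if big_s == 0.0:
        return 1.0 / x
    return math.expm1(-big_s * (1.0 - x)) / (x * (1.0 - x) * math.expm1(-big_s))


def integrate(func, steps=20_000):
    """Midpoint rule on (0, 1); midpoints never touch the endpoints x = 0 and x = 1."""
    h = 1.0 / steps
    return h * sum(func((i + 0.5) * h) for i in range(steps))


def rare_fraction(m, big_s):
    """Expected fraction of segregating sites with the new allele seen once or twice in m chromosomes."""
    def rare(x):
        once = m * x * (1.0 - x) ** (m - 1)
        twice = 0.5 * m * (m - 1) * x ** 2 * (1.0 - x) ** (m - 2)
        return spectrum_density(x, big_s) * (once + twice)

    def segregating(x):
        return spectrum_density(x, big_s) * (1.0 - x ** m - (1.0 - x) ** m)

    return integrate(rare) / integrate(segregating)


m = 100
harmonic = sum(1.0 / i for i in range(1, m))
print(rare_fraction(m, 0.0), 1.5 / harmonic)  # numerical versus closed-form neutral value
print(rare_fraction(m, -5.0))                 # deleterious sites: more rare alleles
```

The neutral case provides a check on the integration, since it should reproduce 3 / (2 Σ 1/i) with the sum running from i = 1 to m - 1.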
Direct simulation
1001001100111101010010010111010100001111001100011100010111001
How many functional sites are
needed to produce observed allele
frequency shift?
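One way to phrase this question is to invert the mixture formula for the fraction of functional sites. The sketch assumes that a fraction p of sites is functional and 1 - p is neutral, and that the expected numbers of rare and of all segregating sites per site are known for each class. The parameter names and the numbers in the example are hypothetical.

```python
def mixture_rare_fraction(p, rare_f, total_f, rare_n, total_n):
    """Rare-allele fraction for a mixture with a fraction p of functional sites.

    rare_* and total_* are expected numbers of rare and of all segregating
    sites per site, for functional (f) and neutral (n) sites.
    """
    return (p * rare_f + (1 - p) * rare_n) / (p * total_f + (1 - p) * total_n)


def functional_fraction(f_observed, rare_f, total_f, rare_n, total_n):
    """Solve mixture_rare_fraction(p, ...) == f_observed for p."""
    numerator = rare_n - f_observed * total_n
    denominator = f_observed * (total_f - total_n) - (rare_f - rare_n)
    if denominator == 0:
        raise ValueError("p cannot be identified from these inputs")
    return numerator / denominator


# Hypothetical per-site expectations: neutral sites have a rare fraction of 0.3,
# functional sites have a rare fraction of 0.6 and fewer segregating sites overall.
observed = mixture_rare_fraction(0.2, rare_f=1.2, total_f=2.0, rare_n=1.5, total_n=5.0)
print(observed)
print(functional_fraction(observed, rare_f=1.2, total_f=2.0, rare_n=1.5, total_n=5.0))
```

Because functional sites contribute fewer segregating sites in total, a given shift in the observed fraction requires a larger p than a simple average of the two class values would suggest.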
Polymorphism to divergence ratio
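\[
\pi = 4N_e\mu \, \frac{2N_e s + e^{-2N_e s} - 1}{N_e s \left(1 - e^{-2N_e s}\right)} + 4N_e\mu
\]

\[
D = 2N_e\mu t \, \frac{1 - e^{-2N_e s}}{1 - e^{-4N_e s}} + t\mu
\]

\[
R = \frac{\dfrac{2N_e s + e^{-2N_e s} - 1}{N_e s \left(1 - e^{-2N_e s}\right)} + [\ldots]}{2N_e \, \dfrac{1 - e^{-2N_e s}}{1 - e^{-4N_e s}} + [\ldots]}
\]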
Selective constraints in non-coding regions of the genome
• Selectively constrained bases are diffusely distributed along the genome rather than condensed to highly conserved regions
• At least ~20% of 4GCBs are selectively constrained (2% of
the genome sequence)
• Probably additional constrained positions in non-alignable
regions
Questions:
• How many nucleotides in the human genome are selectively constrained?
• How are selectively constrained nucleotides distributed along the genome?
• How strong is selective pressure in non-coding regions?
• Do comparative and functional genomics data agree?
Regions selected for the ENCODE
project have 22 mammalian species
sequenced
… and a lot of functional genomics data
AGC ACC AGC
AGC ACC
AGC
Human Chimp Baboon
Mutation rates are modeled as
asymmetric and context
specific.
The model incorporates
insertions and deletions
Instantaneous rate matrix of transitions Q
P(t) = e^{Qt}
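The relation between the rate matrix Q and the transition probabilities P(t) can be illustrated with a small matrix exponential. The sketch uses a symmetric four-state matrix purely as a stand-in, since the model on the slide is asymmetric, context-specific and includes insertions and deletions. The Taylor series with scaling and squaring is one common way to approximate the exponential and is not claimed to be the method used by the authors.

```python
import math


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_probabilities(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    step = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, step)]  # step^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling: exp(A) = exp(A / 2^k)^(2^k)
    return result


# Stand-in rate matrix: every substitution occurs at rate alpha (hypothetical value).
alpha = 0.1
q = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
p = transition_probabilities(q, t=2.0)
print(p[0][0], 0.25 + 0.75 * math.exp(-4 * alpha * 2.0))  # numerical versus closed form
```

For this symmetric stand-in the closed-form diagonal entry 1/4 + 3/4·e^{-4αt} gives an independent check on the numerical result.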
SCONE (Sequence CONservation Evaluation)
• Ignores mutation rate heterogeneity along the genome
• Assumes uniformity between species
• Computes Bayesian estimate of evolutionary rate at the site
• Computes p-value via simulations
SCONE vs. ENCODE SNPs
Conservation of functional features
Clustering of conserved positions
Conservation of functional feature
• Conservation of most of ENCODE functional elements is due to a small number of positions.
• Most of conserved positions are outside of long conserved
elements
• Conserved positions tend to cluster along the sequence
Acknowledgments
The lab: Saurabh Asthana, Gregory Kryukov, Steffen Schmidt
University of Washington: John Stamatoyannopoulos, William Noble