Comparative Genomics

Page 1: TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT

TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATATTCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCAGAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTTCCCTGTTTCCAGGTTTGTTGTCCCAAAATAGTGACCATTTCATATGTATA

Comparative Genomics

Page 2

Overview

I. Comparing genome sequences
• Concepts and terminology
• Methods
  - Whole-genome alignments
  - Quantifying evolutionary conservation (PhastCons, PhyloP)
  - Identifying conserved elements
• Available datasets at UCSC

II. Comparative analyses of function
• Evolutionary dynamics of gene regulation
• Case studies
• Insights into regulatory variation within and across species

Page 3

Distribution of evolutionary constraint in the human genome

Lindblad-Toh et al. Nature 478:476 (2011)

• 4.2% of the genome is putatively constrained
• ~1 million putative regulatory elements

Page 4

Goals of comparative genomics

• Infer the course of past evolution using statistical models of sequence evolution
• Identify sequence elements evolving more slowly or more rapidly than neutral
• Evaluate the precise degree of constraint on specific positions
• Predict the functional effects of nucleotide or amino acid mutations in constrained sequences

Page 5

Vertebrate genomes available for comparative studies

[Figure: vertebrate genomes grouped by clade; labels: Primates, Mammals, Tetrapods, Vertebrates]

Page 6

Commonly used (and misused) terms

Mutation vs. Substitution
• Mutations occur in individuals and segregate in populations
• Substitutions are mutations that have become fixed
• Mutations = within species; substitutions = between species

Conservation vs. Constraint
• Conservation = an observation of sequence similarity
• Constraint = a hypothesis about the effect of purifying selection

Homology, Orthology and Paralogy
• Homologous sequences = derived from a common ancestor
• Orthologous sequences = homologous sequences separated by a speciation event (e.g., human HOXA and mouse Hoxa)
• Paralogous sequences = homologous sequences separated by gene duplication (e.g., human HOXA and human HOXB)
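The ortholog/paralog definitions above can be turned into a simple rule on a gene tree: two genes are orthologs if their last common ancestor is a speciation node, and paralogs if it is a duplication node. The sketch below assumes a toy tree whose topology, leaf names, and event labels are invented for illustration; `GENE_TREE`, `leaves`, and `relation` are hypothetical names, not part of any tool mentioned in the slides.

```python
# Toy gene tree. Internal nodes are (event, left, right); leaves are strings.
# The topology and event labels are illustrative assumptions.
GENE_TREE = (
    "duplication",
    ("speciation", "human_HMGCS1", "mouse_Hmgcs1"),
    ("speciation", "human_HMGCS2", "mouse_Hmgcs2"),
)


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relation(node, a, b):
    """Classify two genes by the event at their last common ancestor."""
    if isinstance(node, str) or a == b or not {a, b} <= leaves(node):
        raise ValueError("a and b must be two distinct leaves under this node")
    event, left, right = node
    # Descend while both genes still sit in the same subtree.
    for child in (left, right):
        if {a, b} <= leaves(child):
            return relation(child, a, b)
    # The genes split at this node, so this node is their last common ancestor.
    return "orthologs" if event == "speciation" else "paralogs"


print(relation(GENE_TREE, "human_HMGCS1", "mouse_Hmgcs1"))
print(relation(GENE_TREE, "human_HMGCS1", "human_HMGCS2"))
```

Under this rule a human gene and a mouse gene can still be paralogs if their lineages separated at a duplication that predates the speciation.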

Page 7

Basic premises in comparative sequence analysis

Most mutations that affect function are eliminated by purifying selection
• Constrained elements have lower substitution rates than expected from the neutral rate
• Contingent on the effect of the mutation and degree of constraint on the function
• Manifests as sequence conservation, even among distant species

Beneficial mutations may be driven to fixation by positive selection
• May be detected as “faster-than-neutral” substitution rate
• Expected to be rare

Most sequence differences among genomes are neutral
• Involve substitutions with minimal or no functional impact
• Fixed by random genetic drift
• Fixation rate is equal to mutation rate
• Genomes become more dissimilar with greater phylogenetic distance
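The claim that the neutral fixation rate equals the mutation rate follows from a standard population-genetics argument, sketched here for a diploid population. The symbols $N$ (population size), $\mu$ (neutral mutation rate per site per generation), and $k$ (substitution rate) are introduced for illustration and do not appear in the slides.

```latex
% Each generation, 2N\mu new neutral mutations arise at a site,
% and each one drifts to fixation with probability 1/(2N).
\begin{align}
k &= \underbrace{2N\mu}_{\text{new neutral mutations per generation}}
     \times
     \underbrace{\frac{1}{2N}}_{\text{fixation probability of each}}
   = \mu
\end{align}
```

The population size cancels, which is why neutral divergence can be treated as accumulating at a rate set by mutation alone.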

Page 8

Phylogenies

Phylogenetic trees show two things:
• Evolutionary relationships among species or sequences: branching order
• Evolutionary distance (e.g., degree of similarity or divergence): branch length

[Figure: example tree; labels: Internal node, Terminal node, Branch]
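To make the two properties of a tree concrete, here is a minimal sketch that stores branching order as child-to-parent links and evolutionary distance as branch lengths, then sums branch lengths along the path between two terminal nodes. The tree, its node names, and its branch lengths are made-up values for illustration; `PARENT`, `path_to_root`, and `patristic_distance` are hypothetical names.

```python
# child -> (parent, branch length). The root has no entry.
# Topology and branch lengths are invented for illustration.
PARENT = {
    "human": ("primate_anc", 0.01),
    "chimp": ("primate_anc", 0.01),
    "primate_anc": ("root", 0.08),
    "mouse": ("root", 0.25),
}


def path_to_root(node):
    """Map each ancestor of `node` (and node itself) to its distance from node."""
    distances = {node: 0.0}
    total = 0.0
    while node in PARENT:
        parent, length = PARENT[node]
        total += length
        distances[parent] = total
        node = parent
    return distances


def patristic_distance(a, b):
    """Sum of branch lengths on the path between two nodes."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    shared = set(up_a) & set(up_b)
    # The last common ancestor is the shared node with the shortest combined path.
    return min(up_a[n] + up_b[n] for n in shared)


print(patristic_distance("human", "chimp"))
print(patristic_distance("human", "mouse"))
```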

Page 9

Phylogenies

[Figure: species tree and gene tree]

Page 10

Orthologs and paralogs in gene trees

Capra et al. 2013

[Figure: gene tree; labels: HMGCS1, HMGCS2]

Page 11

Orthologs and paralogs in gene trees

Capra et al. 2013

[Figure: gene tree; labels: Orthologs, Orthologs, Paralogs, Duplication]

Page 12

Orthologs and paralogs in gene trees

Capra et al. 2013

[Figure: gene tree; labels: 1:1 Orthologs, 1:1 Orthologs, Human HMGCS1, Human HMGCS2, 1:2]

Pages 13-15

Ortholog assignments at Ensembl

Page 16

Steps in sequence comparisons

Sequence alignment
• Global vs. local
• Whole-genome vs. genome segments (e.g., genes)
• Identify sites that are homologous (not necessarily identical)

Measure similarity and divergence of sequences
• Sequence similarity: level of conservation
• Rates of change among sequences: divergence

Infer degree of evolutionary constraint
• Are the sequences more conserved than expected from neutral evolution?
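The "measure similarity and divergence" step can be illustrated on two already-aligned sequences. The sketch below computes the proportion of differing sites (p-distance) and then applies the Jukes-Cantor correction for multiple substitutions at the same site. The choice of the Jukes-Cantor model is an assumption made for illustration, since the slides do not name a model here, and the function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (x, y)
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x != "-" and y != "-"  # skip columns with a gap in either sequence
    ]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Expected substitutions per site under Jukes-Cantor, given p-distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGT", "ACGTACGA")
print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as the raw p-distance, because some observed identities hide repeated or reverted changes.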

Page 17

Rates of sequence change are estimated using models of the substitution process

Transition probabilities:
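The formula shown on this slide did not survive extraction. As one illustration of what transition probabilities look like, the Jukes-Cantor model (chosen here as an assumption, not necessarily the model on the slide) gives the probability that base $i$ is replaced by base $j$ after time $t$, where $\alpha$ is the rate of change to each specific other base.

```latex
% Jukes-Cantor transition probabilities (illustrative model choice).
% alpha: rate of change to each specific other base; t: elapsed time.
\begin{align}
P_{ii}(t) &= \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t} \\
P_{ij}(t) &= \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t}, \qquad i \neq j
\end{align}
```

The four probabilities for a given starting base sum to 1, and all of them approach 1/4 as $t$ grows.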

Page 18

Substitution rates are calculated for each lineage in a sequence phylogeny

[Figure: phylogeny]

Page 19

Conserved sequences identified by local reductions in substitution rate

[Figure: two panels with x-axis "aligned position"; label: local neutral]
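The idea of a local reduction in substitution rate can be sketched with a sliding window over per-position rate estimates. This is a deliberate simplification for illustration, not the method used by PhastCons; the function name, the window size, and the `max_ratio` threshold are all hypothetical choices.

```python
def conserved_windows(rates, neutral_rate, window=5, max_ratio=0.5):
    """Return half-open (start, end) windows whose mean rate is well below neutral.

    rates: per-position substitution rate estimates along an alignment
    neutral_rate: expected rate under neutral evolution
    max_ratio: a window is reported if mean rate <= max_ratio * neutral_rate
    """
    hits = []
    for start in range(len(rates) - window + 1):
        mean_rate = sum(rates[start:start + window]) / window
        if mean_rate <= max_ratio * neutral_rate:
            hits.append((start, start + window))
    return hits


# Toy rates: a slowly evolving stretch flanked by neutrally evolving sites.
toy_rates = [1.0, 1.0, 0.1, 0.1, 0.1, 1.0, 1.0]
print(conserved_windows(toy_rates, neutral_rate=1.0, window=3))
```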

Page 20

Tools for quantifying evolutionary conservation across genomes

Alignment: Multiz
• Generates multiple species alignment relative to a base genome
• Constructed from pairwise alignment of individual genomes to reference
• 46-way and 100-way alignment to hg19; 30-way to mm9; 60-way to mm10
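Multiz alignments are distributed by UCSC in MAF format, in which each alignment block lists one "s" line per species, with the reference genome first. A minimal parsing sketch follows; the block below is toy data with invented coordinates and bases, and `parse_maf_block` is a hypothetical helper, not part of any UCSC tool.

```python
# Toy MAF block. Coordinates and bases are invented for illustration.
MAF_BLOCK = """\
a score=0
s hg19.chr1 100 8 + 249250621 ACGTACGT
s mm9.chr4 500 7 + 155630120 ACGT-CGT
"""


def parse_maf_block(text):
    """Parse the 's' lines of one MAF block into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "s":
            continue  # skip the 'a' header line and any blank lines
        _, src, start, size, strand, src_size, aligned = fields
        assembly, _, sequence_name = src.partition(".")
        rows.append({
            "assembly": assembly,        # e.g. hg19
            "sequence": sequence_name,   # e.g. chr1
            "start": int(start),         # 0-based start on the given strand
            "size": int(size),           # number of non-gap bases
            "strand": strand,
            "src_size": int(src_size),   # length of the source sequence
            "text": aligned,             # aligned bases, with '-' for gaps
        })
    return rows


for row in parse_maf_block(MAF_BLOCK):
    print(row["assembly"], row["sequence"], row["start"], row["text"])
```

All rows in a block have the same aligned length, so column i of each row refers to the same homologous position relative to the base genome.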

Page 21

100-way Multiz alignment in hg19

Green = level of sequence similarity at each site

Page 22

Conservation of synteny: “net” alignments

• Conservation of genome segments
• Order and orientation of genes and regulatory sequences

Page 23

Conservation of synteny: “net” alignments

• Synteny is frequently conserved on megabase scales

Page 24

Tools for quantifying evolutionary conservation across genomes

PhastCons
• Estimates the probability that a nucleotide belongs to a conserved element
• Sensitive to ‘runs’ of conserved sites; effective for identifying conserved blocks
• For hg19, elements are calculated at three phylogenetic scopes (Vertebrate, Placental Mammal, Primate)

PhyloP
• Measures conservation independently at individual positions
• Provides per-base conservation scores (−log p value under hypothesis of neutrality)
• Positive scores suggest constraint; negative scores suggest accelerated evolution

Alignment: Multiz
• Generates multiple species alignment relative to a base genome
• Constructed from pairwise alignment of individual genomes to reference
• 46-way and 100-way alignment to hg19; 30-way to mm9; 60-way to mm10

Page 25

Identifying conserved elements: PhastCons

[Figure: tracks labeled PhastCons scores and PhastCons elements; example element with lod: 882, Score: 694]

• lod score: log probability under conserved model − log probability under neutral model
• Score: normalized lod score on 0-1000 scale

Use scores to rank elements by estimated constraint
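The lod definition above can be illustrated with a toy calculation. The sketch assumes per-site likelihoods are already available under each model and treats sites as independent, which is a simplification made here for illustration; the natural logarithm and the function name `lod_score` are also assumptions. The slides do not give the 0-1000 normalization, so it is not reproduced.

```python
import math


def lod_score(conserved_likelihoods, neutral_likelihoods):
    """Log-odds of an element: log P(data | conserved) - log P(data | neutral).

    Each argument holds one likelihood per alignment column in the element.
    Sites are treated as independent, so log-likelihoods simply add.
    """
    if len(conserved_likelihoods) != len(neutral_likelihoods):
        raise ValueError("need one likelihood per site under each model")
    log_conserved = sum(math.log(x) for x in conserved_likelihoods)
    log_neutral = sum(math.log(x) for x in neutral_likelihoods)
    return log_conserved - log_neutral


# Toy element of three sites that fit the conserved model better.
print(lod_score([0.5, 0.4, 0.5], [0.25, 0.2, 0.25]))
```

A positive lod means the conserved model explains the alignment better than the neutral model, and longer or more strongly conserved elements accumulate larger values.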

Page 26

PhastCons elements estimated at 3 phylogenetic scopes

[Figure: tracks labeled Primate, Placental, Vertebrate]

Page 27

Level of conservation decays with increasing evolutionary distance

Page 28

PhyloP: measuring basewise conservation

[Figure: track labeled PhyloP scores]

• Scores are calculated independently for each base
• Scores are −log P values under hypothesis of neutral evolution
• Positive scores = constraint
• Negative scores = acceleration
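The sign convention above can be shown with a short sketch that turns a p-value and a direction of departure from neutrality into a signed score. The use of log base 10 and the function name `signed_score` are assumptions made for illustration; the p-value itself would come from a test against the neutral model, which is not implemented here.

```python
import math


def signed_score(p_value, slower_than_neutral):
    """Signed -log10(p) score: positive for constraint, negative for acceleration."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    magnitude = -math.log10(p_value)
    return magnitude if slower_than_neutral else -magnitude


print(signed_score(0.001, slower_than_neutral=True))
print(signed_score(0.01, slower_than_neutral=False))
```

A site with no evidence against neutrality (p close to 1) scores near zero in either direction.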

Page 29

Per-site phyloP conservation scores

[Figure: example per-site scores: 4.49, 1.77, -0.96]

• Use PhastCons to identify conserved elements
• Use phyloP to evaluate individual sites within elements
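The two-step use of the tracks can be sketched as follows: take a conserved element as an interval, then look up the per-base scores inside it and rank the sites. The coordinates are hypothetical, the interval is assumed to be 0-based and half-open, and `sites_in_element` is an illustrative helper; the three score values reuse the example numbers shown on this slide.

```python
def sites_in_element(element, scores, min_score=0.0):
    """Return (position, score) pairs inside an element, highest score first.

    element: (start, end) interval, 0-based and half-open (an assumption)
    scores: dict mapping position -> per-base conservation score
    min_score: keep only sites scoring at least this value
    """
    start, end = element
    hits = [
        (pos, scores[pos])
        for pos in range(start, end)
        if pos in scores and scores[pos] >= min_score
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical positions paired with the example scores from the slide.
toy_scores = {100: 1.77, 101: 4.49, 102: -0.96, 200: 3.0}
print(sites_in_element((100, 103), toy_scores, min_score=1.0))
```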

Page 30

Accessing conservation data

Page 31

Multiple genome alignments and conservation metrics are calculated independently for each reference genome

Orthologous region in mouse:

30-way multiz alignment

Page 32

Conservation identifies critical binding sites in regulatory elements

[Figure: tracks labeled Regulatory info (ENCODE) and Conservation]

Important binding sites and variants that affect function will be here