What is Comparative Genomics? Insights gained through comparison of genomes from different species

Page 1: What is Comparative Genomics? Insights gained through comparison of genomes from different species

What is Comparative Genomics?

Insights gained through comparison of genomes from different species

Page 2: What is Comparative Genomics? Insights gained through comparison of genomes from different species

How did it all start?

• We needed some genomes to start comparing

  – Many Bacteria sequenced first

  – Model organisms: yeast, worm, fruit fly, thale cress

  – Finally, human

• Comparative genomics did not just happen

  – Enough data had to be accumulated

  – New computational methods had to be developed to meet the challenges of processing large amounts of data

  – “Informatics” techniques from applied math, computer science and statistics were adapted for biological sequences

Page 3: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Comparing sequenced genomes

• Comparison of genomic sequences from different species can help identify the following:

• Gene structure

• Gene function

• Interaction between gene products

• Non-coding RNAs

• Regulatory sequences

Page 4: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Evolution and sequence conservation

• Genome comparisons are based on a simple premise: conservation = functional importance

• If there are no constraints on a DNA sequence, random mutations will accumulate

• Over long evolutionary times (millions of years), these random mutations make two related sequences different

• Sequences from different genomes will be conserved if:

  – They code for proteins

  – They are important for regulation (protein binding)
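
The premise above can be illustrated with a small sketch. It assumes two sequences that have already been aligned (gaps written as "-") and reports percent identity, overall and in sliding windows. The function names, window size and threshold are illustrative choices, not something given in the slides.

```python
def percent_identity(a, b):
    """Percent of aligned, ungapped positions where the two sequences match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def conserved_windows(a, b, window=10, min_identity=90.0):
    """Start positions of alignment windows whose identity meets the threshold."""
    return [
        i
        for i in range(len(a) - window + 1)
        if percent_identity(a[i:i + window], b[i:i + window]) >= min_identity
    ]


if __name__ == "__main__":
    print(percent_identity("ACGTACGT", "ACGTTCGT"))
    print(conserved_windows("AAAACCCC", "AAAAGGGG", window=4, min_identity=100.0))
```

Windows that stay highly identical between distant species are candidates for constrained (coding or regulatory) sequence; the threshold that separates signal from background depends on the evolutionary distance of the genomes compared.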

Page 5: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Non-hypothesis-driven approach

• Hypothesis-driven approaches

  – Develop goals based on an available hypothesis

  – Design initial experiments (and backups if those fail)

  – When they yield results, go to NIH, NSF, DOE, or ONR for funding

• Non-hypothesis-driven approaches

  – Start with a general knowledge of the biological system

  – Collect large amounts of data (usually by high-throughput methods) and try extracting and/or amplifying signal from noisy data

  – Sometimes it works for reasons that are obvious

  – Sometimes it works for reasons that are NOT obvious

  – Sometimes it doesn’t work because the data are too noisy

  – Funding agencies are not likely to fund this kind of research

Page 6: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Finding DNA regulatory motifs (protein binding sites)

• Experimental approaches

  – Promoter trapping

  – DNA footprinting

  – In vitro binding site selection (SELEX)

• Computational approaches

  – Searching databases of known sites

  – Finding over-represented motifs in a group of sequences (Gibbs sampling, Expectation Maximization)

    – In promoters of homologous genes

    – In promoters of functionally linked genes

    – In promoters of interacting proteins

  – Ab initio methods

    – Positional conservation of (pseudo)palindromic DNA motifs

Page 7: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Finding motifs in promoters of homologous genes

• Perform an all-versus-all BLAST search of the proteomes

• Pool together promoters of related genes

• Find conserved motifs (Gibbs sampling, Expectation Maximization)

• Only DNA motifs in related genes can be identified
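
The pooling step can be sketched in a few lines. Note that this is NOT Gibbs sampling or Expectation Maximization: as a much simpler stand-in, it only lists exact k-mers that occur in a chosen fraction of the pooled promoters. The function name, the k-mer length and the example sequences are assumptions made for illustration.

```python
from collections import Counter


def shared_kmers(promoters, k=6, min_fraction=1.0):
    """Return k-mers found in at least min_fraction of the promoter sequences."""
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        # A set counts each k-mer at most once per promoter.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    needed = min_fraction * len(promoters)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)


if __name__ == "__main__":
    # Made-up promoters of three "related genes" sharing one hexamer.
    pooled = ["TTGGATTCAA", "ACGGATTCGT", "GGATTCTTTT"]
    print(shared_kmers(pooled, k=6))
```

Real binding sites are degenerate, which is why probabilistic methods such as Gibbs sampling or EM are used in practice; exact k-mer counting only catches perfectly conserved cores.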

Page 8: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Finding DNA motifs by positional conservation of palindromes

• The approach targets sites for dimeric proteins and is particularly suited for helix-turn-helix (HTH) proteins of Bacteria and Archaea

• HTH proteins bind as dimers, usually with variable sequence spacing

• Binding sites are palindromic with a poorly conserved middle

GGATTnnnAATCC GGATTnnAATCC GGATTnnAAGCC

• Starting from a complete set of promoter sequences, we find imperfect palindromes of variable length

• Remove biased sequences (A/T or G/C content > 80%)

• Search all-versus-all and identify similar motifs
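
The first two steps above can be sketched as follows. This is a minimal illustration, assuming fixed-length arms, a spacer of 0 to 4 bases and at most one mismatch between the arms; all parameter values and function names are assumptions, and only the 80% bias cutoff comes from the slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_palindromes(seq, arm=5, min_spacer=0, max_spacer=4, max_mismatch=1):
    """Return (start, left_arm, spacer_length, right_arm, mismatches) tuples.

    The right arm is compared with the reverse complement of the left arm,
    so a perfect inverted repeat such as GGATT...AATCC has zero mismatches.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm + 1):
        left = seq[start:start + arm]
        target = reverse_complement(left)
        for spacer in range(min_spacer, max_spacer + 1):
            r0 = start + arm + spacer
            right = seq[r0:r0 + arm]
            if len(right) < arm:  # ran past the end of the sequence
                break
            mismatches = sum(a != b for a, b in zip(target, right))
            if mismatches <= max_mismatch:
                hits.append((start, left, spacer, right, mismatches))
    return hits


def is_biased(site, threshold=0.8):
    """True if A/T or G/C content of the site exceeds the threshold."""
    at = sum(site.upper().count(base) for base in "AT") / len(site)
    return at > threshold or (1 - at) > threshold


if __name__ == "__main__":
    for hit in find_palindromes("GGATTCGTAATCC"):
        print(hit)
```

Allowing a mismatch also reports shifted, overlapping arms around a true site, so a real pipeline would need to merge overlapping hits before the all-versus-all comparison.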


Page 9: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Many potential binding sites are found ...

• The role of found motifs is difficult to predict

[Figure labels: RNA Pol K; ribosomal proteins; transposons; GTP-binding ATPase; sulfate metabolism; short hypothetical proteins]

Page 10: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Finding DNA motifs - the summary

• In promoters of homologous genes

  – Easy to perform and interpret results

  – Works only for proteins with sequence homology

• In promoters of interacting proteins

  – General approach, works even in the absence of sequence homology

  – Needs better coverage of interactions; high-throughput studies of species other than yeast will enable comparative analysis

• Ab initio methods

  – General approach, requires no prior knowledge

  – Complementary approaches (experimental or computational) are needed to link the found sites to their DNA-binding proteins

Page 11: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Evolution and sequence conservation

• Genome comparisons are based on a simple premise: conservation = functional importance

• If there are no constraints on a DNA sequence, random mutations will accumulate

• Over long evolutionary times (millions of years), these random mutations make two related sequences different

• Sequences from different genomes will be conserved if:

  – They code for proteins

  – They are important for regulation (protein binding)

• Comparative genomics is needed to identify conservation

Page 12: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Comparative genomics helps genome annotations

• In prokaryotes, finding genes is relatively easy based on open reading frames (ORFs)

• In eukaryotes, we have to look for ORFs, exons, introns, splice sites, polyA sites

  – Bad news: predicted exons sometimes do not exist

  – More bad news: pseudogenes

  – The bad news keeps coming: alternative splicing

• Good news: in different species, the genes normally have similar exon-intron structure
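
To show why the prokaryotic case is considered relatively easy, here is a minimal ORF scan. It is a sketch under simplifying assumptions: forward strand only, ATG as the only start codon, and a minimum length in codons chosen arbitrarily. Real gene finders also scan the reverse strand and use alternative starts and statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) index pairs of forward-strand ORFs, end exclusive.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon.
    min_codons counts codons from the ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAACCCGGGTAGTT"
    for begin, end in find_orfs(dna):
        print(begin, end, dna[begin:end])
```

In eukaryotes the same scan fails because introns interrupt the reading frame, which is where comparison with the exon-intron structure of a related species helps.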

Page 13: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Case 1: Cellular concentration of metabolite is too low to occupy the riboswitch binding site.

Transcription and …

[Figure: RNA polymerase and nascent RNA; regions labeled 1 to 4]

Courtesy of R. Breaker, Yale U.

Page 14: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Case 1: Cellular concentration of metabolite is too low to occupy the riboswitch binding site.

Transcription and intramolecular RNA folding continue.

[Figure: RNA polymerase and nascent RNA; regions labeled 1 to 4, followed by UUUUU and AUG]

Courtesy of R. Breaker, Yale U.

Page 15: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Case 1: Cellular concentration of metabolite is too low to occupy the riboswitch binding site.

Transcription and intramolecular RNA folding continue.

Translation is initiated.

Typically the new mRNA codes for a biosynthetic or transport protein that raises the intracellular level of the metabolite.

Gene regulation (next case) is accomplished by variations in the interactions of the regions highlighted in orange.

[Figure: ribosome on the mRNA; regions labeled 1 to 4, followed by UUUUU and AUG]

Courtesy of R. Breaker, Yale U.

Page 16: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Case 2: Cellular concentration of metabolite (X) is high.

Intramolecular folding can lead to an alternate conformation.

RNA polymerase produces the long untranslated leader region.

The alternate riboswitch conformation is stable when metabolite is bound.

[Figure: RNA polymerase on the DNA template with the nascent RNA; X marks metabolite molecules]

Courtesy of R. Breaker, Yale U.

Page 17: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Case 2: Cellular concentration of metabolite (X) is high.

Intramolecular folding can lead to an alternate conformation.

RNA polymerase produces the long untranslated leader region.

The alternate riboswitch conformation is stable when metabolite is bound.

Transcription continues.

[Figure: RNA polymerase and nascent RNA; regions labeled 1 to 4, followed by UUUUU; X marks metabolite molecules]

Courtesy of R. Breaker, Yale U.

Page 18: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Case 2: Cellular concentration of metabolite (X) is high.

Transcription continues.

Now, RNA folding leads to formation of an intrinsic terminator.

[Figure: RNA polymerase and nascent RNA; regions labeled 1 to 4, followed by UUUUU; X marks metabolite molecules]

Courtesy of R. Breaker, Yale U.

Page 19: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Case 2: Cellular concentration of metabolite (X) is high.

Transcription continues.

Now, RNA folding leads to formation of an intrinsic terminator.

The transcript is never completed and the metabolite biosynthetic or transport protein is not produced.

[Figure: RNA polymerase and nascent RNA; regions labeled 1 to 4, followed by UUUUU; X marks metabolite molecules]

Courtesy of R. Breaker, Yale U.

Page 20: What is Comparative Genomics? Insights gained through comparison of genomes from different species

What does this ncRNA bind?

Page 21: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Can we predict functions without a strict measure of significance (no sequence or structural similarity)?

This is done by a machine-trained (objective), jury-like system using inference

Page 22: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Comparative genomics predicts protein interactions (Rosetta Stone)

• In yeast, topoisomerase II has two domains that correspond to gyrases A and B

• Sequence comparisons show that these two domains are individual proteins in E. coli

• The implication is that these two proteins interact, and that their fusion was favored during evolution
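
The Rosetta Stone logic can be sketched in a few lines. The sketch assumes that sequence alignments have already been computed and are given as (query protein, target protein, start, end) tuples, with coordinates on the target. Two different query proteins that align to non-overlapping parts of the same target are reported as a candidate interacting pair. The tuple layout, the function name and all coordinates in the example are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def rosetta_stone_pairs(alignments, max_overlap=0):
    """Return a sorted list of (protein_a, protein_b, fused_protein) candidates."""
    by_target = defaultdict(list)
    for query, target, start, end in alignments:
        by_target[target].append((query, start, end))

    pairs = set()
    for target, hits in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(hits, 2):
            if q1 == q2:
                continue
            # Negative or zero overlap means the two hits cover separate regions.
            overlap = min(e1, e2) - max(s1, s2)
            if overlap <= max_overlap:
                a, b = sorted((q1, q2))
                pairs.add((a, b, target))
    return sorted(pairs)


if __name__ == "__main__":
    # Coordinates below are invented for illustration only.
    example = [
        ("gyrB", "TOP2", 1, 650),
        ("gyrA", "TOP2", 700, 1200),
        ("recA", "RAD51", 1, 300),
    ]
    print(rosetta_stone_pairs(example))
```

A single-domain match such as the recA line produces no pair, because an interaction is only inferred when two separate proteins map onto one fused protein.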

Page 23: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Predicting protein function by genome context

Page 24: What is Comparative Genomics? Insights gained through comparison of genomes from different species

What does gene colinearity mean?

[Figure: gene-order diagram; labels: Krr1/Rrp20, Rio1/Rio2, Tif11, Spo11]

Page 25: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Not much, unless supported by phylogeny and function

Page 26: What is Comparative Genomics? Insights gained through comparison of genomes from different species

The case of Fibrillarin/Nop56 colinearity

Page 27: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Fibrillarin and Nop56 DO interact
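
Colinearity of this kind can be screened for with a simple count of how many genomes keep two genes next to each other. The sketch below assumes each genome is given as an ordered list of ortholog-group names; the species names and gene orders in the example are invented, and only the Fibrillarin/Nop56 pairing comes from the slides. As the earlier slide cautions, such a count means little unless supported by phylogeny and function.

```python
from collections import Counter


def conserved_neighbors(genomes, min_genomes=2):
    """Map each adjacent gene pair to the number of genomes in which it is adjacent.

    genomes: dict of genome name -> list of ortholog-group names in chromosomal order.
    Only pairs adjacent in at least min_genomes genomes are returned.
    """
    counts = Counter()
    for order in genomes.values():
        seen = set()
        for a, b in zip(order, order[1:]):
            if a != b:
                seen.add(frozenset((a, b)))  # orientation-independent pair
        counts.update(seen)  # count each pair once per genome
    return {tuple(sorted(pair)): n for pair, n in counts.items() if n >= min_genomes}


if __name__ == "__main__":
    # Hypothetical gene orders in three hypothetical species.
    orders = {
        "species1": ["fibrillarin", "nop56", "geneX"],
        "species2": ["geneY", "nop56", "fibrillarin"],
        "species3": ["fibrillarin", "geneZ", "nop56"],
    }
    print(conserved_neighbors(orders))
```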

Page 28: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Functional clues for hypothetical proteins based on genomic context analysis

Page 29: What is Comparative Genomics? Insights gained through comparison of genomes from different species

High-throughput approaches

• Had to be developed quickly to match the speed of genome sequencing

• As a general rule, most experimental approaches can be adapted for high-throughput

  – Protein interactions (two-hybrid, TAP)

  – Protein localization

  – Gene regulation (microarray)

  – Structure determination (more recent, still gaining speed)

Page 30: What is Comparative Genomics? Insights gained through comparison of genomes from different species

What is a high-throughput experiment?

• Usually done at the level of the whole organism (whole genome) under different conditions

• HT experiments are aided by:

  – Equipment miniaturization

  – Robotics

  – Other automated procedures

• In almost all instances, heavy data analysis and processing are required

Page 31: What is Comparative Genomics? Insights gained through comparison of genomes from different species

General properties of HT experiments

• Collect large amounts of data under many different conditions

  – Err on the side of collecting too much data, disk storage is cheap

• Process raw data (computers)

• Analyze data (computers)

• Integrate data from various sources (computers)

• Identify patterns and cluster the results based on similarity (computers)
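
The last step, clustering by similarity, can be sketched as follows. The example assumes each gene has a numeric profile measured across the same conditions, uses Pearson correlation as the similarity measure, and groups genes by single linkage at a fixed threshold. The measure, the threshold and the profile values are illustrative assumptions; the slides do not name a specific clustering method.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_by_similarity(profiles, threshold=0.9):
    """Single-linkage grouping: join genes whose correlation meets the threshold."""
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Made-up expression profiles over four conditions.
    data = {
        "g1": [1, 2, 3, 4],
        "g2": [2, 4, 6, 8],
        "g3": [4, 3, 2, 1],
        "g4": [8, 6, 4, 2],
    }
    print(cluster_by_similarity(data))
```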

Page 32: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Integrating heterogeneous data to predict protein interactions

Page 33: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Analysis of different data types is usually based on Bayesian inference

Example: protein interactions

● Proteins are more likely to interact if they are co-expressed

● Proteins are more likely to interact if they are co-localized in the cell

● Proteins are more likely to interact if they are co-localized in the genome

● Proteins are more likely to interact if they are parts of the same cellular process
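
One common way to combine such evidence is a naive Bayes update: posterior odds = prior odds × the product of the likelihood ratios of the observed evidence, assuming the evidence types are independent. The sketch below follows that scheme. The prior and every likelihood ratio in the example are invented numbers, and, as a simplification, evidence that is absent is ignored rather than counted against the interaction.

```python
def posterior_interaction_probability(prior, likelihood_ratios, evidence):
    """Combine independent evidence into a posterior probability of interaction.

    prior: prior probability that a random protein pair interacts (0 < prior < 1).
    likelihood_ratios: evidence name -> P(evidence | interact) / P(evidence | not).
    evidence: evidence name -> True if observed for this protein pair.
    """
    odds = prior / (1.0 - prior)
    for name, observed in evidence.items():
        if observed:
            odds *= likelihood_ratios[name]
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical likelihood ratios, for illustration only.
    ratios = {
        "co_expressed": 5.0,
        "same_compartment": 3.0,
        "genome_neighbors": 4.0,
        "same_process": 10.0,
    }
    pair_evidence = {
        "co_expressed": True,
        "same_compartment": False,
        "genome_neighbors": False,
        "same_process": True,
    }
    print(posterior_interaction_probability(0.01, ratios, pair_evidence))
```

With these made-up numbers, two supporting observations raise a 1% prior to roughly one in three, which shows how individually weak evidence types can add up when integrated.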

Page 34: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Predicting large protein complexes from individual parts

Page 35: What is Comparative Genomics? Insights gained through comparison of genomes from different species

Beware of erroneous annotations