Comparative genomics: Overview & Tools Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. [email protected]


October 2005 © UKK, Bioinformatics Centre, University of Pune.

Genome sequence: Fact file

• 1995: The first complete genome sequence, that of Haemophilus influenzae Rd, was published
• Biological systems are dynamic and evolving
• The fourth dimension: Time
• A genome sequence is a snapshot of evolution
• The correlation between phenotypic properties and genomic regions is not straightforward, as phenotypic properties result from many-to-many interactions



Genomes: the current status

• Published complete genomes: 303

– Archaeal: 24

– Bacterial: 240

– Eukaryal: 39

• Completed Viral genomes: >5000

• Prokaryotic ongoing genomes: 755

• Eukaryotic ongoing genomes: 531

As of October 11, 2005


Genome databases

• Genomes at NCBI, EBI, TIGR



H. influenzae Complete Genome


Function information clock of E. coli

Generated in March 2004

http://columba.ebi.ac.uk:8765/gq/r2h?filename=/ebi/genequiz/2001/ecr0104/sum/ECR0104.function.rdb&nodetail=1


Genome analyses

• Variation in
– Genome size (E. coli: 4.6 Mbp; B. subtilis: 4.20 Mbp; M. pneumoniae: 0.81 Mbp)
– GC content (B. burgdorferi: 29%; M. tuberculosis: 68%)
– Codon usage
– Amino acid composition (G, A, P, R: GC rich; I, F, Y, M, D: AT rich)
– Genome organisation
  • Single circular chromosomes
  • Linear chromosome + extrachromosomal elements
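GC content, one of the properties compared above, is simply the share of G and C bases in a sequence. The following is a minimal sketch of that calculation; the function name and the short example fragment are assumptions made for illustration and are not taken from any of the genomes listed.

```python
def gc_content(sequence):
    """Return the fraction of G and C among the A/C/G/T bases of a sequence."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / counted


# Hypothetical fragment, used only to show the call.
example = "ATGCGCGATTTAGC"
print(f"GC content: {gc_content(example):.1%}")
```

Ambiguous symbols such as N are ignored in this sketch, so the result is a fraction of the unambiguous bases only.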


CG: Comparisons between genomes

• The strains of the same species
• The closely related species
• The distantly related species
– List of orthologs
– Evolution of individual genes
– Evolution of organisms


CG helps to ask some interesting questions

• Identifying similarities and differences between genomes may allow us to understand:
– How did two organisms evolve?
– Why do certain bacteria cause disease while others do not?
– Identification and prioritization of drug targets


CG: Unit of comparison

• Unit of comparison: Gene/Genome
– Number
– Content (sequence)
– Location (map position)
– Gene order
– Gene cluster (genes that are part of a known metabolic pathway are found to exist as a group)
– Colinearity of gene order is referred to as synteny
– A conserved group of genes in the same order in two genomes is called a syntenic group or syntenic cluster
– Translocation: movement of a genomic segment from one position to another
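To make the idea of a syntenic group concrete, the sketch below scans the gene order of one genome and reports runs of genes that are also adjacent, in the same order, in a second genome. It is an illustration only: it assumes each gene has at most one ortholog (represented here by a shared identifier), ignores strand and inverted blocks, and uses made-up gene names g1 to g7 and y.

```python
def syntenic_blocks(order_a, order_b, min_genes=2):
    """Return runs of genes that are adjacent and in the same order in both genomes."""
    position_b = {gene: idx for idx, gene in enumerate(order_b)}
    blocks, current = [], []

    def close_block():
        # Keep the open run only if it is long enough to count as a block.
        if len(current) >= min_genes:
            blocks.append(list(current))
        current.clear()

    for gene in order_a:
        if gene not in position_b:
            close_block()                      # no ortholog in genome B
        elif current and position_b[gene] == position_b[current[-1]] + 1:
            current.append(gene)               # extends the colinear run
        else:
            close_block()
            current.append(gene)               # starts a new run
    close_block()
    return blocks


# Hypothetical gene orders (identifiers shared between genomes mark orthologs).
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
genome_b = ["g7", "g1", "g2", "y", "g3", "g4", "g5"]
print(syntenic_blocks(genome_a, genome_b))
```

In this toy input, g7 has moved to the front of genome B (a translocation in the sense used above), so it is not part of any block.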


Comparison of the coding regions

• Begins with gene identification: an algorithm infers which portions of the genomic sequence code for genes.

• There are four basic approaches.


Knowledge of Full Genome sequence: Solutions or new questions…?

• Still struggling with the gene counters…

Correct # of genes…?


Structure of tryptophan operon

• Numbers: Gene number
• Arrows: Direction of transcription
• //: Dispersion of operon by 50 genes

[Figure: structure of the tryptophan operon. Labels: domain fusion (trpD and trpG; trpF and trpC); trpB and trpA genetically linked; separate genes. Source: Dandekar et al., 1998]


Important observations with regard to Gene Order

• Gene order is highly conserved in closely related species but is changed by rearrangements
• With greater evolutionary distance, there is no correspondence between the gene order of orthologous genes
• Groups of genes with similar biochemical function tend to remain localized
– Genes required for the synthesis of tryptophan (trp genes) in E. coli and other prokaryotes


Synteny

• Refers to regions of two genomes that show considerable similarity in terms of
– sequence and
– conservation of the order of genes
• Such regions are likely to be related by common descent.


COGs: Phylogenetic classification of proteins encoded in complete genomes


Genome analyses @ NCBI

Pairwise genome comparison of protein homologs (symmetrical best hits)

http://www.ncbi.nlm.nih.gov/sutils/geneplot.cgi
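A symmetrical best hit is a pair of proteins, one from each genome, where each is the other's top-scoring match. The sketch below shows that selection step on precomputed similarity scores. It is not the NCBI implementation: the input layout (a dictionary of scores per query), the protein names, and the score values are all assumptions made for illustration.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject.

    scores has the form {query: {subject: score}}; ties go to the first subject listed.
    """
    return {query: max(subjects, key=subjects.get)
            for query, subjects in scores.items() if subjects}


def symmetrical_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between proteins of genome A and genome B.
a_vs_b = {"A1": {"B1": 250, "B2": 90}, "A2": {"B2": 180}, "A3": {"B1": 120}}
b_vs_a = {"B1": {"A1": 245, "A3": 118}, "B2": {"A2": 175, "A1": 88}}
print(symmetrical_best_hits(a_vs_b, b_vs_a))
```

Here A3's best hit is B1, but B1's best hit is A1, so the A3/B1 pair is not reported.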






Integr8: CG site at EBI

http://www.ebi.ac.uk/integr8

http://www.ebi.ac.uk/integr8/OrganismSearch.do?action=searchBySuperregnum&superregnum=bacteria&pageContext=207






Comparative Genomics Tools

• BLAST2
• MUMmer
• Comparisons and analyses at both
– Nucleic acid and protein level
• Comparative genomics of Parasites @ TIGR
• Microbial Genome Database (MBGD) in Japan
• Comparative genome analysis in P. Bork's lab @ EMBL Heidelberg
• Comprehensive Microbial Resource page @ TIGR


Genome Alignment Algorithm: MUMmer

• Developed by
– Dr. Steven Salzberg's group at TIGR
– NAR (1999) 27:2369-2376
– NAR (2002) 30:2478-2483
• Availability
– Free
– TIGR site


Features of MUMmer

• The algorithm assumes that the sequences are closely related
• Can quickly compare millions of bases
• Outputs:
– Base-to-base alignment
– Highlights the exact matches and differences in the genomes
– Locates
  • SNPs
  • Large inserts
  • Significant repeats
  • Tandem repeats and reversals


Definitions are drawn from biology

• SNP: a single mutation surrounded by two matching regions
– Regions of DNA where the 2 sequences have diverged by more than one SNP
• Large inserts: regions inserted into one of the genomes
– Sequence reversals, lateral gene transfer
• Repeats: a form of duplication that has occurred in either genome
• Tandem repeats: regions of repeated DNA in immediate succession but with different copy numbers in different genomes
– A repeat can occur 2.5 times
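A fractional copy number such as 2.5 arises when the last copy of the repeat unit is incomplete. The sketch below counts how far a region continues to follow a given repeat unit, head to tail, and divides by the unit length. The function name and the example sequences are hypothetical; locating the repeat unit itself is outside the scope of this sketch.

```python
def tandem_copy_number(region, unit):
    """Return how many times `unit` repeats head to tail from the start of `region`.

    A partial final copy counts fractionally.
    """
    if not unit:
        raise ValueError("repeat unit must not be empty")
    matched = 0
    # Walk along the region while it keeps following the periodic pattern.
    while matched < len(region) and region[matched] == unit[matched % len(unit)]:
        matched += 1
    return matched / len(unit)


# Hypothetical regions from two genomes carrying the same 4-base repeat unit.
print(tandem_copy_number("ACGTACGTAC", "ACGT"))        # two full copies plus half a copy
print(tandem_copy_number("ACGTACGTACGTACGT", "ACGT"))  # four full copies
```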


Techniques used in the MUMmer Algorithm

• Compute suffix trees for every genome
• Longest Increasing Subsequence (LIS)
• Alignment using the Smith & Waterman algorithm

Integration of these techniques for genome alignment
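For readers unfamiliar with Smith & Waterman alignment, the sketch below computes the best local alignment score between two short sequences by dynamic programming. It is a simplified illustration, not MUMmer's code: the scoring values (match +2, mismatch -1, linear gap -2) are assumptions, and only the score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Hypothetical short sequences sharing the local match "ACG".
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Because the table is quadratic in sequence length, this kind of alignment is practical only on short stretches, such as the regions left between exact matches.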


MUMmer: Steps in the alignment process

• Read two genomes
• Find the Maximal Unique Matches (MUMs) of the genomes
• Sort and order the MUMs using LIS
• Close the gaps in the alignment using SNPs, mutation regions, repeats, tandem repeats
• Output alignment
– MUMs
– Regions that do not match exactly



MUMmer steps

• Locating MUMs

• Sorting MUMs

• Closure with gaps

G1: ACTGATTACGTGAACTGGATCCA

G2: ACTCTAGGTGAAGTGATCCA


Genome1: ACTGATTACGTGAACTGGATCCA

Genome2: ACTCTAGGTGAAGTGATCCA

ACTGATTACGTGAACTGGATCCA
ACTC--TAGGTGAAGT-GATCCA


What is a MUM?

• A MUM is a subsequence that occurs exactly once in both genomes and is NOT part of any longer such sequence
• The two characters that bound a MUM are always mismatches
• Principle: if a long matching sequence occurs exactly once in each genome, it is almost certainly part of the global alignment

GenA: tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta
GenB: gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag

Similar to BLAST & FASTA!!
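The definition of a MUM can be checked directly on the GenA/GenB example with a brute-force search. The sketch below is a naive quadratic scan written for clarity; it is not MUMmer's suffix-tree method, and the function names and the minimum length of 20 are assumptions chosen for this example.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=20):
    """Return (start_in_a, start_in_b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue                      # could be extended to the left
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1                   # extend to the right as far as possible
            if length < min_len:
                continue
            match = a[i:i + length]
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


gen_a = "tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta".upper()
gen_b = "gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag".upper()
for start_a, start_b, length in find_mums(gen_a, gen_b):
    print(start_a, start_b, gen_a[start_a:start_a + length])
```

The scan reports the capitalised stretch of the example, which is bounded by mismatches on both sides as the second bullet states.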


Sorting & ordering MUMs

• MUMs are sorted according to their position in Genome A
• The order of matching MUMs in Genome B is considered
• The LIS algorithm locates the longest set of MUMs that occur in ascending order in both genomes

[Figure labels: MUM5: transposition; MUM3: random match, inexact repeat]

Leads to Global MUM-alignment
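The ordering step can be sketched as a longest increasing subsequence over the Genome B positions of the MUMs, taken in Genome A order. The code below is a standard LIS routine using binary search, offered as an illustration rather than MUMmer's own variant; the seven Genome B positions are hypothetical, arranged so that the third MUM sits far out of place and the fifth has moved ahead of the others.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence of values."""
    tails = []        # tails[k]: smallest tail value of an increasing run of length k + 1
    tails_index = []  # index in values of that tail
    previous = [-1] * len(values)
    for i, value in enumerate(values):
        k = bisect_left(tails, value)
        if k > 0:
            previous[i] = tails_index[k - 1]
        if k == len(tails):
            tails.append(value)
            tails_index.append(i)
        else:
            tails[k] = value
            tails_index[k] = i
    # Walk the back-pointers from the end of the longest run.
    chain = []
    i = tails_index[-1] if tails_index else -1
    while i != -1:
        chain.append(i)
        i = previous[i]
    return chain[::-1]


# Hypothetical Genome B start positions of MUM1..MUM7, listed in Genome A order.
positions_in_b = [10, 20, 95, 30, 5, 50, 60]
kept = longest_increasing_subsequence(positions_in_b)
print(["MUM%d" % (index + 1) for index in kept])
```

With these positions the routine keeps MUM1, MUM2, MUM4, MUM6 and MUM7 and leaves out MUM3 and MUM5, which do not fit the ascending order.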


MUMmer Results

• 2 strains of M. tuberculosis
– H37Rv & CDC1551
– Genome size: 4 Mb
– Time: 55 s
  • Generating suffix tree: 5 s
  • Sorting MUMs: 45 s
  • S&W alignment: 5 s


Alignment of M. tuberculosis strains: CDC1551 (top) & H37Rv (bottom)

Single green lines indicate SNPs

Blue lines indicate insertions


Comparison of 2 Mycoplasma genomes: cousins that are distantly related

• M. genitalium: 580 074 nt
• M. pneumoniae: 816 394 nt (+236 320)
• Analysis of proteins tells us that all M.g. proteins are present in M.p.
• Alignment was carried out using
– FASTA (dividing each genome into 1000 bp segments)
– All-against-all searches
– Fixed length of pattern (25)
– MUMmer (length = 25)


Comparison of 2 Mycoplasma genomes

[Figure panels: Using FASTA; Fixed-length patterns (25-mers); MUMmer]


Post-sequencing challenges

• Genome sequencing is just the beginning of appreciating biocomplexity
• Sequence-based function assignment approaches fail as sequence similarity drops …
• Structure-based function prediction approaches are limited by the availability of structures and by the association of structural motifs with functional descriptors
• As a result, in any genome,
– Genes with unknown function: ~60%
– Genes with known function: ~40%
