Microbial Genome Profile Analysis - Genome Profile DataBase
alpha.life.nthu.edu.tw/SysBio/gpdb_1001.pdf

An integrated system for complete microbial genome analysis

P.C. Lyu

Department of Life Sciences & Institute of Bioinformatics and Structural Biology, National Tsing Hua University

Why study microbial genomes?

•Until whole-genome analysis became viable, the life sciences were based on a reductionist principle: dissecting cells and systems into fundamental components for further study.
•Studies of whole genomes, and whole-genome sequences in particular, give us a complete genomic blueprint for an organism.
•We can now begin to examine how all of these parts operate cooperatively to influence the activities and behavior of an entire organism, working toward a complete understanding of its biology.
•Microbes provide an excellent starting point for studies of this type, as they have a relatively simple genomic structure compared to higher, multicellular organisms.
•Studies of microbial genomes may provide crucial starting points for understanding the genomics of higher organisms.

Why study microbial genomes?

•Analysis of whole microbial genomes also provides insight into microbial evolution and diversity beyond single protein or gene phylogenies.
•In practical terms, analysis of whole microbial genomes is also a powerful tool for identifying new applications for biotechnology and new approaches to the treatment and control of pathogenic organisms.

History of microbial genome sequencing

•1977 - first complete genome to be sequenced was bacteriophage φX174 - 5386 bp
•First genome to be sequenced using random DNA fragments - bacteriophage λ - 48502 bp
•1986 - mitochondrial (187 kb) and chloroplast (121 kb) genomes of Marchantia polymorpha sequenced
•Early 90's - cytomegalovirus (229 kb) and Vaccinia (192 kb) genomes sequenced
•1995 - first complete genome sequence from a free-living organism - Haemophilus influenzae (1.83 Mb)
•Late 1990's - many additional microbial genomes sequenced, including Archaea (Methanococcus jannaschii - 1996) and Eukaryotes (Saccharomyces cerevisiae - 1996)

Most completely sequenced genomes are microbial.

163 Complete Microbial Genomes

Does whole-genome information reflect environmental biodiversity?

Phylogenomics - whole-genome scale.

Genome composition bias in extremophiles:
•Hyperthermophile (80-110℃)
•Psychrophile (<15℃)
•Acidophile (pH<2)
•Alkaliphile (pH>10)
•Halophile (0.2M salt) …

What is GPDB? (Genome Profile DataBase)

•Information derived from both nucleotide and protein sequences on a genome-wide scale.
•Provides and compares features of the fully sequenced organisms in a graphical, easy-to-read way.

http://gpdb.life.nthu.edu.tw

GPDB Current Status

•145 organisms (17 Archaea, 128 Bacteria)
•223 complete sequences (157 chromosomes, 66 plasmids)
•Total - 429,177 proteins (ORFs)

Definition of Genome Profile

•Basic information - species name, taxonomy, number of chromosomes/plasmids, genome size, ORF number…

•Nucleotide composition - ATGC composition, GC/AT content, N-nucleotide frequency (n=2,3), codon usage…

•Amino acid composition - amino acid group composition, N-peptide frequency distribution (n=1,2), proteome length, Mw, pI distribution…
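The nucleotide and amino acid items above can be sketched in a few lines of Python. This is only an illustration of the kind of counts a genome profile holds, not the Genome Profile Pipeline itself (which the slides describe as a Perl package); the function names and the toy sequences are made up for the example.

```python
from collections import Counter

def nucleotide_profile(dna, n_values=(2, 3)):
    """Base fractions, GC/AT content and overlapping n-mer frequencies."""
    dna = dna.upper()
    total = len(dna)
    base_counts = Counter(dna)
    profile = {
        "genome_size": total,
        "base_fraction": {b: base_counts[b] / total for b in "ATGC"},
        "gc_content": (base_counts["G"] + base_counts["C"]) / total,
        "at_content": (base_counts["A"] + base_counts["T"]) / total,
        "nmer_frequency": {},
    }
    for n in n_values:
        windows = total - n + 1  # number of overlapping windows of width n
        counts = Counter(dna[i:i + n] for i in range(windows))
        profile["nmer_frequency"][n] = {k: c / windows for k, c in counts.items()}
    return profile

def codon_usage(cds):
    """Count in-frame codons of a coding sequence (trailing bases are ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def amino_acid_composition(protein):
    """Fraction of each residue in a protein sequence."""
    protein = protein.upper()
    counts = Counter(protein)
    return {aa: c / len(protein) for aa, c in counts.items()}

# Toy input, invented for illustration only.
profile = nucleotide_profile("ATGCGC")
codons = codon_usage("ATGGCGTAA")
composition = amino_acid_composition("MKK")
```

For a real genome the same functions would be applied to each chromosome or plasmid sequence and to every annotated ORF, with the results stored per organism.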

GPDB flowchart

[Flowchart]
Source: NCBI RefSeq (whole-genome sequence information), COG (orthologous protein information), GTOP (PSI-Blast 3D structure information)
Process: Genome Profile Pipeline, run on a Linux PC cluster, producing the Genome Profile data
GPDB: MySQL database served through Apache + PHP + GD2 + JpGraph
Interfaces: Browse, Compare, Virtual 2D

GPDB database schema

[Diagram]
NCBI GenBank / RefSeq -> Whole Genome Data (grab whole-genome DNA/protein sequences, including annotation)
Whole Genome Data -> Genome Profile -> Data analysis: Browse, Compare, 2D gel

Browse – Helicobacter pylori 26695

Basic Information

AT/GC Content & Skew

di-, tri-Nucleotide composition

Codon Usage

Amino acid composition

Proteome Distribution

TMHMM prediction



Virtual 2D Gel

Simulate a 2D gel by pI and Mw

•Search by gel range: choose a pH range and an Mw range to simulate the real 2D gel.
•Search by spot range: give a spot position with a tolerance to guess the possible proteins, e.g. pI = 5.7 +/- 5%, Mw = 100K +/- 5%.

Filter out transmembrane proteins with the TMHMM program.

E. L. L. Sonnhammer, G. von Heijne, and A. Krogh. In J. Glasgow et al., eds., Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology, 175-182. AAAI Press, 1998.
A. Krogh, B. Larsson, G. von Heijne, and E. L. L. Sonnhammer. Journal of Molecular Biology, 305(3):567-580, January 2001.
S. Moller, M. D. Croning, and R. Apweiler. Bioinformatics, 18(1):218, January 2002.
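The two search modes and the transmembrane filter can be sketched as follows. This is a minimal Python illustration, not the GPDB code (which the slides describe as PHP over MySQL). The `ProteinRecord` fields, the function names, and the three example records are hypothetical; `tm_helices` stands for a transmembrane-helix count such as a TMHMM prediction would give.

```python
from dataclasses import dataclass

@dataclass
class ProteinRecord:
    name: str
    pi: float        # calculated isoelectric point
    mw: float        # calculated molecular weight in Da
    tm_helices: int  # predicted transmembrane helices (hypothetical field)

def search_gel_range(records, pi_range, mw_range, exclude_tm=False):
    """Return proteins whose pI and Mw fall inside the given closed ranges."""
    pi_lo, pi_hi = pi_range
    mw_lo, mw_hi = mw_range
    hits = []
    for rec in records:
        if exclude_tm and rec.tm_helices > 0:
            continue  # drop predicted transmembrane proteins
        if pi_lo <= rec.pi <= pi_hi and mw_lo <= rec.mw <= mw_hi:
            hits.append(rec)
    return hits

def search_spot_range(records, pi, mw, tolerance=0.05, exclude_tm=False):
    """Search around one gel spot, e.g. pI = 5.7 +/- 5%, Mw = 100K +/- 5%."""
    pi_range = (pi * (1 - tolerance), pi * (1 + tolerance))
    mw_range = (mw * (1 - tolerance), mw * (1 + tolerance))
    return search_gel_range(records, pi_range, mw_range, exclude_tm)

# Invented records for illustration only.
records = [
    ProteinRecord("protA", 5.8, 101000.0, 0),
    ProteinRecord("protB", 5.6, 98000.0, 3),
    ProteinRecord("protC", 9.1, 12000.0, 0),
]
spot_hits = search_spot_range(records, pi=5.7, mw=100000.0, tolerance=0.05)
soluble_hits = search_spot_range(records, pi=5.7, mw=100000.0, exclude_tm=True)
gel_hits = search_gel_range(records, (3.0, 10.0), (10000.0, 15000.0))
```

Treating the spot search as a gel-range search with a relative tolerance keeps one filtering routine for both modes.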

Virtual 2D Gel Flowchart

[Flowchart] RefSeq sequences from NCBI -> calculate pI and MW -> MySQL (protein records) -> PHP / GD Library

•Sequence source: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/

•pI and MW calculation: EMBOSS package – pepstats
•Available at http://www.hgmp.mrc.ac.uk/Software/EMBOSS/
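To make the "calculate pI and MW" step concrete, here is a rough Python sketch of how a theoretical molecular weight and isoelectric point can be derived from a sequence alone. It is not the EMBOSS implementation: the pK values below are an illustrative set chosen for this example, the residue masses are approximate average masses, and the function names are invented. The pI is found by bisection on the net charge, which decreases monotonically with pH.

```python
# Approximate average residue masses in Da (assumed textbook values).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153

# Illustrative pK values (an assumption; real tools ship their own tables).
PK_POSITIVE = {"N_term": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PK_NEGATIVE = {"C_term": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}

def molecular_weight(seq):
    """Sum of residue masses plus one water for the free termini."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER_MASS

def net_charge(seq, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    charge = 1.0 / (1.0 + 10 ** (ph - PK_POSITIVE["N_term"]))
    charge -= 1.0 / (1.0 + 10 ** (PK_NEGATIVE["C_term"] - ph))
    for aa in seq.upper():
        if aa in PK_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PK_POSITIVE[aa]))
        elif aa in PK_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PK_NEGATIVE[aa] - ph))
    return charge

def isoelectric_point(seq, iterations=50):
    """Bisection on pH in [0, 14] for the point where the net charge is zero."""
    lo, hi = 0.0, 14.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged, so the pI lies above mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Different pK tables shift the calculated pI, which is one reason theoretical and experimental values differ.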

pI and MW: Theoretical vs. Experimental

Comparison between calculated and experimentally obtained pI and MW values for 47 randomly selected proteins from Pseudomonas aeruginosa.

Nucleic Acids Research, 2003, Vol. 31, No. 13 3862-3865

Search by Gel Range without TMHMM Filter

pI 3-10, Mw 10000 - 15000

Transmembrane Helix

Annotation

Search by Gel Range with TMHMM Filter

Search by Spot Range without TMHMM Filter

pI = 7 +/- 10%, Mw = 100000 +/- 10%

Search by Spot Range with TMHMM Filter



Example: pI distribution

4 strains of E. coli

2 strains of H. pylori

Lactococcus lactis subsp. lactis bv. Diacetylactis

Bordetella pertussis

Wigglesworthia glossinidia

Halobacterium sp. NRC-1

Pyrococcus abyssi

Similar species, similar pI distribution

[Figure: pI distribution of 4 E. coli strains (ecoli0, ecoli1, ecoli2, ecoli3). x-axis: pI, 0.5 to 13.5; y-axis: percentage, 0% to 12%.]

[Figure: pI distribution of 2 H. pylori strains (hpylo0, hpylo1). x-axis: pI, 0.5 to 13.5; y-axis: percentage, 0% to 18%.]

Different species, different pI distribution

[Figure: pI distribution of ecoli0, halob0, hpylo0 and wglos0. x-axis: pI, 0.5 to 13.5; y-axis: percentage, 0% to 50%.]

[Figure: pI distribution of all 145 species. x-axis: pI, 0.5 to 13.5; y-axis: percentage, 0% to 50%.]
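A pI distribution of this kind is a per-genome histogram expressed as percentages. The sketch below assumes bins one pI unit wide, labelled by their centres (0.5, 1.5, …, 13.5) to match the axis ticks of the plots; the bin width, the function name, and the sample values are assumptions made for illustration.

```python
def pi_distribution(pi_values, bin_width=1.0, max_pi=14.0):
    """Percentage of proteins per pI bin, keyed by bin centre."""
    if not pi_values:
        raise ValueError("need at least one pI value")
    n_bins = int(max_pi / bin_width)
    counts = [0] * n_bins
    for pi in pi_values:
        # Clamp so that pI == max_pi falls into the last bin.
        index = min(int(pi / bin_width), n_bins - 1)
        counts[index] += 1
    total = len(pi_values)
    return {(i + 0.5) * bin_width: 100.0 * c / total for i, c in enumerate(counts)}

# Invented pI values for illustration only.
distribution = pi_distribution([4.2, 4.8, 5.1, 9.5])
```

Using percentages rather than raw counts lets genomes with very different ORF numbers be drawn on the same axes.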

How to compare?

Interactive online analysis helps us explore different combinations.

Easy to read.

Transform

[Bacteria] - Aquifex aeolicus VF5

Transform

Hierarchical Clustering

Clustering

•There are many clustering methods.
•We use Euclidean distances for hierarchical clustering.
•It is just an easy way to read the data, not the only solution!
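As an illustration of hierarchical clustering on Euclidean distances, here is a small pure-Python sketch. The slides do not say which linkage rule GPDB uses, so average linkage is assumed here; the profile vectors are invented and only reuse organism codes from the plots as labels.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length profile vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_clustering(profiles):
    """Agglomerative clustering; returns the merges as (cluster, cluster, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []

    def linkage(c1, c2):
        # Average linkage (assumed): mean distance over all cross-cluster pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters and record the step.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Invented three-bin profiles for illustration only.
profiles = {
    "ecoli0": [0.10, 0.30, 0.20],
    "ecoli1": [0.11, 0.29, 0.20],
    "wglos0": [0.02, 0.05, 0.45],
}
merges = hierarchical_clustering(profiles)
```

The list of merges, read in order, is the information a dendrogram draws: the two similar profiles join first, and the outlier joins last at a larger distance.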

On-line comparison

On-line Clustering

[Figure: pI distribution of all 429,177 GPDB proteins (ORFs). x-axis: pI, 0.5 to 14 in steps of 0.5; y-axis: number of proteins, 0 to 60,000.]

Extreme pI Distribution

[Figure: pI distribution of baphi0, baphi1, baphi2, cbloc0, mgeni0, mpulm0, uurea0 and wglos0. x-axis: pI, 0.5 to 13.5; y-axis: percentage, 0% to 50%.]

[Figure: pI distribution of blong0, halob0, mkand0, mther0, synec0 and synec1. x-axis: pI, 0.5 to 13.5; y-axis: percentage, 0% to 50%.]

Endosymbiont?

Halophile, Methanophile?

Conclusion

•We have constructed a database (GPDB) that provides many whole-genome scale features.
•A Perl package called "Genome Profile Pipeline", which automatically analyzes the data, was programmed.
•GPDB can help us compare and analyze different microbial systems on a whole-genome scale.
•Integrated information from this database may be useful for data mining, comparative genomics and systems biology research.

Thanks for your attention
