Microbial Genome Profile Analysis - Genome Profile DataBase
An integrated system for complete microbial genome analysis
P.C. Lyu
Department of Life Sciences &
Institute of Bioinformatics and Structural Biology, National Tsing Hua University
Why study microbial genomes?
• Until whole-genome analysis became viable, the life sciences were based on a reductionist principle: dissecting cells and systems into fundamental components for further study.
• Studies on whole genomes, and whole-genome sequences in particular, give us a complete genomic blueprint for an organism.
• We can now begin to examine how all of these parts operate cooperatively to influence the activities and behavior of an entire organism, moving toward a complete understanding of its biology.
• Microbes provide an excellent starting point for studies of this type, as they have a relatively simple genomic structure compared to higher, multicellular organisms.
• Studies on microbial genomes may provide crucial starting points for understanding the genomics of higher organisms.
Why study microbial genomes?
• Analysis of whole microbial genomes also provides insight into microbial evolution and diversity beyond single protein or gene phylogenies.
• In practical terms, analysis of whole microbial genomes is also a powerful tool for identifying new applications in biotechnology and new approaches to the treatment and control of pathogenic organisms.
History of microbial genome sequencing
• 1977: first complete genome to be sequenced was bacteriophage φX174 (5386 bp)
• First genome to be sequenced using random DNA fragments: bacteriophage λ (48502 bp)
• 1986: mitochondrial (187 kb) and chloroplast (121 kb) genomes of Marchantia polymorpha sequenced
• Early 1990s: cytomegalovirus (229 kb) and Vaccinia (192 kb) genomes sequenced
• 1995: first complete genome sequence from a free-living organism, Haemophilus influenzae (1.83 Mb)
• Late 1990s: many additional microbial genomes sequenced, including Archaea (Methanococcus jannaschii, 1996) and Eukaryotes (Saccharomyces cerevisiae, 1996)
Most complete genomes are microbial.
163 Complete Microbial Genomes
Does whole-genome information reflect environmental biodiversity?
Phylogenomics: whole-genome scale.
Genome composition bias in extremophiles: hyperthermophiles (80-110℃), psychrophiles (<15℃), acidophiles (pH<2), alkaliphiles (pH>10), halophiles (0.2M salt) …
What is GPDB? (Genome Profile DataBase)
• Information derived from both nucleotide and protein sequences on a genome-wide scale.
• Provides and compares features of the fully sequenced organisms in a graphical, easy-to-read way.
http://gpdb.life.nthu.edu.tw
GPDB Current Status
145 organisms (17 Archaea, 128 Bacteria)
223 complete sequences (157 chromosomes, 66 plasmids)
Total: 429,177 proteins (ORFs)
Definition of Genome Profile
Basic information: species name, taxonomy, number of chromosomes/plasmids, genome size, ORF number…
Nucleotide composition: ATGC composition, GC/AT content, N-nucleotide frequency (n=2,3), codon usage…
Amino acid composition: amino acid group composition, N-peptide frequency distribution (n=1,2), proteome length, Mw, pI distribution…
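The nucleotide-composition part of the profile can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the Genome Profile Pipeline; the sketch only shows one plausible way to compute GC content, N-nucleotide frequency (n=2,3) and codon usage from a plain DNA string.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / total


def kmer_frequencies(seq, n):
    """Relative frequency of each overlapping n-mer (n=2 or 3 in the profile)."""
    seq = seq.upper()
    kmers = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    # Skip n-mers containing ambiguous bases such as N.
    counts = Counter(k for k in kmers if set(k) <= set("ACGT"))
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}


def codon_usage(cds):
    """Count in-frame codons of a coding sequence (frame 0, trailing bases ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))
```

For a whole genome, codon usage would be accumulated over every annotated ORF, while GC content and n-mer frequencies can be taken over the full chromosome or plasmid sequence.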
GPDB flowchart
Source: NCBI RefSeq (whole-genome sequence information), COG (orthologous protein information), GTOP (PSI-BLAST 3D structure information)
Process: Genome Profile Pipeline, run on a Linux PC cluster, produces the genome profile data
GPDB: MySQL database, served through Apache + PHP + GD2 + JpGraph
Functions: Browse, Compare, Virtual 2D
GPDB database schema
NCBI GenBank / RefSeq
Whole Genome Data
Grab whole-genome DNA/protein sequences, including annotation
Genome Profile
Data analysis
2D gel
Compare
Browse
Browse – Helicobacter pylori 26695
Basic Information
AT/GC Content & Skew
di-, tri-Nucleotide composition
Codon Usage
Amino acid composition
Proteome Distribution
TMHMM prediction
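The "AT/GC Content & Skew" view can be illustrated with the usual definition of skew, (G - C) / (G + C), computed window by window along the sequence. This is a sketch only: the window and step sizes are illustrative assumptions, and the function names are hypothetical rather than GPDB's own.

```python
def skew(window, x, y):
    """(x - y) / (x + y) for base counts in the window; 0.0 if neither base occurs."""
    nx, ny = window.count(x), window.count(y)
    return (nx - ny) / (nx + ny) if nx + ny else 0.0


def gc_skew_profile(seq, window=1000, step=1000):
    """GC skew for successive windows along the sequence.

    Returns a list of (window_start, skew) pairs. The default window and
    step sizes are illustrative, not values taken from GPDB.
    """
    seq = seq.upper()
    return [
        (start, skew(seq[start:start + window], "G", "C"))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Calling `skew(window, "A", "T")` gives the corresponding AT skew for the same window.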
Virtual 2D Gel
Simulate a 2D gel by pI and Mw
• Search by gel range: choose a pI range and an Mw range to simulate a real 2D gel.
• Search by spot range: guess which proteins may lie behind a spot, e.g. pI = 5.7 +/- 5%, Mw = 100K +/- 5%.
• Filter out transmembrane proteins with the TMHMM program:
E. L. L. Sonnhammer, G. von Heijne, and A. Krogh. In J. Glasgow et al., eds., Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology, 175-182. AAAI Press, 1998.
A. Krogh, B. Larsson, G. von Heijne, and E. L. L. Sonnhammer. Journal of Molecular Biology, 305(3):567-580, January 2001.
S. Moller, M. D. Croning, and R. Apweiler. Bioinformatics, 18(1):218, January 2002.
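The spot-range search described above can be sketched as a simple filter. The `Protein` record and the `spot_search` function are hypothetical names for illustration; in GPDB the records live in MySQL and the transmembrane flag would come from a TMHMM prediction, so this is only an in-memory approximation of the idea.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    name: str
    pi: float        # calculated isoelectric point
    mw: float        # calculated molecular weight in Da
    tm_helices: int  # number of predicted transmembrane helices


def spot_search(proteins, pi, mw, tolerance=0.05, exclude_tm=False):
    """Return proteins within +/- tolerance (relative) of a spot's pI and Mw.

    With exclude_tm=True, proteins with at least one predicted
    transmembrane helix are dropped, mimicking the TMHMM filter.
    """
    hits = []
    for p in proteins:
        if abs(p.pi - pi) > tolerance * pi:
            continue
        if abs(p.mw - mw) > tolerance * mw:
            continue
        if exclude_tm and p.tm_helices > 0:
            continue
        hits.append(p)
    return hits
```

With the example from the slide, `spot_search(proteins, pi=5.7, mw=100_000, tolerance=0.05)` keeps proteins with pI between 5.415 and 5.985 and Mw between 95,000 and 105,000 Da.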
Virtual 2D Gel Flowchart
• Ref sequence from NCBI: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
• Calculate pI and MW: EMBOSS package (pepstats), available at http://www.hgmp.mrc.ac.uk/Software/EMBOSS/
• MySQL: protein records
• PHP / GD Library
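To make the "calculate pI and MW" step concrete, here is a minimal sketch of how theoretical values can be derived from a protein sequence. It is not the EMBOSS implementation: the pKa set is an assumption (different tools use different values, so results will differ slightly), the residue masses are standard average masses, and the function names are hypothetical.

```python
WATER = 18.015  # mass of one water molecule, added once per chain

# Average residue masses in Da (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}

# Assumed pKa values; other published sets would shift the pI slightly.
PK_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PK_NTERM, PK_CTERM = 8.6, 3.6


def molecular_weight(seq):
    """Average molecular weight in Da; raises KeyError on non-standard residues."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER


def net_charge(seq, ph):
    """Net charge of the chain at a given pH (Henderson-Hasselbalch)."""
    seq = seq.upper()
    charge = 1 / (1 + 10 ** (ph - PK_NTERM))   # N-terminal amino group
    charge -= 1 / (1 + 10 ** (PK_CTERM - ph))  # C-terminal carboxyl group
    for aa, pk in PK_POSITIVE.items():
        charge += seq.count(aa) / (1 + 10 ** (ph - pk))
    for aa, pk in PK_NEGATIVE.items():
        charge -= seq.count(aa) / (1 + 10 ** (pk - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged: pI lies at a higher pH
        else:
            hi = mid
    return (lo + hi) / 2
```

Bisection works here because the net charge decreases monotonically as pH rises, so there is exactly one zero crossing.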
pI and MW: Theoretical vs. Experimental
Comparison between calculated and experimentally obtained pI and MW values for 47 randomly selected proteins from Pseudomonas aeruginosa.
Nucleic Acids Research, 2003, Vol. 31, No. 13 3862-3865
Search by Gel Range without TMHMM Filter
pI 3-10, Mw 10000-15000
Transmembrane Helix
Annotation
Search by Gel Range with TMHMM Filter
Search by Spot Range without TMHMM Filter
pI = 7 +/- 10%, Mw = 100000 +/- 10%
Search by Spot Range with TMHMM Filter
Compare
ex: pI distribution
4 strains of E. coli
2 strains of H. pylori
Lactococcus lactis subsp. lactis bv. Diacetylactis
Bordetella pertussis
Wigglesworthia glossinidia
Halobacterium sp. NRC-1
Pyrococcus abyssi
Similar species, similar pI distribution
[Figure: pI distribution for 4 strains of E. coli (ecoli0, ecoli1, ecoli2, ecoli3). x-axis: pI, 0.5 to 13.5; y-axis: 0% to 12%]
[Figure: pI distribution for 2 strains of H. pylori (hpylo0, hpylo1). x-axis: pI, 0.5 to 13.5; y-axis: 0% to 18%]
Different species, different pI distribution
[Figure: pI distribution for ecoli0, halob0, hpylo0 and wglos0. x-axis: pI, 0.5 to 13.5; y-axis: 0% to 50%]
[Figure: pI distribution for all 145 species. x-axis: pI, 0.5 to 13.5; y-axis: 0% to 50%]
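The pI distribution plots above can be reproduced in outline by binning calculated pI values and expressing each bin as a percentage of the proteome. The bin width of one pI unit is inferred from the axis ticks (0.5, 1.5, ..., 13.5) and is an assumption, as is the function name.

```python
def pi_distribution(pi_values, n_bins=14):
    """Percentage of proteins in each 1-unit pI bin: [0,1), [1,2), ..., [13,14]."""
    counts = [0] * n_bins
    for pi in pi_values:
        index = min(int(pi), n_bins - 1)  # a pI of exactly 14 goes in the last bin
        counts[index] += 1
    total = len(pi_values)
    if total == 0:
        return [0.0] * n_bins
    return [100.0 * c / total for c in counts]
```

Bin `i` corresponds to the tick labelled `i + 0.5` on the plots; using percentages rather than raw counts makes proteomes of different sizes directly comparable.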
How to compare?
Interactive on-line analysis helps us explore different combinations.
Easy to read.
Transform
[Bacteria] - Aquifex aeolicus VF5
Transform
Hierarchical Clustering
Clustering
• There are many clustering methods.
• We use Euclidean distances for hierarchical clustering.
• It is just an easy way to read the data, not the only solution!
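As a rough illustration of hierarchical clustering on Euclidean distances, the sketch below clusters organisms by their profile vectors (for example, binned pI distributions). The slides name only the distance measure, so the average-linkage rule used here is an assumption, and the function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length profile vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles, a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    dists = [euclidean(profiles[i], profiles[j]) for i in a for j in b]
    return sum(dists) / len(dists)


def hierarchical_clustering(profiles):
    """Agglomerative clustering of {name: vector}; returns the merge history.

    Each entry is (members_of_cluster_a, members_of_cluster_b, distance).
    Average linkage is an assumption; the slides only name Euclidean distance.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of clusters at every step.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(profiles, pair[0], pair[1]),
        )
        merges.append((a, b, average_linkage(profiles, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges
```

The merge history is enough to draw a dendrogram: early merges with small distances correspond to organisms with similar profiles, such as strains of the same species.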
On-line comparison
On-line Clustering
[Figure: pI distribution of all 429,177 proteins (ORFs) in GPDB. x-axis: pI, 0.5 to 14 in steps of 0.5; y-axis: number of ORFs, 0 to 60,000]
Extreme pI Distribution
[Figure: pI distribution for baphi0, baphi1, baphi2, cbloc0, mgeni0, mpulm0, uurea0 and wglos0. x-axis: pI, 0.5 to 13.5; y-axis: 0% to 50%]
[Figure: pI distribution for blong0, halob0, mkand0, mther0, synec0 and synec1. x-axis: pI, 0.5 to 13.5; y-axis: 0% to 50%]
Endosymbiont?
Halophile, Methanophile?
Conclusion
• We have constructed a database (GPDB) that provides many whole-genome-scale features.
• A Perl package called "Genome Profile Pipeline", which analyzes the data automatically, was written.
• GPDB can help us compare and analyze different microbial systems on a whole-genome scale.
• Integrated information from this database may be useful for data mining, comparative genomics and systems biology research.
Thanks for your attention