A data model for Comparative Genomics
Laboratory for Bioinformatics (LBI), Institute of Computing (IC) - UNICAMP
PhD Student: Luciano Antonio Digiampietri
Advisor: João Carlos Setubal
Co-advisor: Cláudia Maria Bauzer Medeiros
History
In 2002 the following genomes:
– Agrobacterium tumefaciens
– Mesorhizobium loti
– Ralstonia solanacearum
– Sinorhizobium meliloti
– Xanthomonas axonopodis pv. citri
– Xanthomonas campestris pv. campestris
– Xylella fastidiosa cvc
– Xylella fastidiosa Temecula1
Were compared by the following people:
– M. A. Van Sluys, C. B. Monteiro-Vitorello, L. E. A. Camargo, C. F. M. Menck, A. C. R. da Silva, J. A. Ferro, M. C. Oliveira, J. C. Setubal, J. P. Kitajima, A. J. Simpson.
Plant-associated bacteria
To help the comparison, a database was created: the PAB database
Main author: J. P. Kitajima
Publication: M. A. van Sluys, C. B. Monteiro-Vitorello, L. E. A. Camargo, C. F. M. Menck, A. C. R. da Silva, J. A. Ferro, M. C. Oliveira, J. C. Setubal, J. P. Kitajima, and A. J. G. Simpson. Comparative genomic analysis of plant-associated bacteria. Annual Review of Phytopathology, 40, 169-189, 2002.
This publication presents analysis results, not a description of the database.
This work
– PAB database overhaul
  • Redesign
  • Repopulation (data reload)
  • Inclusion of new query and visualization tools
– PAB database description (there was none)
– Results
  • It is now much more flexible: it can be used as a building block of larger information systems
  • Scalable: it is much easier to include more genomes
Motivation for the work
Growing number of complete genomes of bacteria:
– Today there are about 130 complete genomes
– In a few years there will be more than 1000
The genomes of several species of a genus, or indeed of several strains of the same species, have been sequenced.
This data growth has made it necessary to develop new systems and tools for comparative genomics.
– The new systems must be:
  • Flexible
  • Scalable
Scope
[Diagram: the scope of a comparison ranges from strains, to species, to small sets of genomes, to large sets of genomes. Examples shown on the slide:
– Xylella fastidiosa: citrus, grape, almond, oleander
– Xanthomonas: axonopodis pv. citri, campestris pv. campestris, oryzae, vesicatoria
– Plant associated bacteria: Agrobacterium tumefaciens, Sinorhizobium meliloti, Xanthomonas axonopodis pv. citri, Xylella fastidiosa cvc
– All microbial]
Basic concepts: Replicon
Any kind of cell unit that contains genetic information (e.g. chromosomes, plasmids and mitochondria)
[Figure: Synechocystis sp. PCC 6803, showing the chromosome and the plasmids pSYSM, pSYSA and pSYSX]
Basic concepts: Homology
Homology: two genes are homologous if they share a common ancestor.
[Figure: homologous genes in organism1 and organism2]
Basic concepts: Homology (II)
Paralogous genes are two (or more) homologous genes in the same organism.
Orthologous genes are homologous genes that belong to different organisms.
[Figure: paralogous genes within one organism; orthologous genes across two organisms]
Basic concepts: gene family
gene_id    genome_id  gene_category  gene_product
Atu0324    At         III.A.1        chromosomal replication initiator protein dnaA
SMc01167   Sm         III.A.1        chromosomal replication initiator protein
Mll5581    Ml         III.A.1        chromosomal replication initiator protein dnaA
XCC0001    Xcc        III.A.1        chromosomal replication initiator
XAC0001    Xac        III.A.1        chromosomal replication initiator
PD0001     Xfpd       III.A.1        chromosomal replication initiator
XF0001     Xfcvc      III.A.1        chromosomal replication initiator
RSc3442    Rs         III.A.1        probable chromosomal replication initiator protein dnaA
Basic concepts: functional category
I - Intermediary metabolism
– Degradation
  • Degradation of polysaccharides and oligosaccharides
  • Degradation of small molecules
  • Degradation of lipids
– Central intermediary metabolism
– Energy metabolism, carbon
– Regulatory functions
II - Biosynthesis of small molecules
III - Macromolecule metabolism
IV - Cell structure
V - Cellular processes
VI - Mobile genetic elements
VII - Pathogenicity, virulence, and adaptation
VIII - Hypothetical
Motivation queries
– Given two or more genomes, what are the genes shared between them and to what families do they belong?
– Given two or more genomes, what are the genes specific to one in relation to the others, and to what families do they belong?
– Given a gene x from an organism not in the system, does it have homologs in the system? If so, how many?
[Figure: genomes (G1, G2, ..., Gk) are made of replicons (R1, R2, ..., Rp), which contain genes; genes from different genomes are grouped into families (Family1, Family2), with a Category level above the families]
Attributes
Attributes based on GenBank data:
– Genome: id, strain, source, taxid, description
– Replicon: id, genome_id, description, sequence
– Genes: id, replicon_id, start_pos, end_pos, gene_synonym, orientation, product, name, gi, category
Conceptual model
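[Diagram: entities Genome, Replicon, Gene, Gene Family, Category and BLAST Hits. Cardinality labels on the slide: 1..N and N..N along the chain Genome, Replicon, Gene, Gene Family, plus 1..N, 1:N and 2:N on the remaining links]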
Tables and relationships
category_tbl: categ_id, categ_description
gene_blast_tbl: gene_id, blast_type, blast_db, blast_order, blast_gene_id, blast_tax_id, blast_qu_cover, blast_sj_cover, blast_idty, blast_description
replicon_tbl: replicon_id, genome_id, replicon_description, replicon_sequence
gene_tbl: gene_id, gene_start_pos, gene_end_pos, replicon_id, gene_synonym, gene_orientation, gene_product, gene_name, gene_category, gene_category_sec, gene_gi
family_tbl: family_id, description
gene_family_tbl: family_id, gene_id, genome_id
genome_tbl: genome_id, genome_strain, genome_source, genome_taxid, genome_description, genome_pab
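The seven tables above can be sketched as an SQLite schema. This is only an illustration: the table and column names come from the slide, but the SQL types, the primary keys and the foreign-key links are assumptions, and the slides do not say which DBMS PABdb actually uses. The helper `genes_of_genome` is hypothetical.

```python
import sqlite3

# Column names follow the slide; types and key constraints are assumptions.
SCHEMA = """
CREATE TABLE genome_tbl (
    genome_id TEXT PRIMARY KEY,
    genome_strain TEXT, genome_source TEXT, genome_taxid INTEGER,
    genome_description TEXT, genome_pab INTEGER
);
CREATE TABLE replicon_tbl (
    replicon_id TEXT PRIMARY KEY,
    genome_id TEXT REFERENCES genome_tbl(genome_id),
    replicon_description TEXT, replicon_sequence TEXT
);
CREATE TABLE category_tbl (
    categ_id TEXT PRIMARY KEY,
    categ_description TEXT
);
CREATE TABLE gene_tbl (
    gene_id TEXT PRIMARY KEY,
    gene_start_pos INTEGER, gene_end_pos INTEGER,
    replicon_id TEXT REFERENCES replicon_tbl(replicon_id),
    gene_synonym TEXT, gene_orientation TEXT,
    gene_product TEXT, gene_name TEXT,
    gene_category TEXT REFERENCES category_tbl(categ_id),
    gene_category_sec TEXT REFERENCES category_tbl(categ_id),
    gene_gi INTEGER
);
CREATE TABLE family_tbl (
    family_id INTEGER PRIMARY KEY,
    description TEXT
);
CREATE TABLE gene_family_tbl (
    family_id INTEGER REFERENCES family_tbl(family_id),
    gene_id TEXT REFERENCES gene_tbl(gene_id),
    genome_id TEXT REFERENCES genome_tbl(genome_id),
    PRIMARY KEY (family_id, gene_id)
);
CREATE TABLE gene_blast_tbl (
    gene_id TEXT REFERENCES gene_tbl(gene_id),
    blast_type TEXT, blast_db TEXT, blast_order INTEGER,
    blast_gene_id TEXT, blast_tax_id INTEGER,
    blast_qu_cover REAL, blast_sj_cover REAL, blast_idty REAL,
    blast_description TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One gene taken from the gene family slide, loaded through the whole chain.
conn.execute("INSERT INTO genome_tbl (genome_id, genome_description) VALUES (?, ?)",
             ("Xac", "Xanthomonas axonopodis pv. citri"))
conn.execute("INSERT INTO replicon_tbl (replicon_id, genome_id, replicon_description) "
             "VALUES (?, ?, ?)", ("Xac-chromosome", "Xac", "chromosome"))
conn.execute("INSERT INTO category_tbl (categ_id) VALUES (?)", ("III.A.1",))
conn.execute("INSERT INTO gene_tbl (gene_id, replicon_id, gene_product, gene_category) "
             "VALUES (?, ?, ?, ?)",
             ("XAC0001", "Xac-chromosome", "chromosomal replication initiator", "III.A.1"))


def genes_of_genome(conn, genome_id):
    """Return the gene ids of a genome by joining gene_tbl with replicon_tbl."""
    rows = conn.execute(
        "SELECT g.gene_id FROM gene_tbl g "
        "JOIN replicon_tbl r ON g.replicon_id = r.replicon_id "
        "WHERE r.genome_id = ? ORDER BY g.gene_id",
        (genome_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(genes_of_genome(conn, "Xac"))
```

Genes reach their genome only through the replicon, which mirrors the Genome, Replicon, Gene chain of the conceptual model.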
PABdb information system
Plant Associated Bacteria Database
Main objectives:
– management of genome data;
– comparison among genomes;
– clustering of genes in gene families and in categories;
– allow easy inclusion of new comparison tools.
System overview
[Diagram components: external sources (DBMSs and structured files); converters of data; local DBMS; BLAST, category and family operations; user tools]
Gene Families and Categories
Gene families were created based on BLAST results and on an undirected graph model G:
– the connected components of G are the families.
Gene categories were assigned by:
– automatic methods;
– a human curator.
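The family construction above, where families are the connected components of an undirected graph, can be sketched with a union-find structure. This is a minimal sketch under assumptions: the slides do not state which BLAST hits become edges of G or which algorithm was used to find the components, and the function name `gene_families` and the gene ids `g1` to `g6` are hypothetical.

```python
from collections import defaultdict


def gene_families(genes, hits):
    """Group genes into families, the connected components of graph G.

    genes: iterable of gene ids (the vertices of G)
    hits:  iterable of (gene_a, gene_b) pairs, one per accepted BLAST hit
           (the edges of G); the acceptance criteria are not modelled here
    """
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the component representative (path halving).
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in hits:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    families = defaultdict(set)
    for g in parent:
        families[find(g)].add(g)
    return sorted(sorted(members) for members in families.values())


# Hypothetical genes and hits: g1-g2-g3 are chained, g4-g5 are linked, g6 is alone.
families = gene_families(
    ["g1", "g2", "g3", "g4", "g5", "g6"],
    [("g1", "g2"), ("g2", "g3"), ("g4", "g5")],
)
print(families)
```

Because membership is transitive, g1 and g3 land in the same family even without a direct hit between them, which is the defining property of a connected-component grouping.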
PABdb – tools
Query tools:
– Query facilitators;
Visualization tools:
– Genome overview;
– Comparison of orthologous genes of two genomes;
Search mechanism
What are the genes in Xanthomonas axonopodis pv. citri and Xylella fastidiosa cvc and not in Xanthomonas campestris pv. campestris and Xylella fastidiosa Temecula1?
[Diagram components: Query facilitator; SQL query; DBMS; result table; XML result file; Browser]
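A query facilitator for this kind of question could generate SQL along the following lines. This is a sketch, not the SQL that PABdb generates: it assumes that "a gene in genomes A and B and not in C and D" means a gene whose family has members in every included genome and in none of the excluded ones, and it uses only the gene_family_tbl columns from the schema slide. The function name and the demo rows are hypothetical.

```python
import sqlite3


def presence_absence_query(include, exclude):
    """Build SQL for genes of families present in every `include` genome
    and absent from every `exclude` genome (both lists non-empty, no repeats)."""
    inc = ", ".join("?" for _ in include)
    exc = ", ".join("?" for _ in exclude)
    sql = f"""
        SELECT family_id, gene_id, genome_id
        FROM gene_family_tbl
        WHERE genome_id IN ({inc})
          AND family_id IN (
              SELECT family_id FROM gene_family_tbl
              WHERE genome_id IN ({inc})
              GROUP BY family_id
              HAVING COUNT(DISTINCT genome_id) = ?
          )
          AND family_id NOT IN (
              SELECT family_id FROM gene_family_tbl
              WHERE genome_id IN ({exc})
          )
        ORDER BY family_id, gene_id
    """
    # Parameters in the order the placeholders appear in the statement.
    params = [*include, *include, len(include), *exclude]
    return sql, params


# Hypothetical demo data: family 1 matches, family 2 also occurs in Xcc,
# family 3 occurs in only one of the included genomes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_family_tbl (family_id INTEGER, gene_id TEXT, genome_id TEXT)")
conn.executemany(
    "INSERT INTO gene_family_tbl VALUES (?, ?, ?)",
    [
        (1, "xac_a", "Xac"), (1, "xfcvc_a", "Xfcvc"),
        (2, "xac_b", "Xac"), (2, "xfcvc_b", "Xfcvc"), (2, "xcc_b", "Xcc"),
        (3, "xac_c", "Xac"),
    ],
)
sql, params = presence_absence_query(["Xac", "Xfcvc"], ["Xcc", "Xfpd"])
result = conn.execute(sql, params).fetchall()
print(result)
```

Genome ids are passed as bound parameters rather than pasted into the SQL string, so only the number of placeholders depends on the user's selection.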
Screenshot (1) – search tool
Genes in Xanthomonas axonopodis pv. citri and Xylella fastidiosa cvc and not in Xanthomonas campestris pv. campestris and Xylella fastidiosa Temecula1
family_id gene_id categ_id product
2288 Xac-chromosome I.D.2 transcriptional regulator
2288 Xfcvc-chromosome I.D transcriptional regulator
2730 Xac-chromosome VI.B plasmid stability protein
2730 Xfcvc-chromosome VI.B plasmid stabilization protein
2739 Xac-chromosome VIII.A conserved hypothetical protein
2739 Xfcvc-pXF51 VIII.A conserved hypothetical protein
3402 Xac-chromosome I.C.3 cytochrome like B561
3402 Xfcvc-chromosome I.C.3 cytochrome B561
4520 Xac-chromosome VI.A phage-related integrase
4520 Xfcvc-chromosome VI.A phage-related integrase
5376 Xac-chromosome V.B chromosome partitioning related protein
5376 Xfcvc-chromosome V.B chromosome partitioning related protein
5377 Xac-chromosome VIII.A conserved hypothetical protein
5377 Xfcvc-chromosome VIII.A hypothetical protein
5377 Xfcvc-chromosome VIII.A hypothetical protein
5378 Xac-chromosome VIII.A conserved hypothetical protein
5378 Xfcvc-chromosome VIII.A conserved hypothetical protein
5379 Xac-chromosome VIII.A conserved hypothetical protein
5379 Xfcvc-chromosome VIII.A hypothetical protein
5380 Xac-chromosome VIII.A conserved hypothetical protein
5380 Xfcvc-chromosome VIII.A hypothetical protein
5381 Xac-chromosome VIII.A conserved hypothetical protein
5381 Xfcvc-chromosome VIII.A hypothetical protein
5382 Xac-chromosome III.A.2 single-stranded DNA binding protein
5382 Xfcvc-chromosome III.A.2 single-stranded DNA binding protein
5383 Xac-chromosome III.A.5 cytosine-specific DNA methyltransferase
5383 Xfcvc-chromosome III.A.5 DNA methyltransferase
5384 Xac-chromosome VIII.A conserved hypothetical protein
5384 Xfcvc-chromosome VIII.A hypothetical protein
5385 Xac-chromosome VIII.A conserved hypothetical protein
5385 Xfcvc-chromosome VIII.A hypothetical protein
5386 Xac-chromosome VIII.A conserved hypothetical protein
5386 Xfcvc-chromosome VIII.A hypothetical protein
5387 Xac-chromosome VIII.A conserved hypothetical protein
5387 Xfcvc-chromosome VIII.A hypothetical protein
5388 Xac-chromosome VIII.A conserved hypothetical protein
5388 Xfcvc-chromosome VIII.A hypothetical protein
5389 Xac-chromosome VI.B plasmid-related protein
5389 Xfcvc-chromosome VI.B conserved plasmid protein
5390 Xac-chromosome VIII.A conserved hypothetical protein
5390 Xfcvc-chromosome VIII.A hypothetical protein
5391 Xac-chromosome VIII.A conserved hypothetical protein
5391 Xfcvc-chromosome VIII.A hypothetical protein
5413 Xac-chromosome VIII.A conserved hypothetical protein
5413 Xfcvc-chromosome VIII.A hypothetical protein
5414 Xac-chromosome VIII.A conserved hypothetical protein
5414 Xfcvc-chromosome VIII.A hypothetical protein
Search mechanism
Given the genomes Xanthomonas axonopodis pv. citri and Xanthomonas campestris pv. campestris, what are the genes shared between them (orthologous genes)? What are the genes specific to one genome in relation to the other?
[Diagram components: Query facilitator; DBMS; two SQL queries; result tables; XML result file; SVG result file; Visualization tool]
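The shared versus specific question can be illustrated with plain family membership. This sketch assumes that two genomes "share" a gene when both have at least one member in the same family, which is one possible reading of the slide; the function name, the family ids and the gene ids are hypothetical.

```python
def compare_two_genomes(families, genome_a, genome_b):
    """Split families into shared, specific to genome_a, and specific to genome_b.

    families: dict mapping family_id -> dict mapping genome_id -> list of gene ids
    """
    shared, only_a, only_b = [], [], []
    for family_id, members in sorted(families.items()):
        in_a = members.get(genome_a, [])
        in_b = members.get(genome_b, [])
        if in_a and in_b:
            shared.append(family_id)
        elif in_a:
            only_a.append(family_id)
        elif in_b:
            only_b.append(family_id)
    return shared, only_a, only_b


# Hypothetical family membership for two genomes.
families = {
    10: {"Xac": ["xac_1"], "Xcc": ["xcc_1"]},
    11: {"Xac": ["xac_2", "xac_3"]},
    12: {"Xcc": ["xcc_2"]},
    13: {"Xfcvc": ["xf_1"]},
}
shared, only_xac, only_xcc = compare_two_genomes(families, "Xac", "Xcc")
print(shared, only_xac, only_xcc)
```

Families with no member in either genome, such as family 13 here, are simply ignored.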
Screenshot (2) – search tool
Xanthomonas axonopodis pv. citri chromosome compared with Xanthomonas campestris pv. campestris chromosome
Search mechanism
Given the genomes Xanthomonas axonopodis pv. citri and Xanthomonas campestris pv. campestris, what are the genes shared between them (orthologous genes)?
[Diagram components: Query facilitator; SQL query; DBMS; result table; XML result file; SVG result file; Visualization tool]
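How a result table might be turned into an SVG file can be sketched as follows. The slides do not describe the actual drawing, so everything here is an assumption: the layout (two horizontal lines for the chromosomes, one connecting line per orthologous pair), the function name `ortholog_svg` and the positions in the example.

```python
import xml.etree.ElementTree as ET


def ortholog_svg(len_a, len_b, pairs, width=800, height=200):
    """Return SVG text linking orthologous genes of two replicons.

    len_a, len_b: replicon lengths in base pairs
    pairs: list of (pos_a, pos_b) start positions of orthologous genes
    """
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width=str(width), height=str(height))
    y_a, y_b = 40, height - 40
    # One horizontal line per replicon.
    for y in (y_a, y_b):
        ET.SubElement(svg, "line", x1="0", y1=str(y),
                      x2=str(width), y2=str(y), stroke="black")
    # One connecting line per orthologous pair, scaled to the drawing width.
    for pos_a, pos_b in pairs:
        x_a = pos_a / len_a * width
        x_b = pos_b / len_b * width
        ET.SubElement(svg, "line", x1=f"{x_a:.1f}", y1=str(y_a),
                      x2=f"{x_b:.1f}", y2=str(y_b), stroke="steelblue")
    return ET.tostring(svg, encoding="unicode")


# Hypothetical replicon lengths and gene positions.
svg_text = ortholog_svg(1000, 2000, [(100, 250), (500, 900)])
print(svg_text)
```

Scaling each position by its own replicon length keeps both chromosomes the same width on screen, so crossing lines point to rearrangements rather than to size differences.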
Screenshot (3) – visualization tool
Comparison of orthologous genes of Xanthomonas axonopodis pv. citri and Xanthomonas campestris pv. campestris
Distribution of genes of each genome by category
Conclusions
The information systems for genomic management must be scalable and allow exchange of data and operations;
This work presented a simple but flexible and extensible data model for comparative genomics, a first step in the design of a large information system;
The data model was used in a real application (PABdb system).
Future work
Extend the data model to a richer context (e.g. metabolic pathways);
Extend the model to include subdivisions between “family” and “category”;
Use of metadata to describe services and data;
Use of different methods to generate the gene families.
Thank you!
Laboratory for Bioinformatics www.lbi.ic.unicamp.br
Institute of Computing (IC) www.ic.unicamp.br
University of Campinas (UNICAMP) www.unicamp.br
Luciano Antonio Digiampietri [email protected]