A data model for Comparative Genomics

Laboratory for Bioinformatics (LBI), Institute of Computing (IC) - UNICAMP

PhD Student: Luciano Antonio Digiampietri
Advisor: João Carlos Setubal
Co-advisor: Cláudia Maria Bauzer Medeiros

History

In 2002, the genomes of the following plant-associated bacteria:
– Agrobacterium tumefaciens
– Mesorhizobium loti
– Ralstonia solanacearum
– Sinorhizobium meliloti
– Xanthomonas axonopodis pv. citri
– Xanthomonas campestris pv. campestris
– Xylella fastidiosa cvc
– Xylella fastidiosa Temecula1

were compared by the following people:
– M. A. Van Sluys, C. B. Monteiro-Vitorello, L. E. A. Camargo, C. F. M. Menck, A. C. R. da Silva, J. A. Ferro, M. C. Oliveira, J. C. Setubal, J. P. Kitajima, A. J. Simpson.

To support the comparison, a database was created: the PAB database.

Main author: J. P. Kitajima

Publication: M. A. van Sluys, C. B. Monteiro-Vitorello, L. E. A. Camargo, C. F. M. Menck, A. C. R. da Silva, J. A. Ferro, M. C. Oliveira, J. C. Setubal, J. P. Kitajima, and A. J. G. Simpson. Comparative genomic analysis of plant-associated bacteria. Annual Review of Phytopathology, 40, 169-189, 2002.

This publication presents the analysis results, not a description of the database.

This work

– PAB database overhaul
  • Redesign
  • Repopulation (data reload)
  • Inclusion of new query and visualization tools
– PAB database description (there was none)
– Results
  • It is now much more flexible
    – can be used as a building block of larger information systems
  • Scalable
    – Much easier to include more genomes

Motivation for the work

Growing number of complete bacterial genomes:
– Today there are about 130 complete genomes
– In a few years there will be more than 1000

The genomes of several species of a genus, or indeed of several strains of the same species, have been sequenced.

This data growth has made it necessary to develop new systems and tools for comparative genomics.
– The new systems must be:
  • Flexible
  • Scalable

Scope

– Strains: Xylella fastidiosa (citrus, grape, almond, oleander)
– Species: Xanthomonas (axonopodis pv. citri, campestris pv. campestris, oryzae, vesicatoria)
– Small sets of genomes: plant-associated bacteria (Agrobacterium tumefaciens, Sinorhizobium meliloti, Xanthomonas axonopodis pv. citri, Xylella fastidiosa cvc)
– Large sets of genomes: all microbial

Basic concepts: Replicon

Any kind of cell unit that contains genetic information (e.g. chromosomes, plasmids and mitochondria)

(Figure: replicons of Synechocystis sp. PCC 6803, showing the chromosome and the plasmids pSYSM, pSYSA and pSYSX.)

Basic concepts: Homology

Homology: two genes are homologous if they share a common ancestor.

(Figure: two pairs of homologous genes.)

Basic concepts: Homology (II)

Paralogous genes are two (or more) homologous genes in the same organism.

Orthologous genes are homologous genes that belong to different organisms.

(Figure: organism1 and organism2, with paralogous genes marked within one organism and orthologous genes marked across the two.)
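Under the simplified definitions on this slide, once two genes are known to be homologous, the only thing that separates the two cases is whether they come from the same organism. A minimal Python sketch of that rule follows; the function name, the (gene_id, genome_id) tuple layout and the example identifiers are assumptions made for illustration, not part of PABdb.

```python
from typing import Tuple

# (gene_id, genome_id); a hypothetical layout chosen for this sketch.
Gene = Tuple[str, str]


def classify_homologs(a: Gene, b: Gene) -> str:
    """Classify two genes that are ALREADY known to be homologous.

    Follows the slide's definitions: same organism means paralogous,
    different organisms means orthologous.
    """
    same_organism = a[1] == b[1]
    return "paralogous" if same_organism else "orthologous"


# Made-up identifiers, for illustration only.
print(classify_homologs(("geneA1", "org1"), ("geneB1", "org2")))
print(classify_homologs(("geneA1", "org1"), ("geneA2", "org1")))
```

The sketch takes homology as given; establishing it (for example from BLAST hits) is a separate step.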

Basic concepts: gene family

gene_id   genome_id  gene_category  gene_product
Atu0324   At         III.A.1        chromosomal replication initiator protein dnaA
SMc01167  Sm         III.A.1        chromosomal replication initiator protein
Mll5581   Ml         III.A.1        chromosomal replication initiator protein dnaA
XCC0001   Xcc        III.A.1        chromosomal replication initiator
XAC0001   Xac        III.A.1        chromosomal replication initiator
PD0001    Xfpd       III.A.1        chromosomal replication initiator
XF0001    Xfcvc      III.A.1        chromosomal replication initiator
RSc3442   Rs         III.A.1        probable chromosomal replication initiator protein dnaA

Basic concepts: functional category

I - Intermediary metabolism
– Degradation
  • Degradation of polysaccharides and oligosaccharides
  • Degradation of small molecules
  • Degradation of lipids
– Central intermediary metabolism
– Energy metabolism, carbon
– Regulatory functions
II - Biosynthesis of small molecules
III - Macromolecule metabolism
IV - Cell structure
V - Cellular processes
VI - Mobile genetic elements
VII - Pathogenicity, virulence, and adaptation
VIII - Hypothetical
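The category codes shown elsewhere in the slides (for example III.A.1) appear to encode this hierarchy as dot-separated levels. The Python sketch below works under that assumption: it expands a code into its ancestor codes and looks up the top-level name from the list above. The function names are illustrative and are not taken from PABdb.

```python
from typing import List

# Top-level categories as listed on the slide.
TOP_LEVEL = {
    "I": "Intermediary metabolism",
    "II": "Biosynthesis of small molecules",
    "III": "Macromolecule metabolism",
    "IV": "Cell structure",
    "V": "Cellular processes",
    "VI": "Mobile genetic elements",
    "VII": "Pathogenicity, virulence, and adaptation",
    "VIII": "Hypothetical",
}


def category_path(code: str) -> List[str]:
    """Return the code of every level, from the top level down to `code`.

    Assumes dot-separated levels, e.g. "III.A.1".
    """
    parts = code.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def top_level_name(code: str) -> str:
    """Name of the top-level category that `code` falls under."""
    return TOP_LEVEL[code.split(".")[0]]


print(category_path("III.A.1"))   # ['III', 'III.A', 'III.A.1']
print(top_level_name("III.A.1"))  # Macromolecule metabolism
```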

Motivation queries

– Given two or more genomes, what are the genes shared between them and to what families do they belong?

– Given two or more genomes, what are the genes specific to one in relation to the others, and to what families do they belong?

– Given a gene x from an organism not in the system, does it have homologs in the system? If so, how many?

(Figure: hierarchy of the data. Genomes G1, G2, ..., Gk contain replicons R1, R2, ..., Rp; each replicon contains genes; genes are grouped into families (Family1, Family2) and into categories.)

Attributes

Attributes based on GenBank data
– Genome:
  • id, strain, source, taxid, description
– Replicon:
  • id, genome_id, description, sequence
– Genes:
  • id, replicon_id, start_pos, end_pos, gene_synonym, orientation, product, name, gi, category

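Conceptual model

(Figure: entity-relationship diagram relating the entities Genome, Replicon, Gene, Gene Family, Category and BLAST Hits. Cardinality labels recovered from the slide: 1..N, 1..N, N..N, 2:N and 1:N; their assignment to specific relationships is not recoverable from the extracted text.)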

Tables and relationships

– category_tbl: categ_id, categ_description
– gene_blast_tbl: gene_id, blast_type, blast_db, blast_order, blast_gene_id, blast_tax_id, blast_qu_cover, blast_sj_cover, blast_idty, blast_description
– replicon_tbl: replicon_id, genome_id, replicon_description, replicon_sequence
– gene_tbl: gene_id, gene_start_pos, gene_end_pos, replicon_id, gene_synonym, gene_orientation, gene_product, gene_name, gene_category, gene_category_sec, gene_gi
– family_tbl: family_id, description
– gene_family_tbl: family_id, gene_id, genome_id
– genome_tbl: genome_id, genome_strain, genome_source, genome_taxid, genome_description, genome_pab
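The tables above can be tried out quickly with Python's built-in sqlite3 module. The sketch below uses the table and column names from the slide; the column types, primary keys and foreign keys are assumptions, since the slide lists only names, and the slides do not say which DBMS PABdb actually uses.

```python
import sqlite3

# Names come from the slide; types and key constraints are assumed.
SCHEMA = """
CREATE TABLE genome_tbl (
    genome_id          TEXT PRIMARY KEY,
    genome_strain      TEXT,
    genome_source      TEXT,
    genome_taxid       INTEGER,
    genome_description TEXT,
    genome_pab         INTEGER  -- meaning not stated on the slide
);
CREATE TABLE replicon_tbl (
    replicon_id          TEXT PRIMARY KEY,
    genome_id            TEXT REFERENCES genome_tbl(genome_id),
    replicon_description TEXT,
    replicon_sequence    TEXT
);
CREATE TABLE category_tbl (
    categ_id          TEXT PRIMARY KEY,
    categ_description TEXT
);
CREATE TABLE gene_tbl (
    gene_id           TEXT PRIMARY KEY,
    gene_start_pos    INTEGER,
    gene_end_pos      INTEGER,
    replicon_id       TEXT REFERENCES replicon_tbl(replicon_id),
    gene_synonym      TEXT,
    gene_orientation  TEXT,
    gene_product      TEXT,
    gene_name         TEXT,
    gene_category     TEXT REFERENCES category_tbl(categ_id),
    gene_category_sec TEXT REFERENCES category_tbl(categ_id),
    gene_gi           INTEGER
);
CREATE TABLE family_tbl (
    family_id   INTEGER PRIMARY KEY,
    description TEXT
);
CREATE TABLE gene_family_tbl (
    family_id INTEGER REFERENCES family_tbl(family_id),
    gene_id   TEXT REFERENCES gene_tbl(gene_id),
    genome_id TEXT REFERENCES genome_tbl(genome_id),
    PRIMARY KEY (family_id, gene_id)
);
CREATE TABLE gene_blast_tbl (
    gene_id           TEXT REFERENCES gene_tbl(gene_id),
    blast_type        TEXT,
    blast_db          TEXT,
    blast_order       INTEGER,
    blast_gene_id     TEXT,
    blast_tax_id      INTEGER,
    blast_qu_cover    REAL,
    blast_sj_cover    REAL,
    blast_idty        REAL,
    blast_description TEXT
);
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(SCHEMA)

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Storing genome_id in gene_family_tbl duplicates information reachable through gene_tbl and replicon_tbl, but it lets presence and absence queries over genomes run against a single table.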

PABdb information system

Plant Associated Bacteria Database

Main objectives
– management of genome data;
– comparison among genomes;
– clustering of genes in gene families and in categories;
– easy inclusion of new comparison tools.

System overview

(Figure: system overview. Labeled components: several DBMSs and structured files as data sources, converters of data, a local DBMS, BLAST, category and family operations, and user tools.)

Gene Families and Categories

Gene families were created based on BLAST results and on an undirected graph model G.
– the connected components of G are the families.

Gene categories were assigned by
– automatic methods;
– a human curator.
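The graph-based family construction can be sketched as follows, assuming each gene is a vertex and each accepted BLAST hit between two genes is an undirected edge. The slides do not state the criteria for accepting a hit, so the edge list is taken as given here. The function name and the breadth-first traversal are illustrative choices, not the PABdb implementation.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def gene_families(genes: Iterable[str],
                  hits: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Return the connected components of the undirected graph G.

    Vertices are genes; each accepted BLAST hit (a, b) is an edge.
    """
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in hits:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen: Set[str] = set()
    families: List[Set[str]] = []
    for start in genes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        families.append(component)
    return families


# Gene ids are from the gene family slide; the hits and "geneX" are made up.
genes = ["XAC0001", "XCC0001", "XF0001", "geneX"]
hits = [("XAC0001", "XCC0001"), ("XCC0001", "XF0001")]
families = gene_families(genes, hits)
print(families)
```

In this toy input XAC0001 and XF0001 land in the same family without a direct hit between them, because both are linked to XCC0001. A gene with no hits forms a family of its own.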

PABdb – tools

Query tools:
– Query facilitators;

Visualization tools:
– Genome overview;
– Comparison of orthologous genes of two genomes;

Search mechanism

What are the genes in Xanthomonas axonopodis pv. citri and Xylella fastidiosa cvc and not in Xanthomonas campestris pv. campestris and Xylella fastidiosa Temecula1?

Flow (figure): Query facilitator → SQL query → DBMS → result table → XML result file → Browser
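A presence and absence question like this one can be written as a single SQL query over gene_family_tbl (family_id, gene_id, genome_id) from the "Tables and relationships" slide. The sketch below is one possible formulation, run with Python's sqlite3 on invented toy rows; it is not the query the PABdb query facilitator generates, and the helper name is hypothetical. The genome ids follow the abbreviations on the gene family slide (Xac, Xfcvc, Xcc, Xfpd).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene_family_tbl (family_id INTEGER, gene_id TEXT, genome_id TEXT)"
)
# Toy rows, invented for illustration only.
conn.executemany(
    "INSERT INTO gene_family_tbl VALUES (?, ?, ?)",
    [
        (1, "a1", "Xac"), (1, "b1", "Xfcvc"),                    # kept
        (2, "a2", "Xac"), (2, "b2", "Xfcvc"), (2, "c2", "Xcc"),  # also in Xcc
        (3, "a3", "Xac"),                                        # not in Xfcvc
    ],
)


def families_present_absent(conn, present, absent):
    """Families with a member in every `present` genome and in no `absent` genome."""
    present_marks = ", ".join("?" for _ in present)
    absent_marks = ", ".join("?" for _ in absent)
    sql = f"""
        SELECT family_id
        FROM gene_family_tbl
        WHERE genome_id IN ({present_marks})
          AND family_id NOT IN (
              SELECT family_id FROM gene_family_tbl
              WHERE genome_id IN ({absent_marks})
          )
        GROUP BY family_id
        HAVING COUNT(DISTINCT genome_id) = ?
        ORDER BY family_id
    """
    params = [*present, *absent, len(present)]
    return [row[0] for row in conn.execute(sql, params)]


result = families_present_absent(conn, ["Xac", "Xfcvc"], ["Xcc", "Xfpd"])
print(result)
```

Joining the returned family ids back to gene_family_tbl and gene_tbl would give per-gene rows with family, category and product columns.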

Screenshot (1) – search tool

Genes in Xanthomonas axonopodis pv. citri and Xylella fastidiosa cvc and not in Xanthomonas campestris pv. campestris and Xylella fastidiosa Temecula1

family_id  gene_id           categ_id  product
2288       Xac-chromosome    I.D.2     transcriptional regulator
2288       Xfcvc-chromosome  I.D       transcriptional regulator
2730       Xac-chromosome    VI.B      plasmid stability protein
2730       Xfcvc-chromosome  VI.B      plasmid stabilization protein
2739       Xac-chromosome    VIII.A    conserved hypothetical protein
2739       Xfcvc-pXF51       VIII.A    conserved hypothetical protein
3402       Xac-chromosome    I.C.3     cytochrome like B561
3402       Xfcvc-chromosome  I.C.3     cytochrome B561
4520       Xac-chromosome    VI.A      phage-related integrase
4520       Xfcvc-chromosome  VI.A      phage-related integrase
5376       Xac-chromosome    V.B       chromosome partitioning related protein
5376       Xfcvc-chromosome  V.B       chromosome partitioning related protein
5377       Xac-chromosome    VIII.A    conserved hypothetical protein
5377       Xfcvc-chromosome  VIII.A    hypothetical protein
5377       Xfcvc-chromosome  VIII.A    hypothetical protein
5378       Xac-chromosome    VIII.A    conserved hypothetical protein
5378       Xfcvc-chromosome  VIII.A    conserved hypothetical protein
5379       Xac-chromosome    VIII.A    conserved hypothetical protein
5379       Xfcvc-chromosome  VIII.A    hypothetical protein
5380       Xac-chromosome    VIII.A    conserved hypothetical protein
5380       Xfcvc-chromosome  VIII.A    hypothetical protein
5381       Xac-chromosome    VIII.A    conserved hypothetical protein
5381       Xfcvc-chromosome  VIII.A    hypothetical protein
5382       Xac-chromosome    III.A.2   single-stranded DNA binding protein
5382       Xfcvc-chromosome  III.A.2   single-stranded DNA binding protein
5383       Xac-chromosome    III.A.5   cytosine-specific DNA methyltransferase
5383       Xfcvc-chromosome  III.A.5   DNA methyltransferase
5384       Xac-chromosome    VIII.A    conserved hypothetical protein
5384       Xfcvc-chromosome  VIII.A    hypothetical protein
5385       Xac-chromosome    VIII.A    conserved hypothetical protein
5385       Xfcvc-chromosome  VIII.A    hypothetical protein
5386       Xac-chromosome    VIII.A    conserved hypothetical protein
5386       Xfcvc-chromosome  VIII.A    hypothetical protein
5387       Xac-chromosome    VIII.A    conserved hypothetical protein
5387       Xfcvc-chromosome  VIII.A    hypothetical protein
5388       Xac-chromosome    VIII.A    conserved hypothetical protein
5388       Xfcvc-chromosome  VIII.A    hypothetical protein
5389       Xac-chromosome    VI.B      plasmid-related protein
5389       Xfcvc-chromosome  VI.B      conserved plasmid protein
5390       Xac-chromosome    VIII.A    conserved hypothetical protein
5390       Xfcvc-chromosome  VIII.A    hypothetical protein
5391       Xac-chromosome    VIII.A    conserved hypothetical protein
5391       Xfcvc-chromosome  VIII.A    hypothetical protein
5413       Xac-chromosome    VIII.A    conserved hypothetical protein
5413       Xfcvc-chromosome  VIII.A    hypothetical protein
5414       Xac-chromosome    VIII.A    conserved hypothetical protein
5414       Xfcvc-chromosome  VIII.A    hypothetical protein

Search mechanism

Given the genomes Xanthomonas axonopodis pv. citri and Xanthomonas campestris pv. campestris, what are the genes shared between them (orthologous genes)? What are the genes specific to one genome in relation to the other?

Flow (figure): Query facilitator → two SQL queries → DBMS → result tables → XML result file / SVG result file → Visualization tool
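The shared versus specific split for two genomes reduces to set operations on the gene families found in each, and the outcome can then be serialized for a downstream viewer. The Python sketch below illustrates both steps. The XML element and attribute names are invented for this example, since the slides do not describe the actual format of the XML result file, and the family ids are toy data.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set


def compare_genomes(families_a: Set[int], families_b: Set[int]) -> Dict[str, Set[int]]:
    """Split gene family ids into shared and genome-specific groups."""
    return {
        "shared": families_a & families_b,
        "specific_a": families_a - families_b,
        "specific_b": families_b - families_a,
    }


def to_xml(comparison: Dict[str, Set[int]]) -> str:
    """Serialize the comparison; element names are hypothetical."""
    root = ET.Element("comparison")
    for group, family_ids in comparison.items():
        group_el = ET.SubElement(root, "group", name=group)
        for family_id in sorted(family_ids):
            ET.SubElement(group_el, "family", id=str(family_id))
    return ET.tostring(root, encoding="unicode")


# Invented family ids standing in for the families found in each genome.
xac_families = {1, 2, 3}
xcc_families = {2, 3, 4}

comparison = compare_genomes(xac_families, xcc_families)
xml_text = to_xml(comparison)
print(xml_text)
```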

Screenshot (2) – search tool

Xanthomonas axonopodis pv. citri chromosome compared with Xanthomonas campestris pv. campestris chromosome

Search mechanism

Given the genomes Xanthomonas axonopodis pv. citri and Xanthomonas campestris pv. campestris, what are the genes shared between them (orthologous genes)?

Flow (figure): Query facilitator → SQL query → DBMS → result table → XML result file / SVG result file → Visualization tool

Screenshot (3) – visualization tool

Comparison of orthologous genes of Xanthomonas axonopodis pv. citri and Xanthomonas campestris pv. campestris

Distribution of genes of each genome by category

Conclusions

Information systems for genome data management must be scalable and allow the exchange of data and operations;

This work presented a simple but flexible and extensible data model for comparative genomics, a first step in the design of a large information system;

The data model was used in a real application (the PABdb system).

Future work

Extend the data model to a richer context (e.g. metabolic pathways);

Extend the model to include subdivisions between “family” and “category”;

Use of metadata to describe services and data;

Use of different methods to generate the gene families.

Thank you!

Laboratory for Bioinformatics (LBI): www.lbi.ic.unicamp.br
Institute of Computing (IC): www.ic.unicamp.br
University of Campinas (UNICAMP): www.unicamp.br

Luciano Antonio Digiampietri