
WP 12 Contribution to Integr8

The HoGenom database: Families of homologous genes from complete genomes

Work Package 12:

Equipe Bioinformatique et Génomique Evolutive

Laboratoire de Biométrie et Biologie Evolutive
Université Claude Bernard - Lyon 1

Simon Penel, Julien Grassot, Manolo Gouy, Guy Perrière, Laurent Duret.

Pôle Bio-Informatique Lyonnais

• Introduction:
  – Databases for phylogenomics

• Achievements of WP12 Milestones
  – Automatised updating procedure
  – Development of a database of homologous genes from complete genomes

• Perspectives

Databases of homologous gene families for comparative genomics

• Goal:
  – Provide easy access to all the information that can be drawn from the comparison of homologous sequences

• General approach:
  – Search for sequence similarities
  – Clustering of homologous sequences into families
  – Analysis of sequence families (multiple alignment, profile, phylogenetic tree, ...)
  – Query software and user interface to retrieve and display relevant information

Domain vs. gene families

• Modular evolution of protein genes

Families of homologous protein domains:
  - Evolution by domain shuffling (duplication, loss, translocation)

Gene families:
  - Evolution of homologous genes by speciation or by gene duplication
  - Sequences are homologous over their entire length (or almost)

Different databases for different purposes

• Databases of protein domains (InterPro, etc.) WP5
  – Prediction of the biochemical activity of proteins:
      Does this protein have a kinase catalytic site? Does it contain a DNA binding domain? …
  – Prediction of protein structures:
      Does this protein contain a domain homologous to an already known 3D structure?
  – …

Different databases for different purposes

• Databases of gene families (WP12): identify orthologues or paralogues within a given set of taxa

• Examples of typical queries:
  – Identify all orthologues between human, mouse and zebrafish
      • Prediction of gene function
      • Phylogenetics
      • Comparative mapping
  – Identify all paralogous genes originating from a duplication in the last common ancestor of vertebrates
      • Evolution of the function of duplicated genes
      • Analysis of genome duplications
  – Identify all the genes that are specific to a pathogenic strain of E. coli
  – …

Orthology/Paralogy

Homology: two genes are homologous if they share a common ancestor

[Figure: tree of the insulin gene family, starting from the ancestral insulin gene. A speciation separates primates (human INS) from rodents; a duplication in rodents gives INS1 (rat INS1, mouse INS1) and INS2 (rat INS2, mouse INS2).]

Orthologs: homologs that have diverged after a speciation

Paralogs: homologs that have diverged after a duplication
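
The two definitions can be applied mechanically once every internal node of a gene tree is labelled as a speciation or a duplication. The sketch below is only an illustration: the nested-tuple encoding of the insulin tree and the function names are assumptions made for this example, not part of HoGenom. A pair of genes is classified by the event at their last common ancestor.

```python
# Hypothetical encoding of the insulin tree from the slide:
# an internal node is (event, left, right), a leaf is a gene name.
INSULIN_TREE = (
    "speciation",                       # primates / rodents split
    "Human_INS",
    ("duplication",                     # duplication in the rodent lineage
     ("speciation", "Rat_INS1", "Mouse_INS1"),
     ("speciation", "Rat_INS2", "Mouse_INS2")),
)


def leaves(node):
    """Return the gene names found below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def relationships(node):
    """Map each pair of genes to 'ortholog' or 'paralog'.

    Two genes taken from the two sides of a node have that node as their
    last common ancestor, so the event at that node decides the relationship.
    """
    if isinstance(node, str):
        return {}
    event, left, right = node
    result = {}
    result.update(relationships(left))
    result.update(relationships(right))
    kind = "ortholog" if event == "speciation" else "paralog"
    for a in leaves(left):
        for b in leaves(right):
            result[frozenset((a, b))] = kind
    return result


REL = relationships(INSULIN_TREE)
```

With this encoding, human INS comes out as an ortholog of both rodent genes, while INS1 and INS2 are paralogs even when taken from different rodent species.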

Data for Phylogenomics

• Searching for homologues and interpreting homologies (orthology, paralogy) requires:
  – Finding similarities.
  – Computing multiple alignments.
  – Building phylogenetic trees.
  – Reference taxonomic data.
  – Access to sequence databank annotations.

Databases content

• Databases of homologous genes
  – HOVERGEN: vertebrates (Duret et al., 1994)
  – HOBACGEN: bacteria and archaea (Perrière et al., 2000)
  – HOGENOM: fully sequenced organisms

• Protein sequences from SWISS-PROT/TrEMBL.
• Nucleotide sequences from EMBL.
• Taxonomic data (NCBI).
• Homologous genes classified into families.
• Multiple alignments.
• Phylogenetic trees.


WP 12 Milestones

Milestone 12 Automatised updating procedure

Milestone 24 Development of a database of homologous genes from complete genomes: HoGenom

Milestone 36 Development of tools for automatic analysis of phylogenetic trees

Building of HoGenom: general view

  – Selection of the protein sequences of fully sequenced organisms on the EBI proteome site.
  – Sequence comparison with BLAST on the whole sequence dataset.
  – Clustering of the sequences into gene families on the basis of sequence similarity (transitive association).
  – Addition of the gene family information to the protein sequence annotations.
  – For each family: protein alignments and phylogenetic trees.

Hogenprot: Q9DCD0

ID Q9DCD0 PRELIMINARY; PRT; 483 AA.
AC Q9DCD0;
DT 01-JUN-2001 (TrEMBLrel. 17, Created)
DT 01-JUN-2001 (TrEMBLrel. 17, Last sequence update)
DT 01-MAR-2002 (TrEMBLrel. 20, Last annotation update)
DE 0610042A05RIK PROTEIN.
GN 0610042A05RIK.
OS Mus musculus (Mouse).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
OX NCBI_TaxID=10090
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=C57BL/6J; TISSUE=KIDNEY;
RX MEDLINE=21085660; PubMed=11217851;
RA Kawai J., Shinagawa A., Shibata K., Yoshino M., Itoh M., Ishii Y., ----
RA Hayashizaki Y.;
RT "Functional annotation of a full-length mouse cDNA collection.";
RL Nature 409:685-690(2001).
CC -!- CATALYTIC ACTIVITY: 6-PHOSPHO-D-GLUCONATE + NADP(+) = D-RIBULOSE
CC 5-PHOSPHATE + CO(2) + NADPH.
CC -!- PATHWAY: HEXOSE MONOPHOSPHATE SHUNT.
CC -!- SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
CC FAMILY.
CC -!- GENE_FAMILY: HBG000005 [ FAMILY / ALN / TREE ]
DR EMBL; AK002894; BAB22439.1; -.
DR HSSP; P00349; 2PGD.
DR MGD; MGI:1914101; 0610042A05Rik.
DR InterPro; IPR001744; 6PGD.
DR Pfam; PF00393; 6PGD; 1.
DR PRINTS; PR00076; 6PGDHDRGNASE.
DR PROSITE; PS00461; 6PGD; 1.
DR PRODOM; Q9DCD0.
DR SWISS-2DPAGE; Q9DCD0.
KW NADP; Oxidoreductase; Pentose shunt.
FT DOMAIN 5 60 PRODOM:2001.3:PD001594 134
FT DOMAIN 63 296 PRODOM:2001.3:PD001025 91
FT DOMAIN 316 469 PRODOM:2001.3:PD001549 79
SQ SEQUENCE 483 AA; 53247 MW; CD0A3F72EEC2831E CRC64;
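
The gene family is carried by the added `CC -!- GENE_FAMILY:` comment line, so it can be read back with a simple line scan. The following is a minimal sketch, assuming entries are available as plain text in the layout of the entry above; the function name and the regular expressions are illustrative choices, not code from the HoGenom or ACNUC software.

```python
import re

# A few lines copied from the Hogenprot entry Q9DCD0 shown on the slide.
ENTRY = """\
ID Q9DCD0 PRELIMINARY; PRT; 483 AA.
AC Q9DCD0;
CC -!- PATHWAY: HEXOSE MONOPHOSPHATE SHUNT.
CC -!- GENE_FAMILY: HBG000005 [ FAMILY / ALN / TREE ]
DR EMBL; AK002894; BAB22439.1; -.
"""

# Assumed patterns: line code, whitespace, then the field of interest.
ACCESSION_RE = re.compile(r"^AC\s+([A-Z0-9]+);")
GENE_FAMILY_RE = re.compile(r"^CC\s+-!-\s+GENE_FAMILY:\s+(\S+)")


def accession_and_family(entry_text):
    """Return (primary accession, gene family id); either may be None."""
    accession = family = None
    for line in entry_text.splitlines():
        if accession is None:
            match = ACCESSION_RE.match(line)
            if match:
                accession = match.group(1)
        match = GENE_FAMILY_RE.match(line)
        if match:
            family = match.group(1)
    return accession, family


RESULT = accession_and_family(ENTRY)
```

An entry without a `GENE_FAMILY` line yields `None` for the family, which is one way an orphan sequence could be represented.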

Building of HoGenom: general view

  – Selection of the protein sequences of fully sequenced organisms on the EBI proteome site.
  – Sequence comparison with BLAST on the whole sequence dataset.
  – Clustering of the sequences into gene families on the basis of sequence similarity (transitive association).
  – Addition of the gene family information to the protein sequence annotations.
  – EMBL cross-reference calculations, selection of nucleotide sequences.
  – Addition of the gene family information to the EMBL/GenBank nucleotide annotations.
  – For each family: protein alignments and phylogenetic trees.
  – ACNUC protein database.

Hogennucl: AK002894.PE1

AK002894.PE1 Location/Qualifiers
FT CDS_pept 76..1527
FT /codon_start=1
FT /db_xref="MGD:MGI:1914101"
FT /db_xref="SWISS-PROT:Q9DCD0"
FT /note="data source:SPTR, source key:P52209, evidence:ISS"
FT /note="homolog to 6-PHOSPHOGLUCONATE DEHYDROGENASE,
FT DECARBOXYLATING (EC 1.1.1.44)"
FT /note="putative"
FT /transl_table=1
FT /gene_family="HBG000005"
FT /protein_id="BAB22439.1"
FT /translation="MAQADIALIGLAVMGQNLILNMNDHGFVVCAFNRTVSKVDDFLAN
FT EAKGTKVVGAQSLKDMVSKLKKPRRVILLVKAGQAVDDFIEKLVPLLDTGDIIIDGGNS
FT EYRDTTRRCRDLKAKGILFVGSGVSGGEEGARYGPSLMPGGNKEAWPHIKAIFQAIAAK
FT VGTGEPCCDWVGDEGAGHFVKMVHNGIEYGDMQLICEAYHLMKDVLGMRHEEMAQAFEE
FT WNKTELDSFLIEITANILKYRDTDGKELLPKIRDSAGQKGTGKWTAISALEYGMPVTLI
FT GEAVFARCLSSLKEERVQASQKLKGPKVVQLEGSKKSFLEDIRKALYASKIISYAQGFM
FT LLRQAATEFGWTLNYGGIALMWRGGCIIRSVFLGKIKDAFERNPELQNLLLDDFFKSAV
FT DNCQDSWRRVISTGVQAGIPMPCFTTALSFYDGYRHEMLPANLIQAQRDYFGAHTYELL
FT TKPGEFIHTNWTGHGGSVSSSSYNA"
atggcccaag ctgacattgc actgatcgga ctggctgtca tgggccagaa cttaattttg 60
aacatgaatg atcatggatt tgtggtctgt gctttcaata ggacagtctc caaagtcgat 120
….
ccctgcttca ctactgccct ctccttctat gatgggtaca gacacgagat gctgccagca 1320
aacctcatcc aggctcaacg ggattacttt ggggctcaca cctatgaact cttaaccaaa 1380
ccgggagaat ttatccacac caactggacg ggccacgggg gcagtgtgtc atcctcttca 1440
tacaatgcct ag 1452
//

Building of HoGenom: general view

  – Selection of the protein sequences of fully sequenced organisms on the EBI proteome site.
  – Sequence comparison with BLAST on the whole sequence dataset.
  – Clustering of the sequences into gene families on the basis of sequence similarity (transitive association).
  – Addition of the gene family information to the protein sequence annotations.
  – EMBL cross-reference calculations, selection of nucleotide sequences.
  – Addition of the gene family information to the EMBL/GenBank nucleotide annotations.
  – For each family: protein alignments and phylogenetic trees.
  – ACNUC protein database.
  – ACNUC nucleotide database.

Automatised updating procedure:

Iterative sequence comparison with BLAST

  – Compare new sequences with themselves
  – Compare new sequences with old sequences

[Figure: comparison matrix with "Sequences in the database" on both axes, each axis divided into old and new sequences, shown for release 1 and release 2.]
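
The benefit of comparing only the new sequences can be estimated by counting query/subject pairs. This is a rough sketch under a simplifying assumption made here, namely that the cost is proportional to the number of ordered sequence pairs; the function name is hypothetical.

```python
def comparison_counts(n_old, n_new):
    """Return (incremental, full) numbers of ordered sequence pairs.

    incremental: new x new + new x old, as in the update procedure.
    full: every sequence against every sequence, as in a rebuild.
    """
    new_vs_new = n_new * n_new
    new_vs_old = n_new * n_old
    incremental = new_vs_new + new_vs_old
    full = (n_old + n_new) ** 2
    return incremental, full


# Toy release: 3 old sequences, 2 new ones.
INCREMENTAL, FULL = comparison_counts(3, 2)
```

The pairs that are skipped are those whose query is an old sequence, which were already searched when building the previous release.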

Automatised updating procedure:

Family construction 1: similarity search

SWISS-PROT + TrEMBL: New x New + New x Old

BLASTP
  – BLOSUM62
  – E ≤ 10^-4
  – Filtering (SEG)

Local pairwise alignments
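
The search settings listed on the slide can be written down as a command line. The sketch below uses the option names of the current NCBI BLAST+ `blastp` program, which is a substitution made for this example: the slides do not say which BLAST executable or options were used. File and database names are hypothetical, and the commands are only built, not executed.

```python
def blastp_command(query_fasta, database, out_path):
    """Build a BLAST+ blastp argument list with the settings from the slide."""
    return [
        "blastp",
        "-query", query_fasta,
        "-db", database,
        "-matrix", "BLOSUM62",   # substitution matrix
        "-evalue", "1e-4",       # E <= 10^-4
        "-seg", "yes",           # SEG low-complexity filtering
        "-outfmt", "6",          # tabular output, one HSP per line
        "-out", out_path,
    ]


def update_searches(new_fasta, new_db, old_db):
    """Return the two searches of an update: new x new and new x old."""
    return [
        blastp_command(new_fasta, new_db, "new_x_new.tsv"),
        blastp_command(new_fasta, old_db, "new_x_old.tsv"),
    ]


COMMANDS = update_searches("new.fasta", "new_db", "old_db")
```

Each list could be passed to `subprocess.run` on a machine where BLAST+ and the formatted databases are installed.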

Automatised updating procedure:

Family construction 2: Selection of consistent HSPs

[Figure: HSPs between Seq. A and Seq. B. Upper panel labels: S1, S2, S3, S4. Lower panel labels: S1', S2, lgHSP1, lgHSP2, ∆lg1, ∆lg2, ∆lg3.]

Automatised updating procedure:

Family construction 3: Clustering into families

Pairwise criteria:
  – HSP ≥ 80 % length
  – Similarity ≥ 50 %

[Figure: A is linked to B and A is linked to C, so A, B and C are grouped into one cluster: Cluster A, B, C.]

1: Clustering of complete sequences into families
2: Adding partial sequences to the families defined previously
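
Transitive association amounts to single-linkage clustering: if A is linked to B and A to C, all three end up in one family. The sketch below illustrates this with a union-find structure. It assumes each pair comes with two fractions, HSP coverage of the sequence length and similarity, and it leaves open how these are measured; the function name and input layout are illustrative, not the HoGenom implementation.

```python
def cluster_families(pairs, min_coverage=0.80, min_similarity=0.50):
    """Group sequences into families by transitive association.

    pairs: iterable of (seq_a, seq_b, hsp_coverage, similarity),
    with coverage and similarity given as fractions between 0 and 1.
    Returns a sorted list of sorted member lists.
    """
    parent = {}

    def find(x):
        # Register unseen sequences, then walk up to the root
        # (with path halving to keep the trees shallow).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, coverage, similarity in pairs:
        root_a, root_b = find(a), find(b)
        if coverage >= min_coverage and similarity >= min_similarity:
            parent[root_a] = root_b

    families = {}
    for seq in parent:
        families.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(members) for members in families.values())


FAMILIES = cluster_families([
    ("A", "B", 0.90, 0.60),   # passes both thresholds
    ("A", "C", 0.85, 0.55),   # passes both thresholds
    ("C", "D", 0.50, 0.90),   # HSP too short: no link
])
```

In this toy input, B and C are never compared directly, yet they share a family through A, while D stays alone.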

Automatised updating procedure:

Family construction 4: Alignments and trees

  – Protein family (sequences A to G)
  – Multiple alignment: CLUSTAL W, default parameters
  – Phylogenetic tree: BIONJ (neighbor joining, observed divergence)
  – Partial sequences: distance matrix with missing values
  – Rooting: mid-point

[Figure: protein family A to G, its multiple alignment, and the resulting phylogenetic tree with leaves A to G.]
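
Observed divergence is the fraction of compared alignment columns at which two sequences differ. The sketch below shows one way missing values can arise for partial sequences: if two aligned sequences share no ungapped column, no distance is defined. The gap handling and the use of `None` for a missing value are assumptions made for this example; the slides do not describe how the distance matrix is stored.

```python
def observed_divergence(seq_a, seq_b, gap="-"):
    """Fraction of differing sites among columns where both have a residue.

    Returns None (a missing value) when the two aligned sequences do not
    overlap at all, as can happen with partial sequences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = differing = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        return None
    return differing / compared


def distance_matrix(alignment):
    """alignment: dict name -> aligned sequence. Returns {(a, b): distance}."""
    names = sorted(alignment)
    return {
        (a, b): observed_divergence(alignment[a], alignment[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


MATRIX = distance_matrix({
    "full": "ACDEFGHK",
    "head": "ACDE----",   # N-terminal fragment
    "tail": "----FGHR",   # C-terminal fragment
})
```

Here the two fragments have no column in common, so their distance is missing, while each fragment still has a distance to the full-length sequence.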

The HoGenom database: Families of homologous genes from complete genomes

Month 24 Deliverable

Improvements in computation time

• Collaboration with IN2P3 Computing Center (Lyon)
  – CPU: about 1000 processors (Sun, Linux)
  – Disk storage: about 700 Tb
  – Batch queuing system (BQS)

• Building of HOGENOM (September 2003):
  – Total BLAST real time (800 Linux processors): 30 h
  – 310,000 new sequences
  – 112,000 old sequences

Effect of parallelisation: ~ 2 months with local resources, ~ 1 day with IN2P3 resources

HoGenom ACNUC contents
8th September 2003

HoGenom Proteins: 423,577 sequences

HoGenom Nucleotide Sequences: 448,582 CDS

117 fully sequenced organisms

Data source
Protein data from EBI: non-redundant complete proteome sets (SWISS-PROT, TrEMBL, TrEMBLnew), http://www.ebi.ac.uk/proteome, June 2003

Genomic data from EMBL, June 2003

117 organisms

423 577 protein sequences

[Figure: two pie charts. Labels recovered from the slide: 10, 16 and 91 (organisms); 31%, 9% and 60%.]

Eukaryotes listed on the slide:
  – Arabidopsis thaliana (plant)
  – Caenorhabditis elegans (nematode)
  – Drosophila melanogaster (fly)
  – Encephalitozoon cuniculi (microsporidia)
  – Guillardia theta (alga)
  – Homo sapiens (man)
  – Mus musculus (mouse)
  – Rattus norvegicus (rat)
  – Saccharomyces cerevisiae (yeast)
  – Schizosaccharomyces pombe (fungus)

41 907 families
423 577 protein sequences

Sequences belonging to a family: 305 514 (72%)

Orphan sequences: 115 373 (27%)


Access to HoGenom is available at the PBIL: http://pbil.univ-lyon1.fr/

Web page of HoGenom : http://pbil.univ-lyon1.fr/databases/hogenom.html

Databases access on the Web (Perrière et al. 2003)

Two main WWW interfaces

• WWW Query
  – Multiple query on sequences (Guy Perrière)
  – Multiple query on families
  – http://pbil.univ-lyon1.fr/search/query_fam.php

• Cross Taxa
  – Search for families according to complex taxonomic criteria
  – Selection of families
  – http://pbil.univ-lyon1.fr/search/cross_fam.php

• Cross-references with external databases, integration into Integr8 (WP2)

Cross Taxa: Selection of gene families

Example: selecting families of animal-specific genes
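
A taxonomic selection of this kind can be pictured as a filter on the lineages of the members of each family. The sketch below is not the Cross Taxa implementation: the family identifiers, the toy lineages and the function name are all hypothetical, and a family is called specific to a taxon here when every member lineage contains that taxon.

```python
# Hypothetical toy data: family id -> lineage of each member sequence,
# written in the style of the OC lines of a sequence entry.
TOY_FAMILIES = {
    "FAM_A": [
        ("Eukaryota", "Metazoa", "Chordata"),     # e.g. a mouse gene
        ("Eukaryota", "Metazoa", "Arthropoda"),   # e.g. a fly gene
    ],
    "FAM_B": [
        ("Eukaryota", "Metazoa", "Chordata"),
        ("Eukaryota", "Fungi"),                   # e.g. a yeast gene
    ],
    "FAM_C": [
        ("Bacteria", "Proteobacteria"),
    ],
}


def families_specific_to(families, taxon):
    """Return the ids of families whose members all belong to `taxon`."""
    return sorted(
        family_id
        for family_id, lineages in families.items()
        if lineages and all(taxon in lineage for lineage in lineages)
    )


ANIMAL_SPECIFIC = families_specific_to(TOY_FAMILIES, "Metazoa")
```

More complex criteria, such as requiring one taxon and excluding another, could be expressed by combining several such tests.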


Family Page


Cross-references with external databases and integration (WP2)

1 sequence: associated family

Display the family, alignment and phylogenetic tree associated with a sequence accession number via a URL link:

http://pbil.univ-lyon1.fr/cgi-bin/acnuc-link-ac2fam?db=HAMAPprot&query=Q8ZY16

Example: sequence Q8ZY16 in NiceProt, with cross-references to HAMAP-ACNUC and HOBACGEN
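
Such a link can be generated from a database name and an accession number. The sketch below only reproduces the URL pattern shown on the slide, with its two parameters `db` and `query`; whether the script accepts other parameters or database names is not stated in the slides, and the function name is made up for this example.

```python
from urllib.parse import urlencode

# Base address taken from the example URL on the slide.
AC2FAM_URL = "http://pbil.univ-lyon1.fr/cgi-bin/acnuc-link-ac2fam"


def family_link(database, accession):
    """Build the link from a sequence accession number to its family."""
    return AC2FAM_URL + "?" + urlencode({"db": database, "query": accession})


LINK = family_link("HAMAPprot", "Q8ZY16")
```

An external database can store only the accession number and build the link when a page is displayed.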

Next steps

• Milestone 36
  – Development of tools for phylogenetic tree analysis, automatic orthology and paralogy relationship assignment (J.F. Dufayard)
  – Phylogenetic profiles
  – Collaboration with WP3: cross-references between genome CDS and complete proteome

Acknowledgements

People from BBE:
  Laurent Duret
  Manolo Gouy
  Julien Grassot
  Simon Penel
  Guy Perrière

SWISS-PROT group:
  Alexandre Gattiker (S, HAMAP)

INRIA:
  Jean-François Dufayard

Databases of homologous genes

• Databases of homologous genes at PBIL:
  – HOVERGEN (1994): vertebrates
  – HOBACGEN (2000): prokaryotes
  – HOGENOM: complete genomes
  – RTKdb: receptor tyrosine kinases (J. Grassot, G. Mouchiroud)
  – NuReBase: nuclear receptors (M. Robinson, V. Laudet)

• Goals

• Database content and updating

• Query software

Databases for comparative genomics

• Databases of homologous protein domains
  – PROSITE
  – PFAM
  – PRODOM
  – ...
  – InterPro

• Databases of gene families
  – COG
  – HOBACGEN, HOVERGEN
  – ...

Comparative genomics

• Functional genomics:
  – Prediction of gene function, protein structure
  – Identification of functional constraints
  – Identification of regulatory elements
  – ...

• Molecular evolution studies:
  – Search for horizontal transfers
  – Species-specific metabolic pathways
  – Ancestral genome content
  – Gene, genome duplication and acquisition of novel functions
  – …

Gene duplication and evolution of function

[Figure: timeline following a gene duplication, with the labels: Pseudogene; Ancient paralogs; Specific function.]

Specific function: e.g. expression pattern, subcellular localisation, biochemical activity, ...

Phylogenomic approach for function prediction

1) Identify homologs
2) Align sequences
3) Compute phylogenetic tree
4) Place known functions in the tree
5) Infer the likely function of other genes

[Figure: genes 1A, 1B, 2A, 2B, 3A and 3B from species 1, 2 and 3, related by a gene duplication, shown at each step.]


Improvements in computation time

Previous computation time (Sun Sparc Ultra 900 MHz, 2 Gb RAM):

• Updating the HOVERGEN database (April 2002)
  – 137,000 old + 33,000 new sequences (51 × 10^6 aa)
  – BLAST comparison (new x old + new x new): 23 days
  – Multiple alignments (Clustalw): 4 days
  – Phylogenetic trees (BioNJ, no bootstrap): 0.5 day
  – Total: 28 days (1 processor)

Calculation time was a bottleneck for frequent updates of several databases

• General overview on WP12 research
  – Phylogenomics
  – Databases of homologous gene families
  – Family construction

• The HOGENOM database
  – Building
  – Results

• Access to databases
  – Database query via a web server
  – Database cross-references via URLs

WP 12 Research fields:

• Proteome/genome comparative analysis
• Phylogenetic studies
• Orthology/Paralogy relationship assignments
• Development of generalist databases, specialised databases
  – HOVERGEN: families of homologous vertebrate genes
  – HOBACGEN: families of homologous bacterial genes
  – HOGENOM: families of homologous genes from complete genomes
  – NuReBase, RTKdb, Hoppsigen, Mitalib, ...

• Identification of important regions in genomic sequences
• Evolution at the molecular level
• Species phylogeny
• Function prediction


Application to other databases

Any sequence database can be structured under ACNUC and queried with WWW-Query. Currently available:
  • SWISS-PROT,
  • EMBL,
  • GenBank,
  • etc.

Any family database can be structured under ACNUC and queried with WWW-Query and Cross-Taxa

For example, an ACNUC version of the HAMAP database developed by SWISS-PROT is currently available at the PBIL

Ortholog ≠ Functional equivalent !!

Orthology: not necessarily a one-to-one relationship (one-to-many or many-to-many)

e.g.: the human INS gene has two orthologs in rodents (Ins1 and Ins2)

The rodent Ins1 gene is more closely related to its paralog Ins2 than to its human ortholog INS.

[Figure: tree of the insulin gene family, starting from the ancestral insulin gene. A speciation separates primates (human INS) from rodents; a duplication in rodents gives INS1 (rat INS1, mouse INS1) and INS2 (rat INS2, mouse INS2).]