Sonia Abdelhak, Institut Pasteur Tunis
Ahmed Rebaï, Centre of Biotechnology Sfax
Fredj Tekaia, Institut Pasteur Paris
Genomes Databases and Open Access Bibliographic Resources
Outline
• General introduction and overview of complete genome sequences
• Genomes databases and where to find them
• Comparative Genomics Databases
• Other Omics resources
• Bibliographic/Open access resources
Why databases?
• In the genomic era we have billions of data points that need to be stored, curated and made accessible for analysis and knowledge discovery
• Databases are essential resources for both experimental and computational biologists
• We have crossed the Terabyte threshold of genomic data (Huge, massive, explosion!)
Chronology of completely sequenced genomes
• 1977: first viral genome (5386 base pairs; encoding 11 genes). Sanger et al. sequenced bacteriophage φX174.
• 1981: Human mitochondrial genome. 16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
• 1986: Chloroplast genome. 156,000 base pairs (most are 120 kb to 200 kb)
• 1995: first genome of a free-living organism, the bacterium Haemophilus influenzae, by TIGR; 1830 kb; 1713 genes.
• 1996: first archaeal genome: Methanococcus jannaschii DSM 2661, by TIGR; 1664 kb; 1773 genes.
• 1997: first eukaryotic genome: Saccharomyces cerevisiae S288C; international collaboration; 16 chromosomes; 12,057 kb; ~6000 genes.
• 1998: first multicellular organism: the nematode Caenorhabditis elegans; 97 Mb; ~19,000 genes.
• 1999: first human chromosome: chromosome 22 (49 Mb; 673 genes)
• 2000: fruit fly Drosophila melanogaster (137 Mb; ~13,000 genes)
• 2000: first plant genome: Arabidopsis thaliana (115.4 Mb; 22,670 genes)
• 2001: draft sequence of the human genome (3300 Mb; ~28,000 genes)
• 2002: Plasmodium falciparum (22.9 Mb; 5334 genes)
• 2002: mouse genome (2700 Mb; ~28,000 genes)
• 2004: draft genome of the fish Tetraodon nigroviridis (x Mb; ~28,000 genes)
• 2005: dog (41 Mb; 33,651 genes) and chicken (18,031 genes) genomes
http://www.genomesonline.org/
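The sizes and gene counts in the chronology can be compared more directly as gene density (genes per megabase). The following is a minimal sketch using only figures quoted in the list above; the names `GENOMES`, `gene_density` and `density_table` are illustrative and not part of any database API.

```python
# Genome size (Mb) and gene count, as quoted in the chronology above.
GENOMES = {
    "Haemophilus influenzae": (1.830, 1713),
    "Methanococcus jannaschii": (1.664, 1773),
    "Saccharomyces cerevisiae": (12.057, 6000),
    "Caenorhabditis elegans": (97.0, 19000),
    "Drosophila melanogaster": (137.0, 13000),
    "Homo sapiens (draft)": (3300.0, 28000),
}


def gene_density(size_mb, genes):
    """Return the number of genes per megabase."""
    return genes / size_mb


def density_table(genomes):
    """Map each organism name to its gene density, rounded to one decimal."""
    return {
        name: round(gene_density(size_mb, genes), 1)
        for name, (size_mb, genes) in genomes.items()
    }


if __name__ == "__main__":
    for name, density in density_table(GENOMES).items():
        print(f"{name:28s} {density:8.1f} genes/Mb")
```

With these figures the prokaryotes come out at roughly a thousand genes per Mb, while the human draft is below ten, which is one way to read the size differences in the slides that follow.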
Tree of life
Complete genomes (03-17-07)
• 2467 projects
• 524 published
• 1091 Bacteria
• 59 Archaea
• 720 eukaryotes
• 3 phylogenetic domains
• Lifestyles: mesophiles; (hyper)thermophiles; psychrophiles; extreme conditions, ...
Genome sequencing projects
There are several web-based resources that document the progress of completely sequenced genomes and their reference publication, including:
GOLD, Genomes Online Database: http://www.genomesonline.org/gold.cgi
How big are genome sizes?
Viral genomes: 1 kb to 360 kb (Canarypox virus). Note: Mimivirus: 1.2 Mb. http://www.giantvirus.org/top.html (Top 100 largest viral genome sequences)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 670 Gb
Database of Genome Sizes: http://www.cbs.dtu.dk/databases/DOGS/index.php
[Figure: Genome sizes (megabases). Two bar charts: E. coli, yeast, worm, fly, Fugu and human (axis 0 to 3500); fly, Fugu, human, wheat and amoeba (axis 0 to 600000)]
BIOLOGICAL DATABASE CATEGORIES
• Databases of nucleic acid sequences (RNA, DNA)
• Databases of protein sequences
• Databases of protein motifs and protein domains
• Databases of structures
• Databases of genomes
• Databases of genes
• Databases of expression profiles
• Databases of SNPs and mutations
• Databases of metabolic pathways and protein associations
• Databases of taxonomy
• …
Can we find a list of ‘clean’ databases?
The NAR Database issue
• The 2007 update includes 968 databases, 110 more than the previous one.
• 68 new databases
• updates of 106 existing databases
• The complete database list and summaries are available online on the Nucleic Acids Research web site: http://nar.oxfordjournals.org/
NAR Database Category List
• Nucleotide Sequence Databases
• RNA sequence databases
• Protein sequence databases
• Structure Databases
• Genomics Databases (non-vertebrate)
• Metabolic and Signaling Pathways
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Microarray Data and other Gene Expression Databases
• Proteomics Resources
• Other Molecular Biology Databases
• Organelle databases
• Plant databases
• Immunological databases
• Genomics Databases (non-vertebrate)
– MGD - Mouse Genome Database ?????
– TIGR Gene Indices ?????
– Genome annotation terms, ontologies and nomenclature
– Taxonomy and identification
– General genomics databases
– Viral genome databases
– Prokaryotic genome databases
– Unicellular eukaryotes genome databases
– Fungal genome databases
– Invertebrate genome databases
Three types of genome databases
• Databases that collect data for all sequenced genomes (Entrez_Genomes; EBI_genomes)
• Databases that collect data for a category of organisms with sequenced genomes (Microbial Genomes at TIGR)
• Databases specific to one organism with a sequenced genome (Flybase, MGD, Ensembl)
What kind of information do you find there?
• Genome databases contain genomic information collected from many sources.
– Genome assembly
– Gene predictions
– Known genes, mRNA, ESTs, proteins
– Genetic maps, markers and polymorphisms
– Gene expression and phenotypes
– Annotations
– Interspecies homologues
Resources for genomes
There are two main resources for genomes:
EBI, European Bioinformatics Institute: http://www.ebi.ac.uk/genomes/
NCBI, National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/Genomes/
But there are many other resources from sequencing institutions:
Sanger, The Wellcome Trust Sanger Institute: http://www.sanger.ac.uk/
TIGR, The Institute for Genomic Research: http://cmr.tigr.org/tigr-scripts/CMR/shared/Genomes.cgi
Genolevures: http://cbi.labri.fr/Genolevures/index.php
Databases by phylogenetic groups
Eukaryotic genomes: http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
Bacteria, fungi genomes: http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?p3=11:Fungi&taxgroup=11:Fungi|12:
Insects: http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?p3=12:Insects&taxgroup=11:|12:Insects
Plant genomes: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
...
The (ever expanding) Entrez System
[Figure: Entrez at the centre, linked to PopSet, Structure, PubMed, Books, 3D Domains, Taxonomy, GEO/GDS, UniGene, Nucleotide, Protein, Genome, OMIM, CDD/CDART, Journals, SNP, UniSTS and PubMed Central]
[Figure: Mouse Assembly map view with tracks for RefSeq Contig, BAC, WGS, Other GenBank, RefSeq Transcript and UniGene Transcript; Maps and Options panel]
Common features of genomic databases
• Possibility to download all the sequences of the genome or part of them (chromosomes, clones, genes, CDS, ...)
• Most of them have a corresponding protein resource (the set of proteins obtained by translating all CDS)
• Example: Entrez Genome at NCBI and GenPept
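The idea of a protein resource "obtained by translating all CDS" can be illustrated with a minimal sketch. It assumes the standard genetic code, a CDS already in the correct reading frame, and DNA letters only; the function name `translate_cds` is hypothetical and this is not how any particular database builds its protein set.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base T
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate an in-frame coding sequence into a protein string.

    Unknown codons (for example containing N) become 'X'.
    Translation stops at the first stop codon, which is not emitted.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    protein = []
    for i in range(0, len(cds), 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCAAATGA"))
```

Organelle and some microbial genomes use other genetic codes, so a real pipeline would select the translation table per genome rather than hard-code this one.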
Comparative Genomics databases
Comparative genomics
Analyses of the genetic material of different species help us understand the similarities and differences between genomes, their evolution and the evolution of their genes.
• Intra-genomic comparisons help us understand the degree of duplication (genome regions; genes) and gene organization, ...
• Inter-genomic comparisons help us understand the degree of similarity between genomes and the degree of conservation between genes
• Both contribute to understanding gene and genome evolution
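One simple way to put a number on the "degree of conservation between genes" is percent identity over an existing pairwise alignment. The sketch below assumes the two sequences are already aligned to the same length with "-" as the gap character; `percent_identity` is an illustrative name, and the comparative genomics resources listed in these slides use far more elaborate methods.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical positions among ungapped aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    print(percent_identity("ACGT-A", "ACCTTA"))
```

Ignoring gapped columns is one convention among several; counting gaps as mismatches would give a lower value for the same alignment.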
COGs: Clusters of Orthologous Groups:
http://www.ncbi.nlm.nih.gov/COG/
Internet resources for whole-genome comparative analysis and associated tools
Resource                          URL
UCSC Genome Bioinformatics        http://genome.ucsc.edu/
Ensembl                           http://www.ensembl.org/
MapViewer                         http://www.ncbi.nlm.nih.gov/mapview/
VISTA Genome Browser              http://pipeline.lbl.gov/
K-BROWSER                         http://hanuman.math.berkeley.edu/cgi-bin/kbrowser2
Comparative Regulatory Genomics   http://corg.molgen.mpg.de/
GALA                              http://www.bx.psu.edu/
EnsMart                           http://www.ensembl.org/EnsMart/
ETOPE                             http://www.bx.psu.edu/
PipMaker and MultiPipMaker        http://www.bx.psu.edu/
VISTA server                      http://www-gsd.lbl.gov/vista/
MAVID server                      http://baboon.math.berkeley.edu/mavid/
zPicture server                   http://zpicture.dcode.org/
rVISTA server                     http://rvista.dcode.org/
UCSC Comparative Genomics
NCBI Homo sapiens Genome: Statistics -- Build 36 version 2: Genes 28,961
Some considerations
• Organism specific databases can be more up-to-date than general databases
• Genome databases are not a one-stop shop for all information; other databases like UniProt are still needed!
Bibliographic Databases and Open Access resources
PubMed http://www.pubmed.org/
• Access to more than 12 million papers since 1950 (3790 journals)
• Simple and advanced literature search with keywords, author name, MeSH terms, journals, single citation, ...
• Some papers are free from the journal website or through the editors
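A search combining keywords, author name, MeSH terms and journal can also be expressed as a single query string with field tags. The sketch below only builds the query and a request URL; it assumes NCBI's E-utilities `esearch` endpoint (not mentioned in the slides) and the PubMed field tags `[Author]`, `[MeSH Terms]` and `[Journal]`. The helper names `pubmed_query` and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities search endpoint; not named in the slides.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query(keywords=(), author=None, mesh=None, journal=None):
    """Combine free-text keywords and tagged fields with AND."""
    parts = list(keywords)
    if author:
        parts.append(f"{author}[Author]")
    if mesh:
        parts.append(f"{mesh}[MeSH Terms]")
    if journal:
        parts.append(f"{journal}[Journal]")
    return " AND ".join(parts)


def esearch_url(term, retmax=20):
    """Build (but do not send) an esearch request URL for PubMed."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    term = pubmed_query(["comparative genomics"], author="Tekaia F")
    print(term)
    print(esearch_url(term))
```

The same query string can be pasted into the PubMed search box; sending the URL programmatically is left out here so that the example runs without network access.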
PubMed Central http://www.pubmedcentral.com/
Free access journals
• Authors pay to allow readers to get the papers free
• The BMC initiative
• The PLoS initiative
• Other initiatives: some journals give immediate free online access and others give it a few (1-12) months after publication
BioMed Central (BMC) http://www.biomedcentral.com/
The PLoS initiative http://www.plos.org/
HighWire http://highwire.stanford.edu/
The HINARI initiative
• The Health InterNetwork Access to Research Initiative (HINARI) provides free or very low cost online access to the major journals in biomedical and related social sciences to local, not-for-profit institutions in developing countries.
• HINARI was launched in January 2002, with some 1500 journals from 6 major publishers. 22 additional publishers joined in May 2002, bringing the total number of journals to over 2000.
• Today more than 70 publishers are offering their content in HINARI and others will soon be joining the programme.
And also books!
If you want to learn
Just try and RTM
• General genomics databases
– Animal Genome Size Database
– BacMap
– COG - Clusters of Orthologous Groups of proteins
– CoGenT++
– DEG - Database of Essential Genes
– EBI Genomes
– Entrez Gene
– Entrez Genomes
– ERGO-Light
– GenDiS
– GeneNest
– Genome information broker
– Genome Project Database
– Genome Reviews
– GOLD
– GtRDB - Genomic tRNA Database
– Inparanoid
– Integr8 (formerly Proteome Analysis Database)
– INVHOGEN
– KaryotypeDB
– KEGG - Kyoto Encyclopedia of Genes and Genomes
– MBGD - Microbial Genome Database
– MeGX
– MetaCyc
– NegProt - Negative Proteome database