Overview of current biological databases
Qi Sun, Computational Biology Service Unit, Cornell University

Platforms for Bioinformatics

Web Server, Database Server

SOAP, HTTP, FTP, SQL

Platforms for Bioinformatics

Open source: Linux, Apache, MySQL, Perl/Python/PHP

Microsoft: Windows, ASP.NET, SQL Server, C#

Archival database (GenBank, GenPept)

vs

Computer algorithm generated database (Unigene)

vs

Manually curated database (RefSeq)

Public Database - 1

NCBI Sequence Data Model

The NCBI Data Model

GenBank: a DNA-centered database

Identifier:

1. LOCUS (obsolete)
2. Accession (version)
3. GI
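
The accession (version) identifier pairs a stable accession with a version number that increases when the sequence changes. A minimal sketch of separating the two, assuming the usual `ACCESSION.VERSION` text form; the function name `split_accession` and the example value are illustrative and do not come from the slides:

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier has no '.N' suffix.
    """
    accession, sep, version = identifier.strip().partition(".")
    if not sep:
        return accession, None
    return accession, int(version)


# Hypothetical identifier, used only to show the call.
print(split_accession("NM_123456.2"))
```

Keeping the version separate makes it possible to group records by accession and then pick the latest version of each.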

Features

GenPept: a protein-centered database

FTP sites:

GenBank: ftp://ftp.ncbi.nih.gov/genbank/

GenPept: ftp://ftp.ncifcrf.gov/pub/genpept/

Problems with GenBank and GenPept

• They do not distinguish between sequence categories.

• There is a lot of redundancy.

• The same gene can be deposited into the database many times under different names.

• Different versions of the same gene can be submitted many times with different accession numbers.

• The features of a GenBank record can be chaotic.

Archival database (GenBank, GenPept)

vs

Computer algorithm generated database (Unigene)

vs

Curated database (RefSeq, Locuslink ...)

Public Database - 1

NCBI Sequence Databases

UniGene: a non-redundant set of gene-oriented clusters

Sources clustered into UniGene:

• GenBank mRNAs
• GenBank genomic CDSs
• dbEST ESTs

UniGene identifier

Species prefixes:

• Hs for human
• Mm for mouse
• Rn for rat
• Bt for cow
• Dr for zebrafish
• Dm for fruitfly
• Aga for mosquito
• Xl for frog
• At for cress
• Hv for barley
• Os for rice
• Ta for wheat
• Zm for maize

Examples:

Mm.213407

Hs.13303

At.138
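
A UniGene identifier is a species prefix and a cluster number joined by a dot. The sketch below splits one into those two parts, using the prefix table from this slide; the function name `parse_unigene_id` is an illustrative choice, not part of any NCBI tool:

```python
# Species prefixes as listed on the slide.
UNIGENE_SPECIES = {
    "Hs": "human", "Mm": "mouse", "Rn": "rat", "Bt": "cow",
    "Dr": "zebrafish", "Dm": "fruitfly", "Aga": "mosquito",
    "Xl": "frog", "At": "cress", "Hv": "barley", "Os": "rice",
    "Ta": "wheat", "Zm": "maize",
}


def parse_unigene_id(identifier):
    """Return (species, cluster_number) for an identifier such as 'Hs.13303'."""
    prefix, sep, cluster = identifier.strip().partition(".")
    if not sep or not cluster.isdigit():
        raise ValueError(f"not a UniGene identifier: {identifier!r}")
    # Prefixes missing from the table are reported as 'unknown'.
    return UNIGENE_SPECIES.get(prefix, "unknown"), int(cluster)


for uid in ("Mm.213407", "Hs.13303", "At.138"):
    print(uid, parse_unigene_id(uid))
```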

Archival database (GenBank, GenPept)

vs

Computer generated database (Unigene)

vs

Curated database (RefSeq, Gene ...)

NCBI Sequence Databases

Public Database - 1

NCBI human genome annotation pipeline

RefSeq incorporates predicted transcript and protein sequences, experimentally identified mRNA sequences, and EST sequences.

RefSeq accession numbers:

NT_123456 constructed genomic contigs

NM_123456 mRNAs

NP_123456 proteins

NC_123456 chromosomes

XM_123456 predicted mRNA

XP_123456 predicted protein
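
Because the two-letter prefix encodes the record type, a lookup table is enough to classify an accession. A minimal sketch covering only the six prefixes listed above; the function name `refseq_type` is hypothetical:

```python
# Prefix meanings as given on the slide.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}


def refseq_type(accession):
    """Return the record type for a RefSeq accession, or None if not recognized."""
    prefix, sep, _ = accession.strip().partition("_")
    if not sep:
        return None  # no underscore, so not in RefSeq form
    return REFSEQ_PREFIXES.get(prefix.upper())


print(refseq_type("NM_123456"))
print(refseq_type("XP_123456"))
```

Returning `None` for anything outside the table keeps archival GenBank accessions from being mislabeled as RefSeq records.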

RefSeq? UniGene? GenBank?

• Genome sequence available: RefSeq (acc: NP_123456, et al.)

• EST sequence available: UniGene (acc: Hs.13303, et al.)

• GenBank (acc: AP33493, et al.)

Go to the web

Files that you can download from the NCBI gene database

• gene_info
• gene2refseq
• gene2go
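
These downloads are plain tab-delimited text, so they can be read with a standard CSV parser. The sketch below collects GO IDs per Gene ID from gene2go. It assumes the column order tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, PubMed, Category with a `#` header line; check this against the file you download. The sample rows are made up for illustration and are not real annotations:

```python
import csv
import io
from collections import defaultdict

# Made-up excerpt in the assumed gene2go layout (not real annotation data).
SAMPLE = (
    "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory\n"
    "7227\t39844\tGO:0003674\tND\t-\tmolecular_function\t-\tFunction\n"
    "7227\t39844\tGO:0008150\tND\t-\tbiological_process\t-\tProcess\n"
)


def go_terms_by_gene(handle):
    """Map each GeneID (column 2) to the list of GO IDs (column 3) seen for it."""
    terms = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header comment
        terms[row[1]].append(row[2])
    return dict(terms)


print(go_terms_by_gene(io.StringIO(SAMPLE)))
```

For the real file, pass an open file handle instead of the `io.StringIO` wrapper.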

NCBI Search engine

Entrez

• Boolean operators “AND”, “OR”, “NOT”
• Entrez tags
• using limits
• MeSH terms

Batch Entrez

search by accession list
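
Boolean operators and field tags combine into a single query string that can be pasted into the Entrez search box. A small sketch of building one; the helper names `entrez_term` and `combine` are hypothetical, and the search terms are arbitrary examples:

```python
def entrez_term(value, tag=None):
    """Quote multi-word values and append an optional [Field] tag."""
    text = f'"{value}"' if " " in value else value
    return f"{text}[{tag}]" if tag else text


def combine(operator, *terms):
    """Join terms with AND, OR, or NOT and wrap the result in parentheses."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError(f"unsupported operator: {operator!r}")
    return "(" + f" {operator} ".join(terms) + ")"


query = combine(
    "AND",
    entrez_term("Homo sapiens", "Organism"),
    combine("OR",
            entrez_term("kinase", "Title"),
            entrez_term("phosphatase", "Title")),
)
print(query)
```

The explicit parentheses make the grouping unambiguous when AND and OR are mixed in one query.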

Other Sequence Databases:

Genomic DNA: Ensembl genome annotation database (http://www.ensembl.org; HTTP, FTP, MySQL interface)

Protein: UniProt (http://www.pir.uniprot.org/)

KEGG database: go to the web

Public Database - 2

GO: Gene Ontology

1. Molecular Function
2. Biological Process
3. Cellular Component

http://www.geneontology.org

Public Database - 2

Public Database - 2

GO 3673

• Molecular Function 3674
• Biological Process 8150
• Cellular Component 5575
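
The numbers on this slide are GO term numbers. In data files they are normally written as `GO:` followed by seven zero-padded digits. A minimal sketch of converting between the two forms; the function names are illustrative:

```python
def format_go_id(number):
    """Turn a term number such as 3674 into the padded form 'GO:0003674'."""
    return f"GO:{int(number):07d}"


def parse_go_id(go_id):
    """Turn 'GO:0003674' back into the integer 3674."""
    prefix, sep, digits = go_id.strip().partition(":")
    if prefix != "GO" or not sep or not digits.isdigit():
        raise ValueError(f"not a GO identifier: {go_id!r}")
    return int(digits)


for name, number in (("Molecular Function", 3674),
                     ("Biological Process", 8150),
                     ("Cellular Component", 5575)):
    print(name, format_go_id(number))
```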

GO Example 1:

Biological Process

GO Example 2:

Molecular Function

Gene Ontology Annotation

Smn: survival motor neuron

Gene ID: 39844

Public Database - 4

Species Specific Databases

• Arabidopsis – TAIR
• Yeast – SGD
• Fly – FlyBase
• Worm – WormBase
• Mouse – MGD