Biological Database Systems Denis Shestakov, University of Turku/Tampere


Page 1: Biological Database Systems

Biological Database Systems

Denis Shestakov, University of Turku/Tampere

Page 2: Biological Database Systems

BioDB-1, ShestakovPage 2

Course Information

• Course structure:
  – Lectures: approx. 12 (plus today’s intro and a review lecture at the end of the course)
  – Project work: details will be given next time
  – Exam: easy to pass if the project is done
  – URL:

Page 3: Biological Database Systems

Course Information

• Dates:
  – Period 2: 27.11, 4.12, 11.12
  – Period 3: 10 meetings on Mondays/Wednesdays
• Contact info:
  – Email:
  – ICT, B6019: at 15-18 on Tuesdays

Page 4: Biological Database Systems

Course Information: Literature

• Slides
• References at the end of slides
• Books:
  – Bioinformatics: Managing Scientific Data by Lacroix & Critchlow, Morgan Kaufmann, 2003, ISBN-10: 155860829X
  – Database System Concepts, 5th edition by Silberschatz, Korth & Sudarshan, McGraw-Hill, 2005, ISBN-10: 0072958863
• Articles:
  – Biological database design and implementation by Birney & Clamp (the Ensembl project), Briefings in Bioinformatics, 5(1):31-38, 2004

Page 5: Biological Database Systems

Biological Database Systems

1.1. Course Content
1.2. Course Objectives
1.3. Database and DBMS
1.4. Biological Databases

Page 6: Biological Database Systems

Course content: main topics

1. Database concepts, database design process
2. Relational data model
3. Introduction to SQL
4. XML and XML-based databases
5. Data structures for biological data: storage and querying
6. Model organism databases

Page 7: Biological Database Systems

Course content: main topics

7. LIMS, BioPostgres
8. Analysis workflows, web services
9. Integration of biological data
10. Integration of biological data, example of integration system
11. Research issues in scientific databases
12.* Project discussion, exam preparation

Page 8: Biological Database Systems

Course focus

• Database issues:
  – Biology-specific
  – Representation of biological data
  – Design of biological databases
• NOT about:
  – Usage of existing databases
  – Accessing/retrieving data from bio-databases

Page 9: Biological Database Systems


Course goal

Give basic knowledge of biological* database design

* - for molecular biology

Page 10: Biological Database Systems

Do you need to know that?

• Work in “wet” laboratory:
  – One bioinformatician and many biologists
  – Likely to be IT guru for others
  – Expect to answer IT-related questions
• Work in bioinformatics lab:
  – Many bioinformaticians
  – Group may maintain several dbs
  – Basics are helpful
• Create/maintain biological databases
  – Start learning!
  – Ask for more information

Page 11: Biological Database Systems

Database?

Definition from the Merriam-Webster dictionary: http://www.merriam-webster.com/dictionary/database

Page 12: Biological Database Systems

Database?

• A collection of data:
  – structured
  – searchable (i.e., indexable)
  – updated
  – cross-referenced
• Objective:
  – Transform “meaningless” raw data into useful information which can be accessed and analysed in the best way
• Database Management System (DBMS):
  – software designed for the purpose of managing databases (access, insert, delete, update, etc.)
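
The properties and operations listed on this slide can be illustrated with a minimal sketch using SQLite through Python's standard `sqlite3` module. The table names (`organism`, `gene`), the column names, and the sample rows (including the placeholder symbol `TEMP1`) are assumptions made up for this illustration; they are not taken from the course material.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Structured: every table has a fixed set of typed columns.
# Cross-referenced: gene.organism_id points at organism.organism_id.
# Searchable: an index on gene.symbol speeds up lookups by symbol.
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    symbol      TEXT NOT NULL,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
);
CREATE INDEX idx_gene_symbol ON gene(symbol);
""")

# Insert (hypothetical sample rows).
conn.execute("INSERT INTO organism (organism_id, name) VALUES (1, 'Homo sapiens')")
conn.executemany(
    "INSERT INTO gene (symbol, organism_id) VALUES (?, ?)",
    [("P53", 1), ("BRCA1", 1), ("TEMP1", 1)],
)

# Update: replace an outdated symbol with the current one.
conn.execute("UPDATE gene SET symbol = 'TP53' WHERE symbol = 'P53'")

# Delete: remove the placeholder row.
conn.execute("DELETE FROM gene WHERE symbol = 'TEMP1'")

# Access: join the two tables through the cross-reference.
rows = conn.execute(
    "SELECT g.symbol, o.name "
    "FROM gene AS g JOIN organism AS o USING (organism_id) "
    "ORDER BY g.symbol"
).fetchall()
print(rows)
```

SQLite is used here only because it ships with Python; any relational DBMS offers the same four kinds of operation.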

Page 13: Biological Database Systems

DBMS

• A set of tools that:
  – Store
  – Extract
  – Modify

[Diagram: Database, accessed by USERS through Store, Extract and Modify]

Page 14: Biological Database Systems

Biological Databases?

Explosive growth in biological data
• E.g., tremendous increase in nucleotide sequences (the first surge in data followed the development of the polymerase chain reaction (PCR) technique in 1983)
• 1980: 80 genes fully sequenced
• …

Page 15: Biological Database Systems

Biological Databases?

• EMBL Database Growth:
  – Total nucleotides (Nov 07): 188,490,792,445
  – Number of entries (Nov 07): 106,144,026
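
As a rough sanity check on the two Nov 07 figures, dividing total nucleotides by the number of entries gives the mean entry length, which comes out a little under 1,800 nucleotides. The variable names below are illustrative only, and the mean says nothing about how entry lengths are distributed.

```python
# Nov 07 figures as given on the slide.
total_nucleotides = 188_490_792_445
number_of_entries = 106_144_026

# Mean number of nucleotides per database entry.
mean_entry_length = total_nucleotides / number_of_entries
print(f"Mean entry length: {mean_entry_length:.0f} nucleotides")
```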

Page 16: Biological Database Systems

Biological Databases?

• Data (genomic sequences, 3D structures, 2D gel analysis, microarrays, ...) directly submitted to databases

• Essential tools for biological research, on a par with reading the relevant literature

Page 17: Biological Database Systems

Biological Databases: History

• 1965
  – Margaret Dayhoff et al. publish “Atlas of Protein Sequence and Structure”
• 1982
  – EMBL initiates DNA sequence databases, followed within a year by GenBank and in 1984 by the DNA Data Bank of Japan (DDBJ)
• 1988
  – EMBL/GenBank/DDBJ agree on a common format for data elements

Page 18: Biological Database Systems

Biological Databases: some statistics

• More than 1000 different databases
  – 968 databases reported in The Molecular Biology Database Collection: 2007 update by Galperin, Nucleic Acids Research, 2007, Vol. 35, Database issue, D3-D4
  – Metabase: database of biological databases, http://biodatabase.org/index.php/Main_Page
• Database sizes: <100 kB to >100 GB (EMBL >500 GB)
  – DNA: >100 GB
  – Protein: 1 GB
  – 3D structure: 5 GB
• Update frequency: daily to annually
• Freely accessible (as a rule)

Page 19: Biological Database Systems

Some databases in the field of molecular biology

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.

Find more at http://biodatabase.org

Page 20: Biological Database Systems

Categories of Biological Databases

1. Nucleotide sequences
2. Genomics
3. Mutation/polymorphism
4. Protein sequences
5. Protein domain/family
6. Proteomics (2D gel, MS)

Page 21: Biological Database Systems

Categories of Biological Databases

7. Microarray
8. Organism-specific
9. 3D structure
10. Metabolism
11. Bibliography
12. Others


Page 23: Biological Database Systems

Biological Databases: special features

• Autonomous: many independent maintainers
• Heterogeneous data formats: e.g., various data formats for the same data elements
• Dynamic: frequent and continuous changes in data content (and, more importantly, in data schema)
• Broad domain knowledge
• Workflow-oriented: databases + rich set of analysis tools
• Information integration is essential: aggregate data from several databases
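
To make the heterogeneity and integration points concrete, here is a minimal sketch of the same sequence arriving in two different formats and being normalised into one common record. The FASTA parser handles only a single-record string. The `KEY=value` format, the `SequenceRecord` class, the identifier `seq1`, and the sequence itself are all hypothetical and invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    """Common representation shared by all sources (hypothetical)."""
    identifier: str
    sequence: str


def parse_fasta(text: str) -> SequenceRecord:
    """Parse a single-record FASTA string: a '>' header line, then sequence lines."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA record must start with '>'")
    # The identifier is the first whitespace-delimited token after '>'.
    identifier = lines[0][1:].split()[0]
    return SequenceRecord(identifier, "".join(lines[1:]).upper())


def parse_key_value(text: str) -> SequenceRecord:
    """Parse a made-up 'KEY=value' flat format used only in this sketch."""
    fields = dict(line.split("=", 1) for line in text.strip().splitlines())
    return SequenceRecord(fields["ID"].strip(), fields["SEQ"].strip().upper())


fasta_text = """>seq1 example fragment
ACGTAC
GTTA
"""

key_value_text = """ID=seq1
SEQ=acgtacgtta
"""

a = parse_fasta(fasta_text)
b = parse_key_value(key_value_text)

# Both sources describe the same data element once normalised.
print(a == b)
```

Real integration systems also have to reconcile identifiers, schemas, and update cycles across autonomous sources; this sketch covers only the format-normalisation step.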

Page 24: Biological Database Systems


Biological Databases: integration

Figure is taken from Bioinformatics: Managing Scientific Data by Lacroix & Critchlow, p.20