8
Introduction to Databases Shifra Ben-Dor Irit Orr Lecture Outline • Introduction – Data and Database types – Database components Data Formats Sample databases How to text search databases What “units of information” do we deal with in bioinformatics? • DNA • RNA • Protein • Sequence • Structure • Evolution • Pathways • Interactions • Mutations AAGTGCCACTGCATAAATGACCATGAGTGGGCACCGGTAAGGGAGGGTGATGCTATCTGGTCTGAAG Nucleotide sequence Genes mRNA Protein primary sequence Protein 3D structure Protein Function Acts as a tumor suppressor in many tumor types. induces growth arrest or apoptosis depending on the physiological circumstances or cell type, but both activities are involved in tumor suppression. Involved in the transport of chloride ions. Defects in CFTR are the cause of cystic fibrosis. It is the most common genetic disease in the caucasian population, with a prevalence of about 1 in 2000 live births. cf, an autosomal recessive disorder, is a common generalized disorder of exocrine gland function SNPs Slide provided by Dr. Vered Caspi

Lecture Outline Introduction to Databases¥Databases need to be well annotated . ¥Databases need to be easily searched . ¥Data found in databases should be easily retrieved. ¥Data

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lecture Outline Introduction to Databases¥Databases need to be well annotated . ¥Databases need to be easily searched . ¥Data found in databases should be easily retrieved. ¥Data

Introduction to Databases

Shifra Ben-Dor

Irit Orr

Lecture Outline

• Introduction

– Data and Database types

– Database components

• Data Formats

• Sample databases

• How to text search databases

What “units of information” do we deal

with in bioinformatics?

• DNA

• RNA

• Protein

• Sequence

• Structure

• Evolution

• Pathways

• Interactions

• Mutations

AAGTGCCACTGCATAAATGACCATGAGTGGGCACCGGTAAGGGAGGGTGATGCTATCTGGTCTGAAGNucleotidesequence

Genes

mRNA

Proteinprimarysequence

Protein 3Dstructure

ProteinFunction

Acts as a tumor suppressor in

many tumor types. induces growth

arrest or apoptosis depending on the

physiological circumstances or cell

type, but both activities are

involved in tumor suppression.

Involved in the transport of

chloride ions. Defects in CFTR

are the cause of cystic fibrosis.

It is the most common genetic disease

in the caucasian population, with a

prevalence of about 1 in 2000 live

births. cf, an autosomal recessive

disorder, is a common generalized

disorder of exocrine gland function

SNPs

Slide provided by Dr. Vered Caspi

Page 2: Lecture Outline Introduction to Databases¥Databases need to be well annotated . ¥Databases need to be easily searched . ¥Data found in databases should be easily retrieved. ¥Data

• What do we want from databases?

All of these have databases and tools

that were created to work with them

Information retrieval from

sequence databases

Biological databases contain enormousamounts of data.

• Databases need to be well annotated.

• Databases need to be easily searched.

• Data found in databases should be easilyretrieved.

• Data in databases should be in standardformats.

Integrated Information Retrieval

• Many databases contain logical relations betweenspecific entries.

• One interface - connecting many biologicaldatabases.

• For example: a database that connects betweenprotein sequence, protein domain, proteinstructure and reference databases. (Interpro)

• Another example: Connection between references,protein sequence, DNA sequence, and structuredatabases. (Entrez)

Slide provided by Dr. Vered Caspi

Page 3: Lecture Outline Introduction to Databases¥Databases need to be well annotated . ¥Databases need to be easily searched . ¥Data found in databases should be easily retrieved. ¥Data

Core Data and Annotation

Databases generally have (at least) two types of

data:

Core data: The data the database was generated

to organize

Annotation: Extra information that rounds out

our picture of the core data

For example in a genome database, the sequence

is the core data, and the location of genes is the

annotation

Database Issues

• Printed journals vs. databases

• Direct submission to databases (e.g.

GenBank, GDB, PDB)

• Archival vs. curated databases

• Databases that publish experimental

results of large genomic centers.

• Public vs. private databases.

For Example: Classification of Genomic Databases

Database

scope

Information

source

Information

type

Many genomes

One Genome

One Subject

One Gene

Direct submission from scientific community

Scientific literature

Genome center’s experimental results

Other databases

Mapping

Sequence & annotation

Protein structure & function

Variations

Comparative genomics

gene networks

Slide provided by Dr. Vered Caspi

User Interface

• Database search– free text

– field-specific

– sequence-based

• Database output– text

– graphics

– dynamic

Page 4: Lecture Outline Introduction to Databases¥Databases need to be well annotated . ¥Databases need to be easily searched . ¥Data found in databases should be easily retrieved. ¥Data

Data Formats

There are many data formats used for

sequences (both nucleic and amino acid)

• Fasta Format

• GenBank Format

• EMBL Format

• GCG Format

Fasta Format

• Simplest format

• Least information

• Starts with a > and sequence name

on one line

• The sequence in plain text follows

>OB2T2

GTGACAACATGTACAGCTGTGAGCGGTGTAAGAAGCTGCGGAACGGAGTGAAGTACTGCA

AAGTCCTGCGGTTGCCCGAGATCCTGTGCATTCACCTAAAGCGCTTTCGGCACGAGGTGA

TGTACTCATTCAAGATCAACAGCCACGTCTCCTTGCCCTCGAGGGGCTCGACCTGCGCCC

CTTCCTTGCCAAGGAGTGCACATCCCAGATCACCACCTACGACCTCCTCTCGGTCATCTG

CCACCACGGCACGGCAGGCA

>TNRC_HUMAN P36941 (tumor necrosis factor c receptor)

MLLPWATSAPGLAWGPLVLGLFGLLAASQPQAVPPYASENQTCRDQEKEYYEPQHRICCS

RCPPGTYVSAKCSRIRDTVCATCAENSYNEHWNYLTICQLCRPCDPVMGLEEIAPCTSKR

KTQCRCQPGMFCAAWALECTHCELLSDCPPGTEAELKDEVGKGNNHCVPCKAGHFQNTSS

PSARCQPHTRCENQGLVEAAPGTAQSDTTCKNPLEPLPPEMSGTMLMLAVLLPLAFFLLL

ATVFSCIWKSHPSLCRKLGSLLKRRPQGEGPNPVAGSWEPPKAHPYFPDLVQPLLPISGD

VSPVSTGLPAAPVLEAGVPQQQSPLDLTREPQLEPGEQSQVAHGTNGIHVTGGSMTITGN

IYIYNGPVLGGPPGPGDLPATPEPPYPIPEEGDPGPPGLSTPHQEDGKAWHLAETEHCGA

TPSNRGPRNQFITHD

>TNRC_MOUSE P50284 lymphotoxin-beta receptor precursor

MRLPRASSPCGLAWGPLLLGLSGLLVASQPQLVPPYRIENQTCWDQDKEYYEPMHDVCCS

RCPPGEFVFAVCSRSQDTVCKTCPHNSYNEHWNHLSTCQLCRPCDIVLGFEEVAPCTSDR

KAECRCQPGMSCVYLDNECVHCEEERLVLCQPGTEAEVTDEIMDTDVNCVPCKPGHFQNT

SSPRARCQPHTRCEIQGLVEAAPGTSYSDTICKNPPEPGAMLLLAILLSLVLFLLFTTVL

ACAWMRHPSLCRKLGTLLKRHPEGEESPPCPAPRADPHFPDLAEPLLPMSGDLSPSPAGP

PTAPSLEEVVLQQQSPLVQARELEAEPGEHGQVAHGANGIHVTGGSVTVTGNIYIYNGPV

LGGTRGPGDPPAPPEPPYPTPEEGAPGPSELSTPYQEDGKAWHLAETETLGCQDL

>TNR1_RAT P22934 tumor necrosis factor receptor 1 precursor (p60)

MGLPIVPGLLLSLVLLALLMGIHPSGVTGLVPSLGDREKRDNLCPQGKYAHPKNNSICCT

KCHKGTYLVSDCPSPGQETVCEVCDKGTFTASQNHVRQCLSCKTCRKEMFQVEISPCKAD

MDTVCGCKKNQFQRYLSETHFQCVDCSPCFNGTVTIPCKEKQNTVCNCHAGFFLSGNECT

PCSHCKKNQECMKLCLPPVANVTNPQDSGTAVLLPLVIFLGLCLLFFICISLLCRYPQWR

PRVYSIICRDSAPVKEVEGEGIVTKPLTPASIPAFSPNPGFNPTLGFSTTPRFSHPVSST

PISPVFGPSNWHNFVPPVREVVPTQGADPLLYGSLNPVPIPAPVRKWEDVVAAQPQRLDT

ADPAMLYAVVDGVPPTRWKEFMRLLGLSEHEIERLELQNGRCLREAHYSMLEAWRRRTPR

HEATLDVVGRVLCDMNLRGCLENIRETLESPAHSSTTHLPR

Page 5: Lecture Outline Introduction to Databases¥Databases need to be well annotated . ¥Databases need to be easily searched . ¥Data found in databases should be easily retrieved. ¥Data

Genbank sequence format

NM_000394. Homo sapiens crys...[gi:14043059]

LOCUS NM_000394 1114 bp mRNA PRI 15-MAY-2001DEFINITION Homo sapiens crystallin, alpha A (CRYAA), mRNA.ACCESSION NM_000394VERSION NM_000394.2 GI:14043059KEYWORDS .SOURCE human.ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata;Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;Hominidae; Homo.REFERENCE 1 (bases 1 to 1114) AUTHORS Jaworski,C.J. and Piatigorsky,J. TITLE A pseudo-exon in the functional human alpha A-crystallin gene

Genbank sequence format

JOURNAL Nature 337 (6209), 752-754 (1989) MEDLINE 89143747PUBMED 2918909REFERENCE 2 (bases 1 to 1114) AUTHORS Jaworski,C.J.TITLE A reassessment of mammalian alpha A-crystallinsequences using DNA sequencing: implications for anthropoidaffinities of tarsier

FEATURES Location/Qualifiers source 1..1114 /organism="Homo sapiens" /db_xref="taxon:9606" /chromosome="21" /map="21q22.3" gene 1..1114 /gene="CRYAA" /note="CRYA1" /db_xref="LocusID:1409" /db_xref="MIM:123580" misc_feature 70..234 /note="crystallin; Region: Alphacrystallin A chain" CDS 70..591 /gene="CRYAA" /note="human alphaA-crystallin;crystallin, alpha-1" /codon_start=1

/db_xref="LocusID:1409" /db_xref="MIM:123580" /product="crystallin, alpha A"

/protein_id="NP_000385.1" /db_xref="GI:4503055"

/translation="MDVTIQHPWFKRTLGPFYPSRLFDQFFGEGLFEYDLLPFL SSTISPYYRQSLFRTVLDSGISEVRSDRDKFVIFLDVKHFSP EDLTVKVQDDFVEIHGKHNERQDDHGYISREFHRRYRLPS NVDQSALSCSLSADGMLTFCGPKIQTGLDATHAERAIPVSR EEKPTSAPSS" misc_feature 244..555 /note="HSP20; Region: Hsp20/alpha crystallin family" polyA_signal 1092..1097

Page 6: Lecture Outline Introduction to Databases¥Databases need to be well annotated . ¥Databases need to be easily searched . ¥Data found in databases should be easily retrieved. ¥Data

BASE COUNT 183 a 400 c 309 g 222 tORIGIN 1 acactgcgct gcccagaggc cccgctgact cctgccagcc tccaggtccc cgtggtacca 61 aagctgaaca tggacgtgac catccagcac ccctggttca agcgcaccct ggggcccttc 121 taccccagcc ggctgttcga ccagtttttc ggcgagggcc tttttgagta tgacctgctg 181 cccttcctgt cgtccaccat cagcccctac taccgccagt ccctcttccg caccgtgctg 241 gactccggca tctctgaggt tcgatccgac cgggacaagt tcgtcatctt cctcgatgtg 301 aagcacttct ccccggagga cctcaccgtg aaggtgcagg acgactttgt ggagatccac 361 ggaaagcaca acgagcgcca ggacgaccac ggctacattt cccgtgagtt ccaccgccgc 421 taccgcctgc cgtccaacgt ggaccagtcg gccctctctt gctccctgtc tgccgatggc 481 atgctgacct tctgtggccc caagatccag actggcctgg atgccaccca cgccgagcga 541 gccatccccg tgtcgcggga ggagaagccc acctcggctc cctcgtccta agcaggcatt 601 gcctcggctg gctcccctgc agccctggcc catcatgggg ggagcaccct gagggcgggg 661 tgtctgtctt cctttgcttc ccttttttcc tttccacctt ctcacatgga atgagggttt 721 gagagagcag ccaggagagc ttagggtctc agggtgtccc agaccccgac accggccagt 781 ggcggaagtg accgcacctc acactccttt agatagcagc ctggctcccc tggggtgcag 841 gcgcctcaac tctgctgagg gtccagaagg agggggtgac ctccggccag gtgcctcctg 901 acacacctgc agcctccctc cgcggcgggc cctgcccaca cctcctgggg cgcgtgaggc 961 ccgtggggcc ggggcttctg tgcacctggg ctctcgcggc ctcttctctc agaccgtctt 1021 cctccaaccc ctctatgtag tgccgctctt ggggacatgg gtcgcccatg agagcgcagc 1081 ccgcggcaat caataaacag caggtgatac aagc//Revised: October 24, 2001.

EMBL sequence format

• ID A4279484 standard; DNA; FUN; 581 BP.

• XX

• AC AJ279484;

• XX

• SV AJ279484.1

• XX

• DT 14-JAN-2000 (Rel. 62, Created)

• DT 14-JAN-2000 (Rel. 62, Last updated, Version 2)

• XX

• DE Unidentified ascomycota sp. 4/97-9 5.8S rRNA gene and ITS 1 and 2

• XX

• KW 5.8S ribosomal RNA; 5.8S rRNA gene; internal transcribed spacer 1;

EMBL sequence format

• KW internal transcribed spacer 2; ITS1; ITS2.

• XX

• OS ascomycota sp. 4/97-9

• OC Eukaryota; Fungi; Ascomycota.

• XX

• RN [1]

• RP 1-581

• RA Wirsel S.G.R.;

• RT ;

• RL Submitted (21-DEC-1999) to the EMBL/GenBank/DDBJ databases.

• RL Wirsel S.G.R., Fakultaet fuer Biologie, Universitaet Konstanz,

• RL Universitaetsstr. 10, Konstanz 78434, Germany.

• XX

EMBL sequence format

• RN [2]

• RA Wirsel S.G.R., Leibinger W., Mendgen K.W.;

• RT "Genetic diversity of fungi associated with common reed (Phragmites

• RT australis)";

• RL Unpublished.

• XX

• FH Key Location/Qualifiers

• FH

• FT source 1..581

• FT /db_xref="taxon:112223"

• FT /organism="ascomycota sp. 4/97-9"

• FT /isolate="4/97-9"

Page 7: Lecture Outline Introduction to Databases¥Databases need to be well annotated . ¥Databases need to be easily searched . ¥Data found in databases should be easily retrieved. ¥Data

EMBL sequence format

• FT misc_feature 64..226

• FT /note="internal transcribed spacer 1, ITS1"

• FT rRNA 227..385

• FT /gene="5.8S rRNA"

• FT /product="5.8S ribosomal RNA"

• FT misc_feature 386..529

• FT /note="internal transcribed spacer 2, ITS2"

• XX

• SQ Sequence 581 BP; 132 A; 164 C; 145 G; 140 T; 0 other;

ccatttagag gaagtaaaag tcgtaacaag gtctccgttg gtgaaccagggagggatc 60 ttacgagagt

gtcaccactc ccaacccact gtttacctac ccgtccaccg tgcttcggca 120 ggcagtcctg tgggacaggg

cctcgccccc ctccgggggg tgcctgccgc

EMBL entry

• Each line in the entry begins with a two-character line code, which

indicates the type of information contained in the line.

• The currently used line types, along with their respective line codes, are

listed below:

• ID - identification (begins each entry; 1 per entry)

• AC - accession number (>=1 per entry)

• SV - sequence version (1 per entry)

• DT - date (2 per entry)

• DE - description (>=1 per entry)

• KW - keyword (>=1 per entry)

EMBL entry

• OS - organism species (>=1 per entry)

• OC - organism classification (>=1 per entry)

• OG - organelle (0 or 1 per entry)

• RN - reference number (>=1 per entry)

• RC - reference comment (>=0 per entry)

• RP - reference positions (>=1 per entry)

• RX - reference cross-reference (>=0 per entry)

• RA - reference author(s) (>=1 per entry)

• RT - reference title (>=1 per entry)

• RL - reference location (>=1 per entry)

• DR - database cross-reference (>=0 per entry)

EMBL entry

• FH - feature table header (0 or 2 per entry)

• FT - feature table data (>=0 per entry)

• CC - comments or notes (>=0 per entry)

• XX - spacer line (many per entry)

• SQ - sequence header (1 per entry)

• bb - (blanks) sequence data (>=1 per entry)

• // - termination line (ends each entry; 1 per

entry )

Page 8: Lecture Outline Introduction to Databases¥Databases need to be well annotated . ¥Databases need to be easily searched . ¥Data found in databases should be easily retrieved. ¥Data

GCG Format

• Has space for comments and space

for data, separated by two dots ..• Can contain full sequence data like

GenBank or EMBL

• Has a minimum of sequence name,length, date, type (nucleic or aminoacid) and checksum

!!NA_SEQUENCE 1.0

5B3.seq Length: 744 March 18, 1999 10:43 Type: N Check: 2586 ..

1 TCTAGAGGAG AYATYGTWAT GACCCAGTCT CCATCCTCCC TGAGTGTGTC

51 AGCAGGAGAG AAGGTCACTA TGAGCTGCAA GTCCAGTCAG AGTCTGTTAA

101 ACAGTAGAAA TCAAAAGAAC TACTTGGCCT GGTACCAGCA GAAACCAGGA

151 CAGCCTCCTA AACTTTTGAT CTACGGGGTA TTTATTAGGG ATTCTGGGGT

201 CCCTGATCGC TTCACAGGCA GTGGATCTGG AACCGATTTC ACTCTTACCA

251 TCAGCAGTGT GCAGGCTGAA GACCTGGCAG TTTATTACTG TCAGAATGAT

301 CATATTTATC CGTACACGTT CGGAGGGGGC ACWAAGCTGG AAATTAAAGG

351 GTCGACTTCC GGTAGCGGCA AATCCTCTGA AGGCAAAGGT SAGGTSCAGC

401 TGCAGGAGTC TGGACCTGGC CTGGTGAAGC CTTCCCAGTC TCTGTCCCTC

451 ACCTGCTCTG TCACTGGTTA CTCAATCACC AGTGGTTATG CCTGGAACTG

501 GATCCGGCAG TTTCCAGGAA ACAAACTGGA GTGGATGGGC TACATAAGCT

551 ACAGTGGTTT CACTAGCTAC AACCCATCTC TCAGAAGTCG AATCTCTTTC