Upload
nigel-lynch
View
217
Download
1
Tags:
Embed Size (px)
Citation preview
1
Assignment: Due Friday:
READ:
Ensembl API Tutorial:http://www.ensembl.org/info/software/core/
core_tutorial.html
READ:Encode Project paper (on Icon)
2
Databases
• Reference: Bioinformatics Computer Skills– Gibas and Jambeck
3
Database Intro (for Ensembl Modules)
Database Design for Mere Mortals,
M Hernandez, Addison-Wesley
4
Databases
• What is a database?
5
Databases• Vital part of genomics, bioinformatics, genetics, and biology
• Genomic data– sequence, genes, annotation, markers, literature, diseases, structures, SNPs, mutations, etc, etc, etc.
– Ensembl, UCSC, NCBI -- the big "three"
• Knowing how to find and download from central biological repositories is an important skill
• However, this is a tool building class, and databases on the web becomes more integral to sharing information– build our own DBs
6
Databases
• This portion of the course will not make you an expert in databases (there are other courses for that), nor will you be able to design and implement a major database
• However, you will have a basic understanding of the steps involved, and you will be able to instantiate rudimentary databases from which you may expand with efforts beyond this class
• Additionally, you will be better equipped to query existing databases and develop applications/interfaces to interact with DBs
7
Databases
• Steps– design a data model– select a database management system
– implement data model– interface
8
DBMS
• types– flat file index– relational– object-oriented
9
Database Introduction: terms
• Field, attribute, tuple, record– Individual data items within a database– Often “field” describes the “place holder” in a database table, and “record”
describes that actual data value
• Table– One or more attributes grouped together
• Relation, link, pointer– A connection between two fields (usually between tables)
• Other terms– meta data -- data that describes attributes (like the column titles in a
spread sheet)– data elements -- a "class" and attribute
• ex) subject, name• subject, address• subject, affection_status
10
Databases• Operational
– Used whenever there is need to collect, maintain, and modify data– Dynamic data
• Changes constantly• Reflects up-to-the minute info
– Examples (Biology/Genetics)• Ensembl*, NCBI (GenBank, PubMed, dBest, UniGene*, LocusLink*), UCSC*
• Analytical– Store and track historical and time-dependent data– Static
• Many of the biological databases (Ensembl, UniGene, UCSC) are the result of substantial analyses and are relatively static – with a release period of between 1 week and 6 months.
• These do not change between releases – something between operational and analytical
11
Flat File Databases
• an ordered collection of similar files• usually conforming to a standard format (but not always)
• a flat file database means that rules can be applied to find data within the files (rather than remembering the locations of all individual files)
• made searchable by indexing• an index is a pairing of an "attribute" from within a file and the file name/location
12
Flat File Example>gnl|UG|Hs#S593306 zr74d01.r1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:669121 5', mRNA sequence /clone=IMAGE:669121 /clone_end=5' /gb=AA234466 /gi=1858985 /ug=Hs.68879 /len=276ATTAAGATCTGAAAACTGTGATGCGTCCTTTCTGCAGAGACGCCTCTTTCTGAATCTGCCCGGAGCTTCGAGCCCCGGCGTCTGTCCCTCAGCCTGGCATGGCTTCTTCGGGGGTCTGCTTTGCATGGGGAGAGGGGCCACGCAGCGGACGGACTAGGTTTGGGGATTCTCGGTAATGGACCCGGAGCAATGACTAACAGCCGCTCCCTCTCACTTTCCCACAGCGATCACCCTCTAACACCCTCCCTCCCATTCCCGGCCCCGCGCGTGACAAGG
Index:
/HomoSapiens/EST/5-prime/zr74d01.r1 Hs.68879
13
LOCUS AA234466 276 bp mRNA linear EST 06-AUG-1997DEFINITION zr74d01.r1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:669121 5', mRNA sequence.ACCESSION AA234466VERSION AA234466.1 GI:1858985KEYWORDS EST.SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo.REFERENCE 1 (bases 1 to 276) AUTHORS Hillier,L., Allen,M., Bowles,L., Dubuque,T., Geisel,G., Jost,S., Kucaba,T., Lacy,M., Le,N., Lennon,G., Marra,M., Martin,J., Moore,B. , Schellenberg,K., Steptoe,M., Tan,F., Theising,B., White,Y., Wylie ,T., Waterston,R. and Wilson,R. TITLE WashU-Merck EST Project 1997 JOURNAL Unpublished (1997)COMMENT Contact: Wilson RK Washington University School of Medicine 4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108 Tel: 314 286 1800 Fax: 314 286 1810 Email: [email protected] This clone is available royalty-free through LLNL ; contact the IMAGE Consortium ([email protected]) for further information. Insert Length: 871 Std Error: 0.00 Seq primer: -28m13 rev2 ET from Amersham High quality sequence stop: 266.FEATURES Location/Qualifiers source 1..276 /organism="Homo sapiens" /mol_type="mRNA"
14
Flat File Databases
• as a flat file database grows larger, the one-dimensional aspect of indices makes it difficult to make connections between attributes
• many biological databases began as flat file databases, and this legacy has influenced file formats and applications (PDB, ESTs, GeneBank)
• because of the ease of browsing a file system, many systems adopt a hybrid approach -- incorporating both a relational database and flat file system
15
Relational Databases• Just like flat file systems, relational databases are a way of assembling, storing, and retrieving information/data
• in a relational database, data is stored in "tables"
• a flat file database that describes protein structure is like a book– chapters about the origin of the sample, how the data was collected, the sequence, secondary structure, positions of the atoms
• in a relational database, this information is spread across tables– a table for experimental conditions, sequence, secondary structure, etc.
16
Relational Databases
• rows in the tables may refer to unique proteins• connections between the proteins can be made ( they aren't bound together like a book)
• the form of the tables follows rules that are uniform so that you can access different attributes from different tables
• you can freely access all entries for atomic position at once, or all attributes for a particular protein
• for one protein, not inconvenient to "look it up" and read the entry in the "book" (I.e., flat file DB)
• protein name, author who deposited it, can be put in an index (like a card catalogue) to help find the book
• but how do you extract the secondary structure out of every "book" in the library -- you can't!
17
Relational Databases
• a relational database management system (RDMS) allows you to "look" at the database from many different "views"
• a RDMS essentially allows you to dynamically "create" the contents of your "book" by querying for the information that you want
18
Table organization
• data in a table are organized in "rows"
• each row represents one "record"• a row may contain separate pieces of information -- "fields" or "attributes"
• tables are not glorified flat files -- although they may look that way if you examine them in their entirety
19
Example
• We want to keep track of genes for the purpose of mutation screening using SSCP or sequencing
• We want to keep track of:– lists of genes– sequence of genes (to design primers)– exons and introns– multiple transcripts?
• Since we are putting this together for Pfizer (who grossed $52 billion in 2006), we'll be cute and refer to genes as "targets" -- since they are looking for druggable "targets"
20
A first attemptproject (table name)
id project_name gene_name datedescription
4 glaucoma BBS4 12/12/04Glaucoma is…
5 glaucoma ABCA6 Glaucoma is…
target (table name)
id name sequence primerF primerR (fields)
5 BBS4 ATGCGG… GCTAGT ATGACCCT
6 BMP4 ATGCCC… GGTATG TGAATGAG
…
exon
id gene_id sequence
1 5 ATGC…
2 5 CCGG…
21
Observations• values in the "target" table are related to values in the "project" table
• however, it doesn't make sense to put all of the data into one big table (we could -- but it would be very redundant)
• why are there multiple entries in project for the same project???
• does it really make sense to put primer information in the "target" table???
• the two data types ("project" and "target") are related but "orthogonal"
• anywhere in a grouping of data where it is not sensible to include fields in a table, a new table should be created– normalization -- the process of separating complex data into a collection of mutually orthogonal related tables
22
What's wrong with this?
• No inherent flaws, but may not be the most efficient– duplication (exon sequences in "exon" table are subsets of the full gene sequence in "target' table)
– gene names are duplicated between "project" and "target" table
– this isn't a major issue -- but for the sake of argument -- what happens if the gene name for BMP4 is changed to BMO4?
– Description of "project"s is duplicated
23
exon
transcript_id
exon_num
sequence_start
sequence_stop
intron
transcript_id
intron_num
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id
sequence_id
source
source_id
sequence
id
target_id
type
sequence
chr_name
strand
genomic_start
genomic_stop
source
source_id
refresh
target
id
date
gene_name
description
accession
status
project
id
name
description
date
set_table
id
project_id
name
date
description
set_target_join
set_id
target_id
rank
cas_rank
cas_options
24
exon
transcript_id
exon_num = 3
sequence_start
sequence_stop
intron
transcript_id
intron_num = 3
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id
sequence_id
source = Ensembl
source_id
sequence
id
target_id
type = nucleotide
sequence = ATG…
chr_name = 15
strand = 1
genomic_start = 15,123,120
genomic_stop = 16,378,131
source
source_id
refresh
target
id
date
gene_name = BBS4
description
accession
status
project
id
name = pro1
description
date
set_table
id
project_id
name = testset
date
description
set_target_join
set_id
target_id
rank = 5
cas_rank
cas_options
Sample Data
25
Database Schema
• network of tables and relationships defines the "database schema"
• generally you would carefully develop a schema so that it keeps utility over time– test data– example queries
26
Relational Database Model (basis for modern databases)
• Codd, E. “A Relational Model of Data for Large Shared Databanks.” Communications of the ACM, 1970, 377-387.– Based on set theory, and predicate logic– “Relation” term is derived from set theory (not from
notion that tables are “related” to each other).– Physical order of records is immaterial– Each record in a table is identified by a field that
contains a unique value (sound like anything else we know???)
– User does not need to know the physical location of a record to retrieve data (unlike other models, hierarchical, network).
27
End
28
Database Intro: An early model – Hierarchical Database Model
Agents
Entertainers
Schedule
Clients
Engagements Payments
Features:- inverted tree with Root node- relationships are parent/child- each child has 1 parent, but parents may have multiple children- tables linked by a “pointer”- access to data requires familiarity with the structure
29
Database Intro: An early model – Hierarchical Database Model
Agents
Entertainers
Schedule
Clients
Engagements Payments
Advantages:- quick retrieval- referential integrity is built in and automatically enforced- example: record in child table must be linked to existing record in parent table
- if record deleted in parent table, all associated records in child tables are deleted as well
30
Database Intro: An early model – Hierarchical Database Model
Agents
Entertainers
Schedule
Clients
Engagements Payments
Disadvantages:- difficult to handle child entries that are initially unrelated to parent- example:
- Want to add a new entertainer, but cannot add entertainer until it is associated with an Agent- Have to insert a “dummy” record
- does not support complex relationships- redundant data: many to many relationship between Clients and Entertainers, so Schedule (date, time) will have same entries asEngagements (date, time)
31
Relational Database Model
Agents
EntertainersClients
Engagements
Advantages:-built in multilevel data integrity:
table: records are not duplicated, and detect missing primary keys
relationship: ensure connection between tables is valid-logical and physical data independence from database application:
changes in physical implementation of database would not (in theory) adversely affect applications built to use DB
-easy data retrieval: data may be retrieved from any table or from any number of related tables
Style
Music
32
Relational Database ModelAgents
EntertainersClients
Engagements
-Relationships -may be: one-to-one, one-to-many, and many-to-many-Established through matching values of a shared field
-Example: Agents are associated to Clients by an AgentsID field in Clients table
-Data may be accessed in an almost unlimited combination of ways
Styles
Music