1 Assignment: Due Friday: READ: Ensembl API Tutorial: READ: Encode Project paper (on Icon)

1

Assignment: Due Friday:

READ:

Ensembl API Tutorial:http://www.ensembl.org/info/software/core/

core_tutorial.html

READ:Encode Project paper (on Icon)

2

Databases

• Reference: Bioinformatics Computer Skills– Gibas and Jambeck

3

Database Intro (for Ensembl Modules)

Database Design for Mere Mortals,

M Hernandez, Addison-Wesley

4

Databases

• What is a database?

5

Databases• Vital part of genomics, bioinformatics, genetics, and biology

• Genomic data– sequence, genes, annotation, markers, literature, diseases, structures, SNPs, mutations, etc, etc, etc.

– Ensembl, UCSC, NCBI -- the big "three"

• Knowing how to find and download from central biological repositories is an important skill

• However, this is a tool building class, and databases on the web becomes more integral to sharing information– build our own DBs

6

Databases

• This portion of the course will not make you an expert in databases (there are other courses for that), nor will you be able to design and implement a major database

• However, you will have a basic understanding of the steps involved, and you will be able to instantiate rudimentary databases from which you may expand with efforts beyond this class

• Additionally, you will be better equipped to query existing databases and develop applications/interfaces to interact with DBs

7

Databases

• Steps– design a data model– select a database management system

– implement data model– interface

8

DBMS

• types– flat file index– relational– object-oriented

9

Database Introduction: terms

• Field, attribute, tuple, record– Individual data items within a database– Often “field” describes the “place holder” in a database table, and “record”

describes that actual data value

• Table– One or more attributes grouped together

• Relation, link, pointer– A connection between two fields (usually between tables)

• Other terms– meta data -- data that describes attributes (like the column titles in a

spread sheet)– data elements -- a "class" and attribute

• ex) subject, name• subject, address• subject, affection_status

10

Databases• Operational

– Used whenever there is need to collect, maintain, and modify data– Dynamic data

• Changes constantly• Reflects up-to-the minute info

– Examples (Biology/Genetics)• Ensembl*, NCBI (GenBank, PubMed, dBest, UniGene*, LocusLink*), UCSC*

• Analytical– Store and track historical and time-dependent data– Static

• Many of the biological databases (Ensembl, UniGene, UCSC) are the result of substantial analyses and are relatively static – with a release period of between 1 week and 6 months.

• These do not change between releases – something between operational and analytical

11

Flat File Databases

• an ordered collection of similar files• usually conforming to a standard format (but not always)

• a flat file database means that rules can be applied to find data within the files (rather than remembering the locations of all individual files)

• made searchable by indexing• an index is a pairing of an "attribute" from within a file and the file name/location

12

Flat File Example>gnl|UG|Hs#S593306 zr74d01.r1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:669121 5', mRNA sequence /clone=IMAGE:669121 /clone_end=5' /gb=AA234466 /gi=1858985 /ug=Hs.68879 /len=276ATTAAGATCTGAAAACTGTGATGCGTCCTTTCTGCAGAGACGCCTCTTTCTGAATCTGCCCGGAGCTTCGAGCCCCGGCGTCTGTCCCTCAGCCTGGCATGGCTTCTTCGGGGGTCTGCTTTGCATGGGGAGAGGGGCCACGCAGCGGACGGACTAGGTTTGGGGATTCTCGGTAATGGACCCGGAGCAATGACTAACAGCCGCTCCCTCTCACTTTCCCACAGCGATCACCCTCTAACACCCTCCCTCCCATTCCCGGCCCCGCGCGTGACAAGG

Index:

/HomoSapiens/EST/5-prime/zr74d01.r1 Hs.68879

13

LOCUS AA234466 276 bp mRNA linear EST 06-AUG-1997DEFINITION zr74d01.r1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:669121 5', mRNA sequence.ACCESSION AA234466VERSION AA234466.1 GI:1858985KEYWORDS EST.SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo.REFERENCE 1 (bases 1 to 276) AUTHORS Hillier,L., Allen,M., Bowles,L., Dubuque,T., Geisel,G., Jost,S., Kucaba,T., Lacy,M., Le,N., Lennon,G., Marra,M., Martin,J., Moore,B. , Schellenberg,K., Steptoe,M., Tan,F., Theising,B., White,Y., Wylie ,T., Waterston,R. and Wilson,R. TITLE WashU-Merck EST Project 1997 JOURNAL Unpublished (1997)COMMENT Contact: Wilson RK Washington University School of Medicine 4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108 Tel: 314 286 1800 Fax: 314 286 1810 Email: [email protected] This clone is available royalty-free through LLNL ; contact the IMAGE Consortium ([email protected]) for further information. Insert Length: 871 Std Error: 0.00 Seq primer: -28m13 rev2 ET from Amersham High quality sequence stop: 266.FEATURES Location/Qualifiers source 1..276 /organism="Homo sapiens" /mol_type="mRNA"

14

Flat File Databases

• as a flat file database grows larger, the one-dimensional aspect of indices makes it difficult to make connections between attributes

• many biological databases began as flat file databases, and this legacy has influenced file formats and applications (PDB, ESTs, GeneBank)

• because of the ease of browsing a file system, many systems adopt a hybrid approach -- incorporating both a relational database and flat file system

15

Relational Databases• Just like flat file systems, relational databases are a way of assembling, storing, and retrieving information/data

• in a relational database, data is stored in "tables"

• a flat file database that describes protein structure is like a book– chapters about the origin of the sample, how the data was collected, the sequence, secondary structure, positions of the atoms

• in a relational database, this information is spread across tables– a table for experimental conditions, sequence, secondary structure, etc.

16

Relational Databases

• rows in the tables may refer to unique proteins• connections between the proteins can be made ( they aren't bound together like a book)

• the form of the tables follows rules that are uniform so that you can access different attributes from different tables

• you can freely access all entries for atomic position at once, or all attributes for a particular protein

• for one protein, not inconvenient to "look it up" and read the entry in the "book" (I.e., flat file DB)

• protein name, author who deposited it, can be put in an index (like a card catalogue) to help find the book

• but how do you extract the secondary structure out of every "book" in the library -- you can't!

17

Relational Databases

• a relational database management system (RDMS) allows you to "look" at the database from many different "views"

• a RDMS essentially allows you to dynamically "create" the contents of your "book" by querying for the information that you want

18

Table organization

• data in a table are organized in "rows"

• each row represents one "record"• a row may contain separate pieces of information -- "fields" or "attributes"

• tables are not glorified flat files -- although they may look that way if you examine them in their entirety

19

Example

• We want to keep track of genes for the purpose of mutation screening using SSCP or sequencing

• We want to keep track of:– lists of genes– sequence of genes (to design primers)– exons and introns– multiple transcripts?

• Since we are putting this together for Pfizer (who grossed $52 billion in 2006), we'll be cute and refer to genes as "targets" -- since they are looking for druggable "targets"

20

A first attemptproject (table name)

id project_name gene_name datedescription

4 glaucoma BBS4 12/12/04Glaucoma is…

5 glaucoma ABCA6 Glaucoma is…

target (table name)

id name sequence primerF primerR (fields)

5 BBS4 ATGCGG… GCTAGT ATGACCCT

6 BMP4 ATGCCC… GGTATG TGAATGAG

…

exon

id gene_id sequence

1 5 ATGC…

2 5 CCGG…

21

Observations• values in the "target" table are related to values in the "project" table

• however, it doesn't make sense to put all of the data into one big table (we could -- but it would be very redundant)

• why are there multiple entries in project for the same project???

• does it really make sense to put primer information in the "target" table???

• the two data types ("project" and "target") are related but "orthogonal"

• anywhere in a grouping of data where it is not sensible to include fields in a table, a new table should be created– normalization -- the process of separating complex data into a collection of mutually orthogonal related tables

22

What's wrong with this?

• No inherent flaws, but may not be the most efficient– duplication (exon sequences in "exon" table are subsets of the full gene sequence in "target' table)

– gene names are duplicated between "project" and "target" table

– this isn't a major issue -- but for the sake of argument -- what happens if the gene name for BMP4 is changed to BMO4?

– Description of "project"s is duplicated

23

exon

transcript_id

exon_num

sequence_start

sequence_stop

intron

transcript_id

intron_num

sequence_start

sequence_stop

primer_pair

id

transcript_id

left_primer_id

right_primer_id

transcript

id

sequence_id

source

source_id

sequence

id

target_id

type

sequence

chr_name

strand

genomic_start

genomic_stop

source

source_id

refresh

target

id

date

gene_name

description

accession

status

project

id

name

description

date

set_table

id

project_id

name

date

description

set_target_join

set_id

target_id

rank

cas_rank

cas_options

24

exon

transcript_id

exon_num = 3

sequence_start

sequence_stop

intron

transcript_id

intron_num = 3

sequence_start

sequence_stop

primer_pair

id

transcript_id

left_primer_id

right_primer_id

transcript

id

sequence_id

source = Ensembl

source_id

sequence

id

target_id

type = nucleotide

sequence = ATG…

chr_name = 15

strand = 1

genomic_start = 15,123,120

genomic_stop = 16,378,131

source

source_id

refresh

target

id

date

gene_name = BBS4

description

accession

status

project

id

name = pro1

description

date

set_table

id

project_id

name = testset

date

description

set_target_join

set_id

target_id

rank = 5

cas_rank

cas_options

Sample Data

25

Database Schema

• network of tables and relationships defines the "database schema"

• generally you would carefully develop a schema so that it keeps utility over time– test data– example queries

26

Relational Database Model (basis for modern databases)

• Codd, E. “A Relational Model of Data for Large Shared Databanks.” Communications of the ACM, 1970, 377-387.– Based on set theory, and predicate logic– “Relation” term is derived from set theory (not from

notion that tables are “related” to each other).– Physical order of records is immaterial– Each record in a table is identified by a field that

contains a unique value (sound like anything else we know???)

– User does not need to know the physical location of a record to retrieve data (unlike other models, hierarchical, network).

27

End

28

Database Intro: An early model – Hierarchical Database Model

Agents

Entertainers

Schedule

Clients

Engagements Payments

Features:- inverted tree with Root node- relationships are parent/child- each child has 1 parent, but parents may have multiple children- tables linked by a “pointer”- access to data requires familiarity with the structure

29


Agents

Entertainers

Schedule

Clients


Advantages:- quick retrieval- referential integrity is built in and automatically enforced- example: record in child table must be linked to existing record in parent table

- if record deleted in parent table, all associated records in child tables are deleted as well

30


Agents

Entertainers

Schedule

Clients


Disadvantages:- difficult to handle child entries that are initially unrelated to parent- example:

- Want to add a new entertainer, but cannot add entertainer until it is associated with an Agent- Have to insert a “dummy” record

- does not support complex relationships- redundant data: many to many relationship between Clients and Entertainers, so Schedule (date, time) will have same entries asEngagements (date, time)

31

Relational Database Model

Agents

EntertainersClients

Engagements

Advantages:-built in multilevel data integrity:

table: records are not duplicated, and detect missing primary keys

relationship: ensure connection between tables is valid-logical and physical data independence from database application:

changes in physical implementation of database would not (in theory) adversely affect applications built to use DB

-easy data retrieval: data may be retrieved from any table or from any number of related tables

Style

Music

32

Relational Database ModelAgents

EntertainersClients

Engagements

-Relationships -may be: one-to-one, one-to-many, and many-to-many-Established through matching values of a shared field

-Example: Agents are associated to Clients by an AgentsID field in Clients table

-Data may be accessed in an almost unlimited combination of ways

Styles

Music

Documents

1 Assignment: Due Friday: READ: Ensembl API Tutorial: READ: Encode Project paper (on Icon)