A Guided SQL Tour of Bioinformatics Databases
Yannick Pouliot, PhD, Bioresearch Informationist
[email protected]
Lane Medical Library & Knowledge Management Center, http://lane.stanford.edu
2/28/2007


DESCRIPTION

Understanding relational databases for biomedical research



Content
- Very abbreviated review of the relational principle
- Some of the technology required to connect to a remote database
- Walk-through of the database schema for Ensembl
  - Hands-on querying
- Walk-through of the database schema for BioWarehouse
  - Hands-on querying
- Resources
  - Details on connecting to a remote database


So Why Are We Here?


Bioinformatics Databases: Who Supports Direct Querying?


Relational Database Terms
- Database: collection of tables and the relationships between tables
- Table: collection of records that share a common fundamental characteristic
  - E.g., patients and locations can each be stored in their own table
- Record: basic unit of information in a relational database
  - E.g., 1 record per person
  - A record is composed of columns (“fields”)
- Query: set of instructions to a database “engine” to retrieve, sort, and format the returned data
  - E.g., “find me all patients in my database”


Main Relational Database “Engines”

- FileMaker
- MS Access
- MS SQL Server
- MySQL
- Oracle
- PostgreSQL
- Sybase


Structure of Relational DB Tables

Data values live in rows


Understanding the Relational Principle: A Simple Database

- Every patient gets ONE record in the Patients table
- Every visit gets ONE record in the Visits table
- Rows in different tables can be related to one another using a shared key (identifier): a “join”
- There can be multiple visit records for a given patient
- There can be multiple tissue records for a given patient


The Relational Principle at Work

- Related records can be found using a shared key
- Example: Patients.ID = Visits.PatientID

[Figure labels: table name, primary key]
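The shared-key idea above can be tried out without any remote database. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine; the table layouts and sample rows are hypothetical and were invented for illustration, keeping only the Patients.ID = Visits.PatientID relationship from the slide.

```python
import sqlite3

# In-memory SQLite database standing in for a real engine.
# Assumption: columns and rows are invented; only the key
# relationship Patients.ID = Visits.PatientID comes from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Patients (ID INTEGER PRIMARY KEY, LastName TEXT);
    CREATE TABLE Visits (
        ID INTEGER PRIMARY KEY,
        PatientID INTEGER REFERENCES Patients(ID),
        VisitDate TEXT
    );
    INSERT INTO Patients VALUES (1, 'Garcia'), (2, 'Lee');
    INSERT INTO Visits VALUES
        (10, 1, '2007-01-15'),
        (11, 1, '2007-02-20'),
        (12, 2, '2007-02-21');
""")

# Join the two tables on the shared key: one output row per visit,
# with the patient's name pulled in from the Patients table.
visits_by_patient = conn.execute("""
    SELECT Patients.LastName, Visits.VisitDate
    FROM Patients, Visits
    WHERE Patients.ID = Visits.PatientID
    ORDER BY Patients.LastName, Visits.VisitDate
""").fetchall()

for last_name, visit_date in visits_by_patient:
    print(last_name, visit_date)
```

Each patient has one row in Patients, yet appears once per visit in the joined output, which is the one-to-many relationship the slide describes.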


SQL Querying…With What?

- Query browsers used here:
  - MySQL Query Browser
  - WinSQL
- Other query browsers exist but are more sophisticated
  - Often more expensive or more complex
  - Example: PL/SQL Developer, from Allround Automations


Example: Network Querying of Ensembl Database Using MySQL Query Browser

- What happens when you query a remote database? DEMO
- Of note:
  - May take some time
    - Big database, lots of data to return from far away…
  - Easy to write queries with voluminous output
    - May have to kill the query…
- Setting up ODBC: not discussed here, but cheat-sheet instructions are in the handout. Location will also be mailed


The Database Schema: Your Roadmap For Querying

- The schema describes all tables and all fields
- Used to determine how to inter-relate tables to retrieve the desired data
- Very important:
  - Must understand the schema for accurate querying
  - Wrong understanding = wrong results


Introducing The SQL Select Statement

Good news: this is the only SQL statement you need to understand for querying

SELECT LastName, FirstName
FROM Patients


Basic Syntax of Select Statement

SELECT field_name FROM table [WHERE condition]

[ ] = optional

Example:

SELECT LastName, FirstName FROM Patients WHERE Alive = 'Y';

- Note: SQL keywords are not case sensitive, but table and field names can be, depending on the engine and platform (Oracle folds unquoted names to upper case)
- Query statements are written into a tool such as MS Query or MySQL Query Browser

Handout: p2
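To see what the optional WHERE clause does, here is a small sketch that runs the example statement with and without it. It uses Python's built-in sqlite3 module in place of MS Query or MySQL Query Browser, and the rows in the Patients table are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite table with made-up rows stands in
# for the Patients table used in the slide's example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Patients (LastName TEXT, FirstName TEXT, Alive TEXT)")
conn.executemany(
    "INSERT INTO Patients VALUES (?, ?, ?)",
    [("Garcia", "Ana", "Y"), ("Lee", "Min", "N"), ("Smith", "John", "Y")],
)

# Without WHERE: every record in the table is returned.
without_where = conn.execute(
    "SELECT LastName, FirstName FROM Patients"
).fetchall()

# With WHERE: only records satisfying the condition are returned.
with_where = conn.execute(
    "SELECT LastName, FirstName FROM Patients WHERE Alive = 'Y'"
).fetchall()

print(len(without_where), "rows without WHERE")
print(len(with_where), "rows with WHERE Alive = 'Y'")
```

Note that neither query specifies ORDER BY, so the order of the returned rows should not be relied upon.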


SELECT – (Some) Details


Moving On: Real Biodatabase Schemas


Schemas We’ll Look At…
- Remember: schemas…
  - describe all tables and all fields
  - are used to determine how to inter-relate tables to retrieve the desired data
- Our schemas today:
  - Ensembl
  - BioWarehouse


Ensembl
- Produced by Sanger Institute
- Collection of genome databases for many different organisms
- Free, open source
- Web querying: http://www.ensembl.org/
- FAQ: What is Ensembl?
- All PubMed references pertaining to Ensembl and written by the Ensembl group


Exploring the Ensembl Schema

- Ensembl CORE schema documentation
  - First place to go to answer: “what does this table store?”
- Problem: no graphical representation of the overall schema
  - Relationships harder to appreciate
- Use the Catalog function and go from there…


“Fundamental” Tables

Fundamental tables: assembly, assembly_exception, attrib_type, coord_system, dna, dnac, exon, exon_stable_id, exon_transcript, gene, gene_stable_id, karyotype, meta, meta_coord, prediction_exon, prediction_transcript, seq_region, seq_region_attrib, supporting_feature, transcript, transcript_attrib, transcript_stable_id, translation, translation_attrib, translation_stable_id

Features and analyses: alt_allele, analysis, analysis_description, density_feature, density_type, dna_align_feature, map, marker, marker_feature, marker_map_location, marker_synonym, misc_attrib, misc_feature, misc_feature_misc_set, misc_set, prediction_transcript, protein_align_feature, protein_feature, qtl, qtl_feature, qtl_synonym, regulatory_factor, regulatory_factor_coding, regulatory_feature, regulatory_feature_object, regulatory_search_region, repeat_consensus, repeat_feature, simple_feature

ID mapping (map identifiers between releases): gene_archive, mapping_session, peptide_archive, stable_id_event

External references (IDs to objects in other dbs): external_db, external_synonym, go_xref, identity_xref, object_xref, xref

Miscellaneous: interpro


Understanding The Ensembl Schema Using The Catalog


Querying Ensembl

- Ensembl runs on the MySQL database engine
- We’ll use WinSQL
  - MySQL Query Browser can also be used, as well as lots of other querying tools


Before Proceeding: A Word of Caution

- Easy to write queries that…
  - Retrieve nonsense
  - Never complete
    - Scotty to Captain Kirk: “We’re going in circles, and at warp 6 we’re going mighty fast…”
- Understanding the schema is the only way to prevent this
- Tips:
  - Use “count” to determine the number of rows in a table BEFORE returning large datasets
  - Remember: the more tables are joined, the slower the query
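The "count first" tip can be sketched as a small guard around a query. The example below uses Python's built-in sqlite3 module with a hypothetical single-column table named marker filled with dummy rows; the helper name count_rows and the MAX_ROWS threshold are also invented for illustration.

```python
import sqlite3


def count_rows(conn, table):
    """Return the number of rows in a table.

    Table names cannot be bound as query parameters, so only pass
    names you trust (here, a hard-coded one).
    """
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Assumption: a toy table with 5000 dummy rows stands in for a large
# remote table; it is not the real Ensembl marker table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marker (marker_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO marker VALUES (?)", [(i,) for i in range(1, 5001)])

MAX_ROWS = 1000  # arbitrary threshold chosen for this sketch

n = count_rows(conn, "marker")
if n <= MAX_ROWS:
    # Small enough: return everything.
    rows = conn.execute("SELECT * FROM marker").fetchall()
else:
    # Too large: look at a small sample instead of the full table.
    rows = conn.execute("SELECT * FROM marker LIMIT 10").fetchall()

print(f"table has {n} rows; fetched {len(rows)}")
```

The COUNT(*) query returns a single number, so it is cheap to transfer even when the table itself is far too large to pull across the network.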


Demo Queries… To Get You Started

- Query 1: return the number of genes stored in Ensembl Human
- Query 2: return the number of transcripts produced by genes stored in Ensembl Human
  - Demonstrates JOINing of tables
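The shape of the two demo queries can be rehearsed offline. This is a sketch only: it uses Python's built-in sqlite3 module and two toy tables named gene and transcript, linked by a gene_id column. The link column and the sample rows are assumptions made for illustration; the real Ensembl tables have many more columns, so check the schema documentation before running the equivalent queries against Ensembl.

```python
import sqlite3

# Assumption: stripped-down toy versions of the gene and transcript
# tables, with invented rows. Not a copy of the Ensembl schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, biotype TEXT);
    CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
    INSERT INTO gene VALUES
        (1, 'protein_coding'), (2, 'protein_coding'), (3, 'pseudogene');
    INSERT INTO transcript VALUES (100, 1), (101, 1), (102, 2), (103, 3);
""")

# Query 1: number of genes (a single-table count).
gene_count = conn.execute("SELECT COUNT(*) FROM gene").fetchone()[0]

# Query 2: number of transcripts produced by genes (a count over a join).
transcript_count = conn.execute("""
    SELECT COUNT(*)
    FROM gene, transcript
    WHERE gene.gene_id = transcript.gene_id
""").fetchone()[0]

print("genes:", gene_count)
print("transcripts:", transcript_count)
```

Because one gene can have several transcripts, the joined count is at least as large as the number of genes that have any transcript.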


Exercises

Together (10 min):
1. the number of genes stored in Ensembl Human
2. the number of transcripts produced by genes stored in Ensembl Human

On your own (20 min):
3. the types of analyses that Ensembl provides
4. the number of types of markers
5. the number of markers per chromosome for all chromosomes
6. Extra points: the minimum and maximum marker distances for markers on chromosome 19


SELECT Statement: A Refresher

SELECT [DISTINCT] select_list
FROM table_list
[WHERE conditions]
[START WITH] [CONNECT BY] (Oracle hierarchical queries)
[GROUP BY group_by_list]
[HAVING search_conditions]
[ORDER BY order_list [ASC | DESC]]

“Modifiers” of select list:
- DISTINCT
- COUNT
- SUM
- MIN
- MAX

Also:
- ORDER BY
- LIKE (used in WHERE clause)
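The clauses and modifiers listed above can be exercised on a toy table. The sketch below uses Python's built-in sqlite3 module; the table marker_toy, its columns, and its rows are hypothetical and do not mirror the Ensembl marker tables.

```python
import sqlite3

# Assumption: a made-up table of markers with a chromosome label and
# a position, invented purely to exercise the SELECT clauses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marker_toy (name TEXT, chromosome TEXT, position INTEGER)")
conn.executemany(
    "INSERT INTO marker_toy VALUES (?, ?, ?)",
    [
        ("D1S100", "1", 500),
        ("D1S200", "1", 1500),
        ("D1S300", "1", 900),
        ("D19S10", "19", 200),
        ("D19S20", "19", 700),
        ("DXS5", "X", 300),
    ],
)

# DISTINCT + ORDER BY: each chromosome label once, sorted as text.
chromosomes = conn.execute(
    "SELECT DISTINCT chromosome FROM marker_toy ORDER BY chromosome"
).fetchall()

# COUNT/MIN/MAX with GROUP BY, filtered by HAVING, sorted by ORDER BY:
# per-chromosome summaries for chromosomes with at least 2 markers.
per_chromosome = conn.execute("""
    SELECT chromosome, COUNT(*), MIN(position), MAX(position)
    FROM marker_toy
    GROUP BY chromosome
    HAVING COUNT(*) >= 2
    ORDER BY COUNT(*) DESC
""").fetchall()

# LIKE in the WHERE clause: % matches any run of characters.
chr19_names = conn.execute(
    "SELECT name FROM marker_toy WHERE name LIKE 'D19%' ORDER BY name"
).fetchall()

print(chromosomes)
print(per_chromosome)
print(chr19_names)
```

WHERE filters individual rows before grouping, whereas HAVING filters the groups after COUNT, MIN, and MAX have been computed.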


Example Of A Biologically-Useful Query: All Markers on Chromosome 1


Now We’re Talking: Returning Results into Your Favorite Tool

SQL query results returned to…
- MS Excel
  - … using Data/Import External Data/New Database Query
  - Details: Excel Advanced Report Development, Zapawa 2005 (in Lane catalog)
- Spotfire
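The slide's route into Excel is the built-in database query feature. A different and simpler technique, shown here only as a sketch and not as the method used in class, is to save query results as CSV text, which Excel and most other tools can open. It uses Python's built-in sqlite3 and csv modules; the table gene_toy and its rows are hypothetical.

```python
import csv
import io
import sqlite3

# Assumption: a made-up two-column table stands in for real query results.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_toy (name TEXT, chromosome TEXT)")
conn.executemany(
    "INSERT INTO gene_toy VALUES (?, ?)",
    [("geneA", "1"), ("geneB", "19")],
)

cursor = conn.execute("SELECT name, chromosome FROM gene_toy ORDER BY name")

# Write a header row taken from the cursor's column names, then the data.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([column[0] for column in cursor.description])
writer.writerows(cursor)

csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file instead of a string, the same writer can be pointed at `open("results.csv", "w", newline="")`, where the file name is again only an example.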


Next: BioWarehouse

- Produced by SRI International
- Integration of genome, biochemical reaction, pathway, etc. databases from many different organisms
- Free, open source
- Accessing PublicHouse
- FAQ
- Schema
- All PubMed references pertaining to BioWarehouse and written by the BioWarehouse group


Conceptual Views of the BioWarehouse Database


Querying BioWarehouse

- We’ll query using MySQL Query Browser
- Caveats:
  - Lots of datasets supported by BioWarehouse…
  - … but some critical ones are missing from PublicHouse due to licensing requirements, e.g.:
    - MetaCyc
    - UniProt
- Also: need to request an account to query
  - Anonymous user not supported
- Resource: MySQL v5 Reference Manual


BioWarehouse Demo Queries…to get you started

Query 1: What are the datasets available in PublicHouse?

Query 2: How many pathways are there for the EcoCyc dataset?


Example Biologically Meaningful Query of BioWarehouse: For a Given Pathway, Return the Proteins Involved in the Pathway and Their Molecular Weights

SELECT D.Name AS PathwayName,
       J.WID AS ProteinWID,
       J.Name AS ProteinName,
       J.MolecularWeightCalc AS MolecularWeightCalc
FROM Pathway D, PathwayReaction F, Reaction G, EnzymaticReaction H, Protein J
WHERE D.WID = F.PathwayWID          -- pathway to its reactions
  AND F.ReactionWID = G.WID
  AND G.WID = H.ReactionWID         -- reaction to its enzymatic reactions
  AND H.ProteinWID = J.WID          -- enzymatic reaction to its protein
  AND D.DataSetWID = 19
  AND D.Name LIKE "%lipopolysaccharide%"
ORDER BY ProteinName
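The five-table chain in the query above can be rehearsed on toy data before requesting a PublicHouse account. This is a sketch under stated assumptions: it uses Python's built-in sqlite3 module, keeps only the table and column names that appear in the slide's query, and fills them with invented rows. It also writes the joins with explicit JOIN ... ON syntax, which is an equivalent alternative to listing the tables in FROM and linking them in WHERE.

```python
import sqlite3

# Assumption: minimal toy tables holding only the columns used in the
# slide's query; all rows and names such as 'enzyme A' are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pathway (WID INTEGER, Name TEXT, DataSetWID INTEGER);
    CREATE TABLE PathwayReaction (PathwayWID INTEGER, ReactionWID INTEGER);
    CREATE TABLE Reaction (WID INTEGER);
    CREATE TABLE EnzymaticReaction (ReactionWID INTEGER, ProteinWID INTEGER);
    CREATE TABLE Protein (WID INTEGER, Name TEXT, MolecularWeightCalc REAL);

    INSERT INTO Pathway VALUES
        (1, 'toy lipopolysaccharide biosynthesis', 19),
        (2, 'toy glycolysis', 19);
    INSERT INTO PathwayReaction VALUES (1, 10), (1, 11), (2, 12);
    INSERT INTO Reaction VALUES (10), (11), (12);
    INSERT INTO EnzymaticReaction VALUES (10, 100), (11, 101), (12, 102);
    INSERT INTO Protein VALUES
        (100, 'enzyme B', 34.5), (101, 'enzyme A', 28.1), (102, 'enzyme C', 51.0);
""")

# Same logic as the slide's query, with each link in the chain
# (pathway -> reaction -> enzymatic reaction -> protein) on its own line.
proteins = conn.execute("""
    SELECT D.Name AS PathwayName,
           J.WID AS ProteinWID,
           J.Name AS ProteinName,
           J.MolecularWeightCalc AS MolecularWeightCalc
    FROM Pathway D
    JOIN PathwayReaction F ON D.WID = F.PathwayWID
    JOIN Reaction G ON F.ReactionWID = G.WID
    JOIN EnzymaticReaction H ON G.WID = H.ReactionWID
    JOIN Protein J ON H.ProteinWID = J.WID
    WHERE D.DataSetWID = 19
      AND D.Name LIKE '%lipopolysaccharide%'
    ORDER BY ProteinName
""").fetchall()

for row in proteins:
    print(row)
```

Only the proteins reachable from the matching pathway come back; the protein attached to the other toy pathway is excluded by the LIKE condition.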


Exercises

Together (10 min):
1. How many datasets are there in PublicHouse?
2. What is the number of genes in S. aureus (SAUR158878Cyc)?

On your own (20 min):
3. List the coding region starts and ends for all genes that code for proteins in the SAUR158878Cyc dataset
4. How many biochemical reactions are there in each pathway (of any type) in the EcoCyc (= E. coli) dataset?


In Summary…
- Knowing the db schema is essential
- The SELECT statement is all you need to know
- Remote databases are good for exploring a schema at low cost
  - No installation…
- But:
  - Performance can be poor
  - Restrictions on data set
  - Better to install locally if “real work” is to be performed
- Remember: SQL gives you the power to return results directly into your favorite tool!


Don’t Forget The Class Evaluation


Resources


Setting-Up for Internet SQL Querying


Setting Up Data Source Names

Steps
1. Make sure you have the requisite driver (next slide)
2. Create a Data Source Name (Windows only)
3. Write your query
4. Get the results back into Excel!

See Lane videorecorded class Managing Experiment Data Using Excel and Friends: Digging Out from Under the Avalanche for lots more details.


Step 1: Getting Drivers
Essential for SQL Querying
- A driver is a piece of software that lets your operating system talk to a database
  - Installed drivers are visible in the ODBC manager (“data connectivity” tool)
- Each database engine (Oracle, MySQL, etc.) requires its own driver
  - Generally must be installed by the user
- Drivers are needed by the Data Source Name tool and querying programs
  - Require (simple) installation


MySQL Driver: Needed to Query MySQL Databases

- Windows: Download MySQL Connector/ODBC 3.51 here
- Must be installed for direct querying using, e.g., Excel
- Not necessary if you are using the MySQL Query Browser


Oracle Driver: Needed to Query Oracle Databases

- Installing “client” software will also install the driver
  - Windows: Download 10g Client here
  - Mac: Download 10g Client here
- Free Oracle user account required to download
- Must be installed if you are querying using MS Query or any other query browser involving Oracle


Step 2: Creating a Data Source Name

- A Data Source Name (DSN) tells programs on your PC where and how to query a database
- Populating the fields:
  - Data Source Name: unique name of your choice
  - Description: anything
  - Server: exactly as given by the database provider
  - Port number: as specified by the database provider
    - Defaults: MySQL: 3306; Oracle: 1521; MS Access: N/A
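The same fields that a DSN stores can also be written out as a single ODBC connection string, which some programs accept instead of a DSN. The sketch below only assembles such a string in Python and does not connect to anything. The function name, host, database, and credentials are hypothetical, and the driver name and keyword spellings (DRIVER, SERVER, PORT, DATABASE, UID, PWD) are given as the author of this example recalls them for MySQL Connector/ODBC 3.51, so treat them as assumptions and confirm them against the driver's documentation.

```python
def build_odbc_connection_string(driver, server, port, database, user, password):
    """Assemble an ODBC connection string from the fields a DSN would store."""
    parts = {
        "DRIVER": "{" + driver + "}",  # driver name as shown in the ODBC manager
        "SERVER": server,              # exactly as given by the database provider
        "PORT": str(port),             # e.g. 3306 is the MySQL default
        "DATABASE": database,
        "UID": user,
        "PWD": password,
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


# All values below are placeholders, not a real server or account.
conn_str = build_odbc_connection_string(
    driver="MySQL ODBC 3.51 Driver",
    server="db.example.org",
    port=3306,
    database="example_db",
    user="example_user",
    password="example_password",
)
print(conn_str)
```

Keeping the server and port in one place, whether a DSN or a string like this, means a change announced by the database provider only has to be made once.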


Resources – SQL

- eBook: Beginning SQL
- eBook: Learning SQL


Lots More Resources From Lane


How To Get Accounts for Direct SQL Querying

Direct Querying of Selected Bioinformatics Databases

Database: BioWarehouse
  How? Go to http://biowarehouse.ai.sri.com/ and get an account for access to publichouse (publicly accessible installation of BioWarehouse; see http://biowarehouse.ai.sri.com/PublicHouseOverview.html)
  DB engine: MySQL

Database: Ensembl
  How? http://www.ensembl.org/info/data/download.html
  DB engine: MySQL

Database: Mouse Genome Database
  How? Mail [email protected] to ask for an account
  DB engine: Sybase


Example: Querying with MySQL Query Browser
- Free
- MySQL only
- Facilitates writing of a SQL query
  - graphical
- Get it at http://www.mysql.com/products/tools/query-browser/

[Screenshot labels: query statement, execute statement, table descriptions]