BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006


BioMart

Databases made easy

Richard Holland
European Bioinformatics Institute
Helsinki, September 2006

BioMart

• A joint project
  – European Bioinformatics Institute (EBI)
  – Cold Spring Harbor Laboratory (CSHL)
• Aim
  – To develop a generic, query-oriented data management system capable of integrating distributed data sources.

Focus

• ‘Data mining’ or advanced search
  – Creating custom datasets
  – Querying multiple datasets
  – Interactive
• Users
  – People who provide database-based services
  – ‘Power user’ biologists and bioinformaticians

Requirements

• User
  – ‘One-stop shop’ for biological data
  – Suitable for power biologists and bioinformaticians
  – A set of interfaces that allow users to group and refine biological data based upon many criteria
• Deployer
  – ‘Out of the box’ installation
  – Built-in query optimization
  – Easy data federation
• Architecture
  – Domain agnostic
  – Distributed
  – Platform independent


Advanced search GUIs


Single interface


Single access point

Queries across different databases

[Figure: Dataset 1 and Dataset 2 connected by Links.]

Main features

• Domain agnostic
• Platform independent (MySQL, Oracle, Postgres)
• Scalable for big datasets
• Federated architecture
• Automated UI configuration


How does it work?

BioMart

[Figure labels: Source data; Data mart; XML metadata; BioMart software.]

Federated architecture

[Figure label: Query Engine.]

Data model

[Figure: schema diagram of tables related through primary keys (PK) and foreign keys (FK).]

Data model

[Figure: the same schema diagram extended with further foreign-key (FK) relations between tables.]

Data model - ‘reversed star’

[Figure: main tables (main 1 with PK1; main 2 with PK1 and PK2) surrounded by dimension tables (dm) that carry the foreign keys FK1 and/or FK2.]
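The ‘reversed star’ layout can be illustrated with a small in-memory database. This is only a sketch under stated assumptions: the table names, column names and rows below are hypothetical, and SQLite stands in for the MySQL, Oracle or Postgres servers named in the slides. It shows one main table with a primary key and one dimension table whose foreign key points back at the main table, which is what the PK/FK labels in the figure appear to depict.

```python
import sqlite3

# Hypothetical miniature mart: one main table and one dimension (dm) table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE demo__gene__main (
        gene_id_key INTEGER PRIMARY KEY,   -- PK of the main table
        gene_name   TEXT
    );
    CREATE TABLE demo__xref__dm (
        gene_id_key INTEGER REFERENCES demo__gene__main(gene_id_key),  -- FK to main
        xref        TEXT
    );
    INSERT INTO demo__gene__main VALUES (1, 'geneA'), (2, 'geneB');
    INSERT INTO demo__xref__dm VALUES (1, 'X100'), (1, 'X101'), (2, 'X200');
    """
)


def xrefs_for(gene_name):
    """Return the dimension rows attached to one main-table row."""
    rows = conn.execute(
        """
        SELECT d.xref
        FROM demo__gene__main AS m
        JOIN demo__xref__dm AS d ON d.gene_id_key = m.gene_id_key
        WHERE m.gene_name = ?
        ORDER BY d.xref
        """,
        (gene_name,),
    ).fetchall()
    return [row[0] for row in rows]


print(xrefs_for("geneA"))
```

In this layout the 1:n data lives in the dimension table, so the main table keeps one row per gene.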


Data mart and dataset

Dataset


Data mart, dataset and virtual schema

virtual schema

BioMart abstractions

• Dataset
  – A subset of data organized into 1 or more tables
• Attribute
  – A single data point, e.g. gene name
• Filter
  – An operation on an attribute, e.g. ‘Chromosome = 1’

Datasets, Attributes and Filters

[Figure: a GENE table annotated with the labels Mart, Dataset, Attribute and Filter.]

GENE
  gene_id (PK)
  gene_stable_id
  gene_start
  gene_chrom_end
  chromosome
  gene_display_id
  description
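One way to read these abstractions in relational terms is that an attribute selects a column and a filter adds a condition on a column. The sketch below assumes exactly that mapping and uses a cut-down GENE table in SQLite. The rows are invented sample data, and `run_query` is a hypothetical helper, not part of the BioMart API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, gene_stable_id TEXT, "
    "chromosome TEXT, gene_display_id TEXT)"
)
# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?)",
    [
        (1, "G0001", "1", "alpha"),
        (2, "G0002", "2", "beta"),
        (3, "G0003", "1", "gamma"),
    ],
)

ALLOWED_COLUMNS = {"gene_id", "gene_stable_id", "chromosome", "gene_display_id"}


def run_query(attributes, filters):
    """Attributes become SELECT columns; filters become equality conditions."""
    for name in list(attributes) + list(filters):
        if name not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {name}")
    select = ", ".join(attributes)
    where = " AND ".join(f"{col} = ?" for col in filters) or "1 = 1"
    sql = f"SELECT {select} FROM gene WHERE {where} ORDER BY gene_id"
    return conn.execute(sql, tuple(filters.values())).fetchall()


# Attribute: gene_display_id; filter: chromosome = 1.
result = run_query(["gene_display_id"], {"chromosome": "1"})
print(result)
```

Column names are checked against a whitelist because they are interpolated into the SQL text, while filter values are passed as bound parameters.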

BioMart abstractions (cont)

• Link
  – ‘Common currency’ between two datasets, e.g. accession
• Exportable
  – Potential links to export
• Importable
  – Potential links to import

Exportables, Importables and Links

[Figure: Dataset 1 and Dataset 2 connected by Links.]

Exportables, Importables and Links

Dataset 1 (Exportable)
  name = uniprot_id
  attributes = uniprot_ac

Dataset 2 (Importable)
  name = uniprot_id
  filters = uniprot_ac

[Figure: the Exportable of Dataset 1 and the Importable of Dataset 2 are joined by Links.]

Exportables, Importables and Links

Dataset 1 (Exportable)
  name = genomic_region
  attributes = chr_name, chr_start, chr_end

Dataset 2 (Importable)
  name = genomic_region
  filters = chr_name (=), chr_start (>=), chr_end (<=)

[Figure: the Exportable of Dataset 1 and the Importable of Dataset 2 are joined by Links.]
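The genomic_region example can be sketched in a few lines of Python. The assumptions here are that exported attributes are paired with importable filters by position, that each filter applies its operator to the exported value, and that datasets are plain lists of dictionaries. The `link` function and the sample rows are hypothetical and do not come from the BioMart code base.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}

# Exportable of dataset 1: the attributes it offers under the link name.
exportable = {
    "name": "genomic_region",
    "attributes": ["chr_name", "chr_start", "chr_end"],
}
# Importable of dataset 2: the filters (with operators) that accept those values.
importable = {
    "name": "genomic_region",
    "filters": [("chr_name", "="), ("chr_start", ">="), ("chr_end", "<=")],
}


def link(rows1, rows2, exportable, importable):
    """Pair each dataset 1 row with the dataset 2 rows that pass every filter."""
    if exportable["name"] != importable["name"]:
        raise ValueError("exportable and importable names do not match")
    pairs = list(zip(exportable["attributes"], importable["filters"]))
    matches = []
    for r1 in rows1:
        for r2 in rows2:
            if all(OPS[op](r2[flt], r1[attr]) for attr, (flt, op) in pairs):
                matches.append((r1, r2))
    return matches


# Invented sample data: one region and three candidate features.
regions = [{"id": "region1", "chr_name": "1", "chr_start": 100, "chr_end": 500}]
features = [
    {"id": "f1", "chr_name": "1", "chr_start": 150, "chr_end": 200},
    {"id": "f2", "chr_name": "1", "chr_start": 450, "chr_end": 600},
    {"id": "f3", "chr_name": "2", "chr_start": 150, "chr_end": 200},
]

linked_ids = [(a["id"], b["id"]) for a, b in link(regions, features, exportable, importable)]
print(linked_ids)
```

With these operators, a dataset 2 row matches only when it lies on the same chromosome and entirely inside the exported region.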


Creating BioMart databases

Building BioMart databases

[Figure labels: Source databases; Transformation (MartBuilder); Mart; Configuration (MartEditor); XML.]

Schema transformation principles

• Central table
  – Longest n:1, 1:1 path
• Dimension table
  – Central transformation ‘around’ 1:n table
  – Link tables are decomposed into a set of 1:n first

MartBuilder Application

• Reads database metadata
• Transforms a source schema into suggested datasets and lets you edit the process
• Produces a set of SQL statements (DDL) to run against the server to perform the transformation


Dataset Configuration

• Dataset configuration
• Attributes
• Filters
• Trees, Groups, Collections
• Exportables, Importables
• Semantics
• Relational mapping
• User interface
• Linking datasets
• XML-based

Table naming convention
Naïve configuration

• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key

Naming convention examples

• Homo sapiens gene ensembl
  – hsapiens_gene_ensembl__gene__main
  – hsapiens_gene_ensembl__xref_hugo__dm
• Encode
  – hsapiens_encode__encode__main
• Uniprot
  – uniprot__protein__main
  – uniprot__interpro__dm
• Uniprot sequence
  – uniprot_sequence__sequence__main
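The dataset__content__type convention is regular enough to split mechanically. The helper below is a minimal sketch, assuming double underscores appear only as separators and that the type is either main or dm. `MartTable` and `parse_table_name` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class MartTable(NamedTuple):
    dataset: str
    content: str
    kind: str  # "main" or "dm"


def parse_table_name(name: str) -> MartTable:
    """Split a data table name of the form dataset__content__type."""
    parts = name.split("__")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a dataset__content__type name: {name!r}")
    dataset, content, kind = parts
    if kind not in ("main", "dm"):
        raise ValueError(f"unknown table type: {kind!r}")
    return MartTable(dataset, content, kind)


for table in (
    "hsapiens_gene_ensembl__gene__main",
    "hsapiens_gene_ensembl__xref_hugo__dm",
    "uniprot__interpro__dm",
):
    print(parse_table_name(table))
```

Single underscores inside a part, as in hsapiens_gene_ensembl or xref_hugo, are left untouched because only the double underscore is treated as a separator.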

Dataset Configuration

[Figure: three XML configuration documents.]


MartEditor


Accessing BioMart databases

BioMart architecture

[Figure: architecture diagram with the following labels.]

• Databases: myDatabase, SNP, Vega, Ensembl, UniProt, myMart, MSD; public data (local or remote)
• Schema transformation: MartBuilder
• Configuration: MartEditor (XML)
• BioMart API: Java, Perl
• Retrieval: MartExplorer, MartShell, MartView


MartView (current)


MartView (new 0_5)


MartExplorer


MartShell

Using = dataset

Get = attribute

Where = filter

MartShell (MQL)

● Uses Mart Query Language (MQL) to generate queries:
    using <dataset> get <attributes> where <filters>
● Can join datasets together:
    using Dataset1 get Attribute1 where Filter1=var1 as q;
    using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
    martshell.sh -E MQLscript.mql > results.txt
    martshell.sh -E MQLscript.mql | wc
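Since MQL statements follow the fixed using/get/where shape, script files can be generated programmatically. The helper below is a rough sketch: `build_mql` is a hypothetical function, joining multiple attributes with commas is an assumption not confirmed by the slides, and the dataset, attribute and filter names are taken from the MartShell examples in this deck.

```python
def build_mql(dataset, attributes, filters=(), name=None):
    """Compose one MQL statement: using <dataset> get <attributes> where <filters>."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    # Assumption: multiple attributes are comma-separated.
    query = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        query += " where " + " and ".join(filters)
    if name:
        # 'as <name>' stores the result set so a later query can refer to it.
        query += f" as {name}"
    return query + ";"


statement = build_mql(
    "MSD.msd",
    ["pdb_id"],
    ["resolution_less < 1.5", "has_ec_info only"],
    name="q",
)
print(statement)
```

The printed statement could be written to a file such as MQLscript.mql and passed to martshell.sh -E as shown on the slide.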

MartShell examples

MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only;
193l
194l
1arb
...

MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q;
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q;
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGGAA ....


biomaRt


Taverna


DAS ProServer

BioMart deployers

• Large scale data federation (EBI)
• Optimising access to a large database (Ensembl, WormBase)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc …)

Hinxton example

[Figure labels:]
• EBI: Uniprot, MSD
• SANGER: Ensembl, SNP, Vega, Sequence
• WWW

BioMart deployers

• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc …)

WormBase

[Figure labels:]
• Genes
• Expression
• Phenotypes
• Variations
• Literature
• Ontologies
• Sequence

Ensembl

[Figure labels:]
• Genes
• Ontologies
• Variations
• Protein annotation
• Disease
• Homologies
• Sequence
• Array annotations

HapMap

[Figure labels:]
• Population
• Frequencies
• Inter population comparisons
• Gene annotation


ArrayExpress

BioMart deployers

• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase)
• Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc …)

In development

• CAPRISA
• RGD
• DICTYBASE
• PURDUE UNIVERSITY
• RZPD


Music Mart

BioMart model

• Already applied
  – Ensembl
  – Vega
  – SNP
  – Uniprot
  – MSD
  – ArrayExpress
  – WormBase
  – Gramene
  – HapMap
  – Variety of ‘in house’ projects (academic and industrial)

User restriction

[Figure: one Dataset with two XML configurations, selected by martUser: “default” and “advanced”.]

Interface configuration

[Figure: one Dataset with two XML configurations, selected by Interface: “single-page web interface” and “wizard style web interface”.]

Web services

[Figure: MartView with a Local Mart on port 3306; a direct port 3306 connection to a Remote Mart marked with an X; and MartService, reached over port 80 with XML, connected to the Remote Mart on port 3306.]

Web services (cont)

MartService requests

• Registry XML
• Dataset information: name, type etc
• DatasetConfig XML
• Mart Query:
  – API query object is converted to an XML representation on the client and sent to the server.
  – Query object is regenerated on the server and processed. Results are sent back to the client as a simple tab-delimited HTML page.
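The Mart Query round trip described above (query object to XML on the client, XML back to a query object on the server) can be sketched as follows. Everything here is an assumption made for illustration: the `Query` class, the `to_xml` and `from_xml` helpers, and the element and attribute names are not taken from the slides and should not be read as the actual MartService wire format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical stand-in for an API query object."""
    dataset: str
    attributes: list = field(default_factory=list)
    filters: dict = field(default_factory=dict)


def to_xml(query):
    """Client side: serialise the query object to an XML string."""
    root = ET.Element("Query")
    dataset = ET.SubElement(root, "Dataset", name=query.dataset)
    for attribute in query.attributes:
        ET.SubElement(dataset, "Attribute", name=attribute)
    for name, value in query.filters.items():
        ET.SubElement(dataset, "Filter", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Server side: regenerate the query object from the XML string."""
    dataset = ET.fromstring(text).find("Dataset")
    return Query(
        dataset=dataset.get("name"),
        attributes=[a.get("name") for a in dataset.findall("Attribute")],
        filters={f.get("name"): f.get("value") for f in dataset.findall("Filter")},
    )


client_query = Query("hsapiens_gene_ensembl", ["gene_name"], {"chromosome": "1"})
wire_format = to_xml(client_query)
server_query = from_xml(wire_format)
print(wire_format)
print(server_query == client_query)
```

Sending XML over port 80 rather than opening a database connection means the client never needs direct access to the remote mart's database port.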

Summary

• A generic data management system
  – A set of easily configurable user interfaces
  – Distributed data federation
  – Query optimization

BioMart

• www.biomart.org
• Open source (LGPL)
• Public MySQL server
• ftp
• [email protected]
• [email protected]

Acknowledgments

• BioMart
  – Arek Kasprzyk (EBI)
  – Damian Smedley (EBI)
  – Syed Haider (EBI)
  – Gudmundur Thorisson (CSHL)
• Contributors
  – Darin London (EBI)
  – Will Spooner (CSHL)
  – Damian Keefe (Ensembl)
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (Uniprot)
  – Paul Donlon (Unilever)
  – Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
  – Benoit Ballester (Universite de la Mediterranee)
  – Stephen Robinson (EBI)
  – Asif Kibria (EBI)