
BioMart

Databases made easy

Richard Holland
European Bioinformatics Institute
Helsinki, September 2006

BioMart

• A joint project
  – European Bioinformatics Institute (EBI)
  – Cold Spring Harbor Laboratory (CSHL)
• Aim
  – To develop a generic, query-oriented data management system capable of integrating distributed data sources

Focus

• ‘Data mining’ or advanced search
  – Creating custom datasets
  – Querying multiple datasets
  – Interactive
• Users
  – People who provide database-based services
  – ‘Power user’ biologists and bioinformaticians

Requirements

• User
  – ‘One-stop shop’ for biological data
  – Suitable for power biologists and bioinformaticians
  – A set of interfaces that allow users to group and refine biological data based upon many criteria
• Deployer
  – ‘Out of the box’ installation
  – Built-in query optimization
  – Easy data federation
• Architecture
  – Domain agnostic
  – Distributed
  – Platform independent

Advanced search GUIs

Single interface

Single access point

Queries across different databases

[Figure: Dataset 1 and Dataset 2 connected by links]

Main features

• Domain agnostic
• Platform independent (MySQL, Oracle, Postgres)
• Scalable for big datasets
• Federated architecture
• Automated UI configuration

How does it work?

[Figure: BioMart overview. Labels: source data, data mart, XML meta data, BioMart software, query engine]

Federated architecture

[Figure: tables connected through primary keys (PK) and foreign keys (FK)]

Data model

[Figure: tables connected through primary keys (PK) and foreign keys (FK)]

Data model

[Figure: main tables (main1 with PK1; main2 with PK2 and PK1) surrounded by dimension tables (dm) that reference them through the foreign keys FK1 and FK2]

Data model - ‘reversed star’

Data mart and dataset

Dataset

Data mart, dataset and virtual schema

virtual schema

BioMart abstractions

• Dataset
  – A subset of data organized into one or more tables
• Attribute
  – A single data point, e.g. gene name
• Filter
  – An operation on an attribute, e.g. ‘Chromosome = 1’
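The three abstractions can be pictured in a few lines of Python. This is only an illustrative sketch, not BioMart code: the class names, the in-memory `rows` layout and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class Filter:
    """An operation on an attribute, e.g. chromosome = '1'."""
    attribute: str
    test: Callable[[object], bool]

    def accepts(self, row: Row) -> bool:
        return self.test(row[self.attribute])


@dataclass
class Dataset:
    """A subset of data; here just a list of rows (hypothetical layout)."""
    name: str
    rows: List[Row] = field(default_factory=list)

    def query(self, attributes: List[str], filters: List[Filter]) -> List[Tuple]:
        """Return the requested attributes for every row passing all filters."""
        return [
            tuple(row[a] for a in attributes)
            for row in self.rows
            if all(f.accepts(row) for f in filters)
        ]


# Made-up sample data, for illustration only.
genes = Dataset("gene", [
    {"gene_name": "geneA", "chromosome": "1"},
    {"gene_name": "geneB", "chromosome": "2"},
    {"gene_name": "geneC", "chromosome": "1"},
])

on_chr1 = Filter("chromosome", lambda value: value == "1")
result = genes.query(["gene_name"], [on_chr1])
```

Here `result` holds the `gene_name` attribute of the rows that pass the chromosome filter.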

Datasets, Attributes and Filters

GENE table: gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description

[Figure labels: Mart, Dataset, Attribute, Filter]

BioMart abstractions (cont)

• Link
  – ‘Common currency’ between two datasets, e.g. accession
• Exportable
  – Potential links to export
• Importable
  – Potential links to import

Exportables, Importables and Links

[Figure: Dataset 1 and Dataset 2 connected by links]

Exportables, Importables and Links

Dataset 1 (Exportable)
  name = uniprot_id
  attributes = uniprot_ac

Dataset 2 (Importable)
  name = uniprot_id
  filters = uniprot_ac

Links

Exportables, Importables and Links

Dataset 1 (Exportable)
  name = genomic_region
  attributes = chr_name, chr_start, chr_end

Dataset 2 (Importable)
  name = genomic_region
  filters = chr_name (=), chr_start (>=), chr_end (<=)

Links
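One way to read the genomic_region example is that each value exported from Dataset 1 is fed, in order, into the matching filter of Dataset 2. The sketch below assumes exactly that pairing; the dictionaries, the `link` function and the sample rows are hypothetical and only mimic the idea, they are not BioMart's implementation.

```python
import operator
from typing import Dict, List

Row = Dict[str, object]

# Link definition modelled on the genomic_region slide:
# exported attributes are paired, in order, with imported filters.
EXPORTABLE = {
    "name": "genomic_region",
    "attributes": ["chr_name", "chr_start", "chr_end"],
}
IMPORTABLE = {
    "name": "genomic_region",
    "filters": [
        ("chr_name", operator.eq),   # chr_name (=)
        ("chr_start", operator.ge),  # chr_start (>=)
        ("chr_end", operator.le),    # chr_end (<=)
    ],
}


def link(rows1: List[Row], rows2: List[Row]) -> List[Row]:
    """Return rows of dataset 2 that satisfy the imported filters
    for at least one set of values exported from dataset 1."""
    if EXPORTABLE["name"] != IMPORTABLE["name"]:
        raise ValueError("exportable and importable names must match")
    exported = [[row[a] for a in EXPORTABLE["attributes"]] for row in rows1]
    linked = []
    for row in rows2:
        for values in exported:
            if all(op(row[col], value)
                   for (col, op), value in zip(IMPORTABLE["filters"], values)):
                linked.append(row)
                break
    return linked


# Made-up rows: one region in dataset 1, three features in dataset 2.
regions = [{"chr_name": "1", "chr_start": 100, "chr_end": 200}]
features = [
    {"id": "f1", "chr_name": "1", "chr_start": 120, "chr_end": 180},
    {"id": "f2", "chr_name": "1", "chr_start": 50, "chr_end": 150},
    {"id": "f3", "chr_name": "2", "chr_start": 120, "chr_end": 180},
]
linked = link(regions, features)
```

Only the feature that lies on the same chromosome and inside the exported region survives the link.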

Creating BioMart databases

Building BioMart databases

[Figure: source databases undergo transformation (MartBuilder) into a mart; the mart's configuration is held as XML (MartEditor)]

Schema transformation principles

• Central table
  – Longest n:1, 1:1 path
• Dimension table
  – Central transformation ‘around’ 1:n table
  – Link tables are decomposed into a set of 1:n first

MartBuilder Application

• Reads database metadata
• Transforms a source schema into suggested datasets and lets you edit the process
• Produces a set of SQL statements (DDL) to run against the server to perform the transformation
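To give a feel for what such generated DDL might look like, here is a small sketch using Python's built-in `sqlite3` module. Everything in it is an assumption for illustration: the source schema, the `demo__...` table names, the sample rows and the SQL itself are hand-written here, not output from MartBuilder. The main table follows the n:1 path from gene to chromosome, and the 1:n relation from gene to xref becomes a dimension table.

```python
import sqlite3

# Hypothetical normalised source schema with made-up rows.
SOURCE_SCHEMA = """
CREATE TABLE chromosome (chrom_id INTEGER PRIMARY KEY, chrom_name TEXT);
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, gene_name TEXT,
                   chrom_id INTEGER REFERENCES chromosome(chrom_id));
CREATE TABLE xref (xref_id INTEGER PRIMARY KEY, accession TEXT,
                   gene_id INTEGER REFERENCES gene(gene_id));
INSERT INTO chromosome VALUES (1, '1');
INSERT INTO gene VALUES (10, 'geneA', 1);
INSERT INTO xref VALUES (100, 'ACC_A', 10);
INSERT INTO xref VALUES (101, 'ACC_B', 10);
"""

# Hand-written stand-ins for generated transformation statements.
TRANSFORMATION_DDL = [
    # Main table: gene joined along its n:1 relation to chromosome.
    "CREATE TABLE demo__gene__main AS "
    "SELECT g.gene_id AS gene_id_key, g.gene_name, c.chrom_name "
    "FROM gene g JOIN chromosome c ON c.chrom_id = g.chrom_id",
    # Dimension table: the 1:n relation gene -> xref, keyed back to main.
    "CREATE TABLE demo__xref__dm AS "
    "SELECT x.gene_id AS gene_id_key, x.accession FROM xref x",
]


def build_demo_mart() -> sqlite3.Connection:
    """Create the source schema in memory and run the transformation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SOURCE_SCHEMA)
    for statement in TRANSFORMATION_DDL:
        conn.execute(statement)
    return conn


conn = build_demo_mart()
main_rows = conn.execute("SELECT * FROM demo__gene__main").fetchall()
dm_rows = conn.execute(
    "SELECT accession FROM demo__xref__dm ORDER BY accession"
).fetchall()
```

After the transformation, a query for gene name and chromosome needs no join at all, and cross-references are one key lookup away in the dimension table.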

Dataset Configuration

• Dataset configuration
• Attributes
• Filters
• Trees, Groups, Collections
• Exportables, Importables
• Semantics
• Relational mapping
• User interface
• Linking datasets
• XML-based

Table naming convention

Naïve configuration

• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key

Naming convention examples

• Homo sapiens gene ensembl
  – hsapiens_gene_ensembl__gene__main
  – hsapiens_gene_ensembl__xref_hugo__dm
• Encode
  – hsapiens_encode__encode__main
• Uniprot
  – uniprot__protein__main
  – uniprot__interpro__dm
• Uniprot sequence
  – uniprot_sequence__sequence__main
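Because data table names follow the dataset__content__type pattern, they can be taken apart mechanically. The helper below is a minimal sketch of that idea; `MartTable` and `parse_table_name` are hypothetical names, and the check that the type is `main` or `dm` is an assumption based only on the two types listed in the convention.

```python
from typing import NamedTuple


class MartTable(NamedTuple):
    dataset: str
    content: str
    table_type: str  # "main" or "dm"


def parse_table_name(name: str) -> MartTable:
    """Split a data table name of the form dataset__content__type."""
    parts = name.split("__")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a dataset__content__type name: {name!r}")
    dataset, content, table_type = parts
    if table_type not in ("main", "dm"):
        raise ValueError(f"unknown table type: {table_type!r}")
    return MartTable(dataset, content, table_type)


# Names taken from the examples above.
gene_main = parse_table_name("hsapiens_gene_ensembl__gene__main")
hugo_dm = parse_table_name("hsapiens_gene_ensembl__xref_hugo__dm")
```

Single underscores stay inside each component, which is why the double underscore works as an unambiguous separator.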

Dataset Configuration

[Figure: dataset configuration stored as XML files and edited with MartEditor]

Accessing BioMart databases

[Figure: layered diagram with the following labels]

• Retrieval: MartExplorer, MartShell, MartView
• BioMart API: Java, Perl
• Configuration: XML, MartEditor
• Schema transformation: MartBuilder
• Databases: myDatabase, myMart, and public data (local or remote) including SNP, Vega, Ensembl, UniProt and MSD

BioMart architecture

MartView (current)

MartView (new 0_5)

MartExplorer

MartShell

Using = dataset

Get = attribute

Where = filter

MartShell (MQL)

● Uses Mart Query Language (MQL) to generate queries:

using <dataset> get <attributes> where <filters>

● Can join datasets together:

using Dataset1 get Attribute1 where Filter1=var1 as q;

using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q

● Can script and pipe:

martshell.sh -E MQLscript.mql > results.txt

martshell.sh -E MQLscript.mql | wc
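Since MQL statements follow the fixed using/get/where pattern, scripts can assemble them from parts before writing them to a file for `martshell.sh -E`. The function below is a rough sketch of that: `build_mql` is a hypothetical helper, and joining several attributes with commas is an assumption, since the slide shows only the `<attributes>` placeholder.

```python
from typing import Optional, Sequence


def build_mql(dataset: str,
              attributes: Sequence[str],
              filters: Sequence[str] = (),
              alias: Optional[str] = None) -> str:
    """Assemble 'using <dataset> get <attributes> where <filters>'.

    Filters are passed as ready-made clauses such as 'Filter1=var1'
    and are combined with 'and'. An alias adds 'as <alias>' so that a
    later query can refer to this one with '<filter> in <alias>'.
    """
    if not attributes:
        raise ValueError("at least one attribute is required")
    # Assumption: multiple attributes are comma-separated.
    query = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        query += " where " + " and ".join(filters)
    if alias:
        query += f" as {alias}"
    return query + ";"


# The two-step join from the slide, built programmatically.
first = build_mql("Dataset1", ["Attribute1"], ["Filter1=var1"], alias="q")
second = build_mql("Dataset2", ["Attribute2"], ["Filter2=var2", "filter3 in q"])
```

Writing `first` and `second` to a `.mql` file, one per line, would give a script of the same shape as the join example above.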

MartShell examples

MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only;
193l
194l
1arb
...

MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q;
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q;
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGGAA ....

biomaRt

Taverna

DAS ProServer

BioMart deployers

• Large scale data federation (EBI)
• Optimising access to a large database (Ensembl, WormBase)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

EBI: Uniprot, MSD

Sanger: Ensembl, SNP, Vega, Sequence

WWW

Hinxton example

BioMart deployers

• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

WormBase

Genes

Expression

Phenotypes

Variations

Literature

Ontologies

Sequence


Ensembl

Genes

Ontologies

Variations

Protein annotation

Disease

Homologies

Sequence

Array annotations


HapMap

Population

Frequencies

Inter-population comparisons

Gene annotation

ArrayExpress

BioMart deployers

• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase)
• Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)

In development

• CAPRISA
• RGD
• DICTYBASE
• PURDUE UNIVERSITY
• RZPD

Music Mart

BioMart model

• Already applied
  – Ensembl
  – Vega
  – SNP
  – Uniprot
  – MSD
  – ArrayExpress
  – WormBase
  – Gramene
  – HapMap
  – Variety of ‘in house’ projects (academic and industrial)

User restriction

[Figure: dataset XML configurations selected by martUser: “default” and “advanced”]

Interface configuration

[Figure: dataset XML configurations selected by interface: “single-page web interface” and “wizard-style web interface”]

Web services

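[Figure: MartView reaches a local mart on port 3306; direct access to a remote mart on port 3306 is blocked (X), so requests go as XML over port 80 to MartService, which connects to the remote mart on port 3306]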

Web services (cont)

MartService requests

• Registry XML
• Dataset information: name, type, etc.
• DatasetConfig XML
• Mart Query
  – The API query object is converted to an XML representation on the client and sent to the server
  – The query object is regenerated on the server and processed; results are sent back to the client as a simple tab-delimited HTML page
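The Mart Query round trip can be sketched with Python's standard `xml.etree.ElementTree`. The element and attribute names used here (`Query`, `Dataset`, `Attribute`, `Filter`, `name`, `value`) are assumptions chosen for the example and are not confirmed by these slides; the `Query` class and the three functions are likewise hypothetical stand-ins for the API objects.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Query:
    """Stand-in for the API query object (hypothetical shape)."""
    dataset: str
    attributes: List[str] = field(default_factory=list)
    filters: Dict[str, str] = field(default_factory=dict)


def to_xml(query: Query) -> str:
    """Client side: serialise the query object to XML."""
    root = ET.Element("Query")
    dataset = ET.SubElement(root, "Dataset", name=query.dataset)
    for name in query.attributes:
        ET.SubElement(dataset, "Attribute", name=name)
    for name, value in query.filters.items():
        ET.SubElement(dataset, "Filter", name=name, value=value)
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> Query:
    """Server side: regenerate the query object from the XML."""
    dataset = ET.fromstring(text).find("Dataset")
    if dataset is None:
        raise ValueError("query XML has no Dataset element")
    return Query(
        dataset=dataset.get("name"),
        attributes=[a.get("name") for a in dataset.findall("Attribute")],
        filters={f.get("name"): f.get("value") for f in dataset.findall("Filter")},
    )


def parse_tab_delimited(body: str) -> List[List[str]]:
    """Client side: split a tab-delimited result body into rows."""
    return [line.split("\t") for line in body.splitlines() if line]


sent = Query("hsapiens_gene_ensembl", ["gene_stable_id"], {"chromosome": "1"})
received = from_xml(to_xml(sent))
```

The point of the round trip is that client and server share only the XML text, so the same query can be issued from any language that can build it.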

Summary

• A generic data management system
  – A set of easily configurable user interfaces
  – Distributed data federation
  – Query optimization

BioMart

• www.biomart.org
• Open source (LGPL)
• Public MySQL server
• ftp
• mart-dev@ebi.ac.uk
• mart-announce@ebi.ac.uk

Acknowledgments

• BioMart
  – Arek Kasprzyk (EBI)
  – Damian Smedley (EBI)
  – Syed Haider (EBI)
  – Gudmundur Thorisson (CSHL)
• Contributors
  – Darin London (EBI)
  – Will Spooner (CSHL)
  – Damian Keefe (Ensembl)
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (Uniprot)
  – Paul Donlon (Unilever)
  – Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
  – Benoit Ballester (Universite de la Mediterranee)
  – Stephen Robinson (EBI)
  – Asif Kibria (EBI)
