BioMart
Databases made easy
Richard Holland
European Bioinformatics Institute
Helsinki, September 2006
BioMart
• A joint project
  – European Bioinformatics Institute (EBI)
  – Cold Spring Harbor Laboratory (CSHL)
• Aim
  – To develop a generic, query-oriented data management system capable of integrating distributed data sources
Focus
• ‘Data mining’ or advanced search
  – Creating custom datasets
  – Querying multiple datasets
  – Interactive
• Users
  – People who provide database-based services
  – ‘Power user’ biologists and bioinformaticians
Requirements
• User
  – ‘One-stop shop’ for biological data
  – Suitable for power biologists and bioinformaticians
  – A set of interfaces that allow users to group and refine biological data based on many criteria
• Deployer
  – ‘Out of the box’ installation
  – Built-in query optimization
  – Easy data federation
• Architecture
  – Domain agnostic
  – Distributed
  – Platform independent
Advanced search GUIs
Single interface
Single access point
Queries across different databases
Dataset 1
Dataset 2
Links
Main features
• Domain agnostic
• Platform independent (MySQL, Oracle, Postgres)
• Scalable for big datasets
• Federated architecture
• Automated UI configuration
How does it work?
[Figure: BioMart overview. Labels: source data, data mart, XML metadata, BioMart software, query engine]
Federated architecture
Data model
[Figure: tables labelled with primary keys (PK) and foreign keys (FK); main tables (main1, 2) carrying keys PK1 and PK2, and dimension tables (dm) carrying foreign keys FK1 and FK2]
Data model - ‘reversed star’
Data mart and dataset
Dataset
Data mart, dataset and virtual schema
virtual schema
BioMart abstractions
• Dataset
  – A subset of data organized into one or more tables
• Attribute
  – A single data point, e.g. gene name
• Filter
  – An operation on an attribute, e.g. ‘Chromosome = 1’
Datasets, Attributes and Filters
GENE
gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description
Mart
Dataset
Attribute
Filter
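The three abstractions can be sketched in a few lines of Python. This is only an illustration of the idea, not BioMart's implementation: the class names, the `query` helper, and the GENE rows are invented for the example, and only the column names come from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribute:
    name: str        # a single data point to return, e.g. "gene_display_id"

@dataclass(frozen=True)
class Filter:
    attribute: str   # the attribute the filter operates on
    op: str          # one of "=", ">=", "<="
    value: object

# Comparison operators supported by this sketch.
OPS = {
    "=": lambda a, b: a == b,
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
}

def query(rows, attributes, filters):
    """Return the requested attributes for every row that passes all filters."""
    results = []
    for row in rows:
        if all(OPS[f.op](row[f.attribute], f.value) for f in filters):
            results.append(tuple(row[a.name] for a in attributes))
    return results

# Hypothetical rows standing in for the GENE table of a dataset.
GENE = [
    {"gene_id": 1, "gene_stable_id": "G0001", "chromosome": "1", "gene_display_id": "geneA"},
    {"gene_id": 2, "gene_stable_id": "G0002", "chromosome": "2", "gene_display_id": "geneB"},
    {"gene_id": 3, "gene_stable_id": "G0003", "chromosome": "1", "gene_display_id": "geneC"},
]

# Attribute = gene_display_id, Filter = 'chromosome = 1'
result = query(GENE, [Attribute("gene_display_id")], [Filter("chromosome", "=", "1")])
```

Here the dataset is the list of rows, the attribute picks the column to return, and the filter restricts which rows qualify.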
BioMart abstractions (cont)
• Link
  – ‘Common currency’ between two datasets, e.g. accession
• Exportable
  – Potential links to export
• Importable
  – Potential links to import
Exportables, Importables and Links
Dataset 1
Dataset 2
Links
Exportables, Importables and Links
Dataset 1 (Exportable)
  name = uniprot_id
  attributes = uniprot_ac
Dataset 2 (Importable)
  name = uniprot_id
  filters = uniprot_ac
Links
Exportables, Importables and Links
Dataset 1 (Exportable)
  name = genomic_region
  attributes = chr_name, chr_start, chr_end
Dataset 2 (Importable)
  name = genomic_region
  filters = chr_name (=), chr_start (>=), chr_end (<=)
Links
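A minimal sketch of how an exportable and an importable could combine into a link, covering only the equality case (the uniprot_id example). It assumes, as the slides suggest, that an exportable and an importable pair up when their names match. The helper functions and all data rows are hypothetical; the range-style genomic_region link would need per-filter operators and is not covered here.

```python
def export_values(rows, attributes):
    """Collect the exported attribute values from the first dataset."""
    return {tuple(row[a] for a in attributes) for row in rows}

def apply_import(rows, filters, exported):
    """Keep rows of the second dataset whose filter columns match an exported value."""
    return [row for row in rows if tuple(row[f] for f in filters) in exported]

# Hypothetical rows for two datasets sharing a UniProt accession.
dataset1 = [
    {"pdb_id": "1abc", "uniprot_ac": "P00001"},
    {"pdb_id": "2def", "uniprot_ac": "P00002"},
]
dataset2 = [
    {"gene": "geneA", "uniprot_ac": "P00001"},
    {"gene": "geneB", "uniprot_ac": "P99999"},
]

exportable = {"name": "uniprot_id", "attributes": ["uniprot_ac"]}
importable = {"name": "uniprot_id", "filters": ["uniprot_ac"]}

# Assumption: a link is possible when both sides use the same name.
can_link = exportable["name"] == importable["name"]

exported = export_values(dataset1, exportable["attributes"])
linked = apply_import(dataset2, importable["filters"], exported) if can_link else []
```

The attributes exported by Dataset 1 become the filter values applied to Dataset 2, which is why the exportable lists attributes and the importable lists filters.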
Creating BioMart databases
Building BioMart databases
Source databases
Mart
Transformation
MartBuilder
Configuration
XML
MartEditor
MartBuilder
Schema transformation principles
• Central table
  – Longest n:1, 1:1 path
• Dimension table
  – Central transformation ‘around’ a 1:n table
  – Link tables are decomposed into a set of 1:n relations first
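The two principles can be illustrated with a small in-memory SQLite example. This is a sketch under stated assumptions, not the SQL that MartBuilder generates: the source schema (`chromosome`, `gene`, `xref`), the target table names, and all data are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalised source schema.
CREATE TABLE chromosome (chrom_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,
    stable_id TEXT,
    chrom_id  INTEGER REFERENCES chromosome(chrom_id)
);
CREATE TABLE xref (
    xref_id INTEGER PRIMARY KEY,
    gene_id INTEGER REFERENCES gene(gene_id),
    symbol  TEXT
);
INSERT INTO chromosome VALUES (1, '1'), (2, 'X');
INSERT INTO gene VALUES (10, 'G10', 1), (11, 'G11', 2);
INSERT INTO xref VALUES (100, 10, 'SYM_A'), (101, 10, 'SYM_B');

-- Central (main) table: follow the n:1 path gene -> chromosome and flatten it.
CREATE TABLE demo__gene__main AS
    SELECT g.gene_id AS gene_id_key, g.stable_id, c.name AS chromosome
    FROM gene g JOIN chromosome c ON c.chrom_id = g.chrom_id;

-- Dimension table: the 1:n relation gene -> xref, keyed back to the main table.
CREATE TABLE demo__xref__dm AS
    SELECT x.gene_id AS gene_id_key, x.symbol
    FROM xref x;
""")

main_rows = conn.execute(
    "SELECT gene_id_key, stable_id, chromosome FROM demo__gene__main ORDER BY gene_id_key"
).fetchall()
dm_rows = conn.execute(
    "SELECT gene_id_key, symbol FROM demo__xref__dm ORDER BY symbol"
).fetchall()
```

The main table holds one row per gene with the n:1 chromosome data merged in, while the 1:n cross-references live in a dimension table that points back to the main table through the shared key column.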
MartBuilder Application
• Reads database metadata
• Transforms a source schema into suggested datasets and lets you edit the process
• Produces a set of SQL statements (DDL) to run against the server to perform the transformation
Dataset Configuration
• Dataset configuration
• Attributes
• Filters
• Trees, Groups, Collections
• Exportables, Importables
• Semantics
• Relational mapping
• User interface
• Linking datasets
• XML-based
Table naming convention
Naïve configuration
• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key
Naming convention examples
• Homo sapiens gene ensembl
  – hsapiens_gene_ensembl__gene__main
  – hsapiens_gene_ensembl__xref_hugo__dm
• Encode
  – hsapiens_encode__encode__main
• Uniprot
  – uniprot__protein__main
  – uniprot__interpro__dm
• Uniprot sequence
  – uniprot_sequence__sequence__main
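Because the convention is purely positional (dataset, content, and type separated by double underscores), a table name can be taken apart mechanically. The helper below is a hypothetical sketch of that idea, not part of the BioMart software; it accepts only the two data-table types listed above.

```python
def parse_table_name(table):
    """Split a data table name of the form dataset__content__type."""
    parts = table.split("__")
    if len(parts) != 3:
        raise ValueError(f"not a dataset__content__type name: {table!r}")
    dataset, content, kind = parts
    if kind not in ("main", "dm"):
        raise ValueError(f"unknown table type {kind!r} in {table!r}")
    return dataset, content, kind

def group_by_dataset(tables):
    """Group table names by dataset, separating main and dimension tables."""
    grouped = {}
    for table in tables:
        dataset, content, kind = parse_table_name(table)
        grouped.setdefault(dataset, {"main": [], "dm": []})[kind].append(content)
    return grouped

tables = [
    "hsapiens_gene_ensembl__gene__main",
    "hsapiens_gene_ensembl__xref_hugo__dm",
    "uniprot__protein__main",
    "uniprot__interpro__dm",
]
grouped = group_by_dataset(tables)
```

Single underscores stay inside a component (as in `hsapiens_gene_ensembl`), which is why the double underscore works as an unambiguous separator.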
Dataset Configuration
XML
XML
XML
MartEditor
Accessing BioMart databases
Retrieval
myDatabase
SNP
Vega
Ensembl
UniProt
myMart
MSD
BioMart API
JAVA Perl
MartExplorer MartShell MartView
Schema transformation
MartBuilder
XML
MartEditor
Configuration
Databases
Public data (local or remote)
BioMart architecture
MartView (current)
MartView (new 0_5)
MartExplorer
MartShell
Using = dataset
Get = attribute
Where = filter
MartShell (MQL)
● Uses Mart Query Language (MQL) to generate queries:
  using <dataset> get <attributes> where <filters>
● Can join datasets together:
  using Dataset1 get Attribute1 where Filter1=var1 as q;
  using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
  martshell.sh -E MQLscript.mql > results.txt
  martshell.sh -E MQLscript.mql | wc
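The `using ... get ... where ... as ...` shape is regular enough to assemble programmatically, for example when generating an MQL script to pass to `martshell.sh -E`. The function below is a hypothetical helper, not part of BioMart. It passes the `get` clause and each filter through verbatim, since the slides do not spell out the full MQL grammar, and it assumes every statement ends with a semicolon.

```python
def build_mql(dataset, get_clause, filters=(), alias=None):
    """Assemble one MQL statement from its parts.

    dataset    -- e.g. "MSD.msd"
    get_clause -- text placed after 'get', passed through unchanged
    filters    -- filter expressions, joined with 'and'
    alias      -- optional name for storing the result ('as <alias>')
    """
    statement = f"using {dataset} get {get_clause}"
    if filters:
        statement += " where " + " and ".join(filters)
    if alias:
        statement += f" as {alias}"
    return statement + ";"

# Rebuild the two-step pattern: store a result as q, then filter on it.
first = build_mql(
    "MSD.msd",
    "pdb_id",
    ["resolution_less < 1.5", "has_ec_info only"],
    alias="q",
)
second = build_mql(
    "Ensembl.hsapiens_gene_ensembl",
    "sequence transcript_flanks+1000",
    ["pdb in q"],
)
script = "\n".join([first, second])
```

Writing `script` to a file would give something usable in the scripted mode shown above, assuming the dataset, attribute, and filter names exist on the mart being queried.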
MartShell examples
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only;
193l
194l
1arb
...

MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q;
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q;
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGGAA ....
biomaRt
Taverna
DAS ProServer
BioMart deployers
• Large scale data federation (EBI)
• Optimising access to a large database (Ensembl, WormBase)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
EBI: Uniprot, MSD
SANGER: Ensembl, SNP, Vega, Sequence
WWW
Hinxton example
BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
WormBase
[Figure labels: Genes, Expression, Phenotypes, Variations, Literature, Ontologies, Sequence]
Ensembl
[Figure labels: Genes, Ontologies, Variations, Protein annotation, Disease, Homologies, Sequence, Array annotations]
HapMap
[Figure labels: Population, Frequencies, Inter-population comparisons, Gene annotation]
ArrayExpress
BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase)
• Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)
In development
• CAPRISA
• RGD
• DICTYBASE
• PURDUE UNIVERSITY
• RZPD
Music Mart
BioMart model
• Already applied
  – Ensembl
  – Vega
  – SNP
  – Uniprot
  – MSD
  – ArrayExpress
  – WormBase
  – Gramene
  – HapMap
  – Variety of ‘in house’ projects (academic and industrial)
User restriction
XML
Dataset
XML
martUser
“default”
“advanced”
Interface configuration
XML
Dataset
XML
Interface
“single-page web interface”
“wizard-style web interface”
Web services
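[Figure: MartView reaches a local mart directly on port 3306; direct access to a remote mart on port 3306 is marked with an X; the remote mart is reached instead through MartService on port 80, exchanging XML]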
Web services (cont)
MartService requests
• Registry XML
• Dataset information: name, type etc.
• DatasetConfig XML
• Mart Query:
  – API query object is converted to an XML representation on the client and sent to the server.
  – Query object is regenerated on the server and processed. Results are sent back to the client as a simple tab-delimited HTML page.
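The round trip described above (query object to XML on the client, XML back to a query object on the server) can be sketched with Python's standard library. The element and attribute names used here (`Query`, `Dataset`, `Attribute`, `Filter`, `name`, `value`) are assumptions made for the illustration, as the slides do not show the wire format; the dataset, attribute, and filter names are taken from earlier slides.

```python
import xml.etree.ElementTree as ET

def query_to_xml(dataset, attributes, filters):
    """Client side: serialise a query into an XML string (assumed layout)."""
    root = ET.Element("Query")
    ds = ET.SubElement(root, "Dataset", name=dataset)
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=str(value))
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(root, encoding="unicode")

def xml_to_query(xml_text):
    """Server side: regenerate the query parts from the XML string."""
    root = ET.fromstring(xml_text)
    ds = root.find("Dataset")
    attributes = [a.get("name") for a in ds.findall("Attribute")]
    filters = {f.get("name"): f.get("value") for f in ds.findall("Filter")}
    return ds.get("name"), attributes, filters

def to_tab_delimited(rows):
    """Format result rows as tab-delimited text, one row per line."""
    return "\n".join("\t".join(str(cell) for cell in row) for row in rows)

xml_text = query_to_xml(
    "hsapiens_gene_ensembl",
    ["gene_stable_id", "description"],
    {"chr_name": "1"},
)
dataset, attributes, filters = xml_to_query(xml_text)
```

Sending only XML over HTTP means the client needs no database driver and no direct access to the database port, which fits the MartService setup described on the previous slide.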
Summary
• A generic data management system
  – A set of easily configurable user interfaces
  – Distributed data federation
  – Query optimization
BioMart
• www.biomart.org
• Open source (LGPL)
• Public MySQL server
• ftp
• [email protected]
• [email protected]
Acknowledgments
• BioMart
  – Arek Kasprzyk (EBI)
  – Damian Smedley (EBI)
  – Syed Haider (EBI)
  – Gudmundur Thorisson (CSHL)
• Contributors
  – Darin London (EBI)
  – Will Spooner (CSHL)
  – Damian Keefe (Ensembl)
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (Uniprot)
  – Paul Donlon (Unilever)
  – Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
  – Benoit Ballester (Universite de la Mediterranee)
  – Stephen Robinson (EBI)
  – Asif Kibria (EBI)