SciDB
An Open Source Data Base Project
by Michael Stonebraker (and others)
Outline
Why science folks are unhappy with RDBMS
How we plan to fix that
The details
Why SciDB?
“Big science” very unhappy with RDBMS
Astronomy
HEP
Fusion
Bio
Remote sensing
Why?
Experience of Sequoia 2000 (mid 1990s)
Tried to use Postgres for science databases
Failed badly…
Main science data type is an array – horribly
inefficient to simulate arrays on top of tables
Required features absent (provenance, uncertainty,
version control)
SQL operations wrong (regrid – not join)
Why SciDB?
Net result
Mentality of “roll your own from the ground up” for
every new science project
Realization by the science community that this is
long-term suicide
Community wants to get behind something
better
Great commonality of needs among domains
A Little Context
XLDB-1 Genesis of the need
Asilomar conference (March 2008)
Small conference to generate requirements
A Little Context
March 2008 – September 2008
Initial design completed
Fund raising
Recruiting of initial team
Detailed use cases specified
Our Partnership
Science and high-end commercial folks
Who will put up some resources
And review design
DBMS brain trust
Who will design the system, oversee its
construction, and perform needed research
Non-profit company
Which will manage the open source project
And support the resulting system
May need long term funding help
Partners – Science (We are recruiting more…)
LSST astronomy project
DBMS work co-ordinated by SLAC
Pacific Northwest National Laboratory (PNNL)
Various bio projects
Lawrence Livermore National Laboratory
Fusion projects
UCSB
Remote sensing
Partners -- DBMS
Mike Stonebraker (MIT)
Dave DeWitt (Wisconsin -> Microsoft)
Jignesh Patel (Wisconsin)
Jennifer Widom (Stanford)
Dave Maier (Portland State)
Stan Zdonik (Brown)
Sam Madden (MIT)
Ugur Cetintemel (Brown)
Magda Balazinska (Washington)
Mike Carey (UCI)
Partners -- Other
eBay
Vertica
Microsoft
LSST
SLAC
Will hit up NSF and DOE
The SciDB Data Model
Nothing (e.g. Hadoop, Pig, Hive, …)?
Most of you have schemas
Hadoop is not a good starting point
Slow
No HA
The SciDB Data Model
Tables?
Makes a few of you happy
Used by Sloan Sky Survey
But
PanStarrs (Alex Szalay) wants arrays and scalability
The SciDB Data Model
Arrays?
Superset of tables (tables with a primary key are a 1-D array)
Makes HEP, remote sensing, astronomy, oceanography folks happy
But
Not biology and chemistry (who want networks and sequences)
The SciDB Data Model
Multidimensional grids
Superset of arrays (non-uniform cells)
Makes solid modeling folks happy
But
Complex and slow
SciDB Data Model
Nested multidimensional arrays
Array values are a tuple of values and arrays
Sightings (sid, details) [x, y, z, t]
Objects (type, [sid]) [id]
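One way to picture the two example arrays is the Python sketch below. It is not SciDB syntax: the class names (Sighting, ObjectCell), the helper sightings_of, and the sample values are assumptions made for illustration, and dictionaries keyed by dimension values stand in for the arrays.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Sighting:
    # One cell value of Sightings (sid, details) [x, y, z, t]
    sid: int
    details: str


@dataclass
class ObjectCell:
    # One cell value of Objects (type, [sid]) [id]; the second attribute
    # is itself a nested 1-D array of sighting ids.
    type: str
    sids: List[int] = field(default_factory=list)


# Sparse stand-ins for the arrays: dimension tuple -> cell value.
sightings: Dict[Tuple[int, int, int, int], Sighting] = {
    (1, 2, 3, 10): Sighting(sid=100, details="bright"),
    (1, 2, 4, 11): Sighting(sid=101, details="faint"),
    (5, 5, 5, 12): Sighting(sid=102, details="noise"),
}
objects: Dict[int, ObjectCell] = {
    7: ObjectCell(type="star", sids=[100, 101]),
}


def sightings_of(object_id: int) -> List[Sighting]:
    """Follow an object's nested [sid] array back to the sightings array."""
    wanted = set(objects[object_id].sids)
    return [s for s in sightings.values() if s.sid in wanted]


star_sightings = sightings_of(7)
```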
Basic Arrays
Positive integer dimensions, no gaps
Bounded or unbounded
Enhanced Arrays
“Shape” function
Supports irregular boundary
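The slides do not say what a shape function looks like. A minimal sketch, assuming it is a user-supplied predicate that decides which index pairs are inside the array (the ShapedArray class and the triangular example are hypothetical):

```python
from typing import Callable, Dict, Optional, Tuple


class ShapedArray:
    """2-D array whose valid cells are decided by a user-supplied shape function."""

    def __init__(self, shape: Callable[[int, int], bool]):
        self.shape = shape
        self.cells: Dict[Tuple[int, int], float] = {}

    def _check(self, idx: Tuple[int, int]) -> None:
        if not self.shape(*idx):
            raise IndexError(f"cell {idx} is outside the array's shape")

    def __setitem__(self, idx: Tuple[int, int], value: float) -> None:
        self._check(idx)
        self.cells[idx] = value

    def __getitem__(self, idx: Tuple[int, int]) -> Optional[float]:
        self._check(idx)
        return self.cells.get(idx)  # None for a valid but empty cell


# Irregular boundary: a lower-triangular region of a 4 x 4 index space.
tri = ShapedArray(lambda i, j: 1 <= j <= i <= 4)
tri[3, 2] = 1.5
```

Writing to tri[2, 3] raises IndexError because that cell lies outside the triangular boundary.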
Enhanced Arrays
Co-ordinate systems
User defined functions that map integers to something else
E.g. mercator
Use dimension notation to access, e.g. A[17,36] or A{468.2, 917.6}
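The two access notations can be mimicked in Python as follows. This is a sketch under stated assumptions: the CoordArray class, its at method (standing in for the A{...} notation), and the linear grid mapping are all invented for illustration, and a simple linear mapping is used in place of a real projection such as Mercator.

```python
from typing import Callable, Dict, Tuple


class CoordArray:
    """Integer-indexed array with a user-defined co-ordinate system on top."""

    def __init__(self, to_index: Callable[[float, float], Tuple[int, int]]):
        self.to_index = to_index  # maps co-ordinates to integer dimensions
        self.cells: Dict[Tuple[int, int], float] = {}

    def __setitem__(self, idx: Tuple[int, int], value: float) -> None:
        self.cells[idx] = value

    def __getitem__(self, idx: Tuple[int, int]) -> float:
        return self.cells[idx]  # plays the role of A[17, 36]

    def at(self, x: float, y: float) -> float:
        return self.cells[self.to_index(x, y)]  # plays the role of A{x, y}


def linear_grid(x: float, y: float) -> Tuple[int, int]:
    # Hypothetical mapping: origin at 100.0, one cell every 0.5 units.
    return (round((x - 100.0) / 0.5), round((y - 100.0) / 0.5))


A = CoordArray(linear_grid)
A[17, 36] = 42.0
same_cell = A.at(108.5, 118.0)  # resolves to integer index (17, 36)
```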
SciDB Query Language
“Parse-tree” representation of array operations
With a “binding” to:
MatLab
C++
Python
IDL
There may be more….
User extendable operations (Postgres-style)
Operations
Standard relational ones (filter, join)
Plus whatever you want (regrid, interpolate, fourier transform, eigenvalues, …)
Plus add your own (Postgres-style)
We need science input here!!!
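To show why regrid is an array operation rather than a join, here is a small pure-Python sketch. The slides do not define regrid, so this assumes the common meaning of coarsening an array by aggregating fixed-size blocks of cells (averaging here); the function name and signature are illustrative only.

```python
from statistics import mean
from typing import List


def regrid(a: List[List[float]], fy: int, fx: int) -> List[List[float]]:
    """Coarsen a 2-D array by averaging each fy-by-fx block of cells."""
    rows, cols = len(a), len(a[0])
    if rows % fy or cols % fx:
        raise ValueError("array dimensions must be multiples of the block size")
    out = []
    for r in range(0, rows, fy):
        out_row = []
        for c in range(0, cols, fx):
            # Gather the block whose top-left corner is (r, c).
            block = [a[i][j] for i in range(r, r + fy) for j in range(c, c + fx)]
            out_row.append(mean(block))
        out.append(out_row)
    return out


grid = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
coarse = regrid(grid, 2, 2)  # 4 x 4 array reduced to 2 x 2
```

The operation depends on cell adjacency, which is implicit in array indices but has to be reconstructed from key columns when the array is simulated on top of tables.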
Environment and Storage
Extendable grid (cloud) of Linux machines
With built-in high availability and failover
And built-in disaster recovery
In Situ Processing
Operate on data without loading it
Supported by a SciDB self-describing file format
And some number of adaptors, e.g. HDF-5, NetCDF
Or write your own
Storage Model
Arrays are “chunked” in storage
Chunk size can vary
Chunks are partitioned across the grid
Go for scalability to petabytes
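A rough sketch of chunking and partitioning, under several assumptions not in the slides: chunks have a fixed shape (the slides say chunk size can vary), indices are zero-based for simplicity, and chunks are assigned to nodes by a simple modulo rule. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, ...]


def chunk_of(cell: Cell, chunk_shape: Tuple[int, ...]) -> Cell:
    """Return the co-ordinates of the chunk that holds a cell."""
    return tuple(i // s for i, s in zip(cell, chunk_shape))


def node_of(chunk: Cell, n_nodes: int) -> int:
    """Assign a chunk to a grid node (assumed rule: sum of chunk co-ordinates mod n)."""
    return sum(chunk) % n_nodes


def place(cells: List[Cell], chunk_shape: Tuple[int, ...], n_nodes: int) -> Dict[int, List[Cell]]:
    """Group cells by the node that stores their chunk."""
    placement: Dict[int, List[Cell]] = {n: [] for n in range(n_nodes)}
    for cell in cells:
        placement[node_of(chunk_of(cell, chunk_shape), n_nodes)].append(cell)
    return placement


layout = place([(17, 36), (18, 37), (3, 4), (25, 0)], chunk_shape=(10, 10), n_nodes=4)
```

Neighbouring cells such as (17, 36) and (18, 37) land in the same chunk and therefore on the same node, which keeps local array operations local.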
Other Features Which Science Guys Want
(These could be in RDBMS, but aren’t)
Uncertainty
Data has error bars
Which must be carried along in the computation
(interval arithmetic)
Will look at more sophisticated error models later
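Carrying error bars through a computation with interval arithmetic can be sketched as below. The Interval class and the measured helper are illustrative names, not part of SciDB, and the sketch ignores the outward rounding that a careful floating-point implementation would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other: "Interval") -> "Interval":
        # Smallest result pairs our low end with the other's high end.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other: "Interval") -> "Interval":
        products = [
            self.lo * other.lo, self.lo * other.hi,
            self.hi * other.lo, self.hi * other.hi,
        ]
        return Interval(min(products), max(products))


def measured(value: float, error: float) -> Interval:
    """Turn a value with a symmetric error bar into an interval."""
    return Interval(value - error, value + error)


a = measured(10.0, 1.0)   # 10.0 +/- 1.0
b = measured(5.0, 0.5)    # 5.0 +/- 0.5
total = a + b
difference = a - b
product = a * b
```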
Other Features
Provenance (lineage)
What calibration generated the data
What was the “cooking” algorithm
In general – repeatability of data derivation
Supported by a command log
with query facilities (interesting research problem)
And redo
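A command log with lineage queries and redo could look roughly like this. It is a toy sketch, not the SciDB design: the Store class, its methods, and the two sample operations are assumptions, and it presumes each derived name is written only once.

```python
from typing import Callable, Dict, List, Tuple

LogEntry = Tuple[str, str, Tuple[str, ...]]  # (output, operation, inputs)


class Store:
    def __init__(self, ops: Dict[str, Callable]):
        self.ops = ops
        self.data: Dict[str, object] = {}
        self.log: List[LogEntry] = []

    def load(self, name: str, value: object) -> None:
        self.data[name] = value

    def run(self, output: str, op_name: str, *inputs: str) -> None:
        """Execute an operation and record it in the command log."""
        self.data[output] = self.ops[op_name](*(self.data[i] for i in inputs))
        self.log.append((output, op_name, inputs))

    def lineage(self, name: str) -> List[LogEntry]:
        """Query the log: which commands does `name` depend on, oldest first?"""
        needed, found = {name}, []
        for output, op_name, inputs in reversed(self.log):
            if output in needed:
                found.append((output, op_name, inputs))
                needed.update(inputs)
        return list(reversed(found))

    def redo(self, raw: Dict[str, object]) -> "Store":
        """Replay the whole log against (possibly corrected) raw inputs."""
        fresh = Store(self.ops)
        for name, value in raw.items():
            fresh.load(name, value)
        for output, op_name, inputs in self.log:
            fresh.run(output, op_name, *inputs)
        return fresh


ops = {
    "calibrate": lambda xs: [x * 2 for x in xs],  # stand-in "cooking" step
    "total": lambda xs: sum(xs),
}
store = Store(ops)
store.load("raw", [1, 2, 3])
store.run("cal", "calibrate", "raw")
store.run("sum", "total", "cal")
rerun = store.redo({"raw": [1, 1, 1]})
```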
Other Features
Time travel
Don’t fix errors by overwrite
I.e. keep all of the data
Supported by an extra array dimension (history)
Spatial support
Named versions
Recalibration usually handled this way
Supported by allocating an array for the new
version and “diffing” against its parent
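Storing a named version as a diff against its parent can be sketched as follows. The VersionedArray class and its method names are hypothetical, the arrays are modelled as dictionaries of cells, and the sketch assumes a new version never removes cells from its parent.

```python
from typing import Dict, Optional, Tuple

Cell = Tuple[int, int]
_MISSING = object()  # marks a cell that is absent from the parent


class VersionedArray:
    def __init__(self, base: Dict[Cell, float]):
        # version name -> (parent name, cells that differ from the parent)
        self.versions: Dict[str, Tuple[Optional[str], Dict[Cell, float]]] = {
            "base": (None, dict(base)),
        }

    def commit(self, name: str, parent: str, cells: Dict[Cell, float]) -> None:
        """Store a new version, keeping only the cells that changed."""
        parent_cells = self.materialize(parent)
        delta = {k: v for k, v in cells.items() if parent_cells.get(k, _MISSING) != v}
        self.versions[name] = (parent, delta)

    def materialize(self, name: str) -> Dict[Cell, float]:
        """Rebuild a full version by applying diffs down the parent chain."""
        parent, delta = self.versions[name]
        cells = {} if parent is None else self.materialize(parent)
        cells.update(delta)
        return cells


original = {(1, 1): 10.0, (1, 2): 20.0, (2, 1): 30.0}
recalibrated = {(1, 1): 10.0, (1, 2): 21.5, (2, 1): 30.0}

arr = VersionedArray(original)
arr.commit("recal", "base", recalibrated)
stored_delta = arr.versions["recal"][1]
```

Only the one changed cell is stored for the recalibrated version, while the parent remains readable in full, in keeping with the "don't fix errors by overwrite" rule.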
Other Features
(Optionally) integration of the real time data
capture system
“cooking” inside DBMS
Makes provenance capture easier
Sometimes important
Time Line
Q4/08 start company, begin research activities
Late 2009
Demoware available
Late 2010
V1 ships
Project Organization (Build-it for real)
CEO (Andy Palmer -- Vertica)
Project management (Bobbi Heath -- Vertica)
CTO (Stonebraker)
Project Organization (Design and Research)
Overall co-ordination (Stonebraker, DeWitt)
Storage and execution (Madden, Cetintemel)
Query layer and semantics (Zdonik, Maier)
Provenance (Widom, Patel)
Resource management (Balazinska)
Language bindings (Carey)
SciDB Has a Good Chance at Success
Community realizes shared infrastructure is good
“Lighthouse” customers
Strong team
Computation goes inside the DBMS
Easier to share
And reuse
How Can You Help?
Get involved!!!!