SciDB An Open Source Data Base Project by Michael Stonebraker (and others)

Page 1: SciDB

An Open Source Data Base Project

by Michael Stonebraker (and others)

Page 2: Outline

Why science folks are unhappy with RDBMS

How we plan to fix that

The details

Page 3: Why SciDB?

“Big science” very unhappy with RDBMS

Astronomy

HEP

Fusion

Bio

Remote sensing

Page 4: Why?

Experience of Sequoia 2000 (mid 1990s)

Tried to use Postgres for science databases

Failed badly…

Main science data type is an array – horribly inefficient to simulate arrays on top of tables

Required features absent (provenance, uncertainty, version control)

SQL operations wrong (regrid – not join)

Page 5: Why SciDB?

Net result

Mentality of “roll your own from the ground up” for every new science project

Realization by the science community that this is long-term suicide

Community wants to get behind something better

Great commonality of needs among domains

Page 6: A Little Context

XLDB-1 Genesis of the need

Asilomar conference (March 2008)

Small conference to generate requirements

Page 7: A Little Context

March 2008 – September 2008

Initial design completed

Fund raising

Recruiting of initial team

Detailed use cases specified

Page 8: Our Partnership

Science and high-end commercial folks

Who will put up some resources

And review design

DBMS brain trust

Who will design the system, oversee its construction, and perform needed research

Non-profit company

Which will manage the open source project

And support the resulting system

May need long term funding help

Page 9: Partners – Science (We are recruiting more…)

LSST astronomy project

DBMS work co-ordinated by SLAC

Pacific Northwest National Laboratory (PNNL)

Various bio projects

Lawrence Livermore National Laboratory

Fusion projects

UCSB

Remote sensing

Page 10: Partners -- DBMS

Mike Stonebraker (MIT)

Dave DeWitt (Wisconsin -> Microsoft)

Jignesh Patel (Wisconsin)

Jennifer Widom (Stanford)

Dave Maier (Portland State)

Stan Zdonik (Brown)

Sam Madden (MIT)

Ugur Cetintemel (Brown)

Magda Balazinska (Washington)

Mike Carey (UCI)

Page 11: Partners -- Other

eBay

Vertica

Microsoft

LSST

SLAC

Will hit up NSF and DOE

Page 12: The SciDB Data Model

Nothing (e.g. Hadoop, Pig, Hive, …)?

Most of you have schemas

Hadoop is not a good starting point

Slow

No HA

Page 13: The SciDB Data Model

Tables?

Makes a few of you happy

Used by Sloan Sky Survey

But

PanStarrs (Alex Szalay) wants arrays and scalability

Page 14: The SciDB Data Model

Arrays?

Superset of tables (tables with a primary key are a 1-D array)

Makes HEP, remote sensing, astronomy, oceanography folks happy

But

Not biology and chemistry (who want networks and sequences)

Page 15: The SciDB Data Model

Multidimensional grids

Superset of arrays (non-uniform cells)

Makes solid modeling folks happy

But

Complex and slow

Page 16: SciDB Data Model

Nested multidimensional arrays

Array values are a tuple of values and arrays

Sightings (sid, details) [x, y, z, t]

Objects (type, [sid]) [id]
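
The two declarations above can be read as follows: each cell of Sightings holds a tuple (sid, details) and is addressed by four dimensions, while each cell of Objects holds a value plus a nested 1-D array of sids. Below is a minimal Python sketch of that reading. The class names `Sighting` and `SkyObject`, the sample values, and the use of dictionaries as sparse arrays are illustrative assumptions, not SciDB's storage layout or API.

```python
from dataclasses import dataclass, field


@dataclass
class Sighting:
    # One cell of Sightings (sid, details) [x, y, z, t]
    sid: int
    details: str


@dataclass
class SkyObject:
    # One cell of Objects (type, [sid]) [id]: a value plus a nested array of sids
    type: str
    sids: list = field(default_factory=list)


# Sparse stand-ins for the two arrays, keyed by their dimension coordinates.
sightings = {
    (10, 20, 0, 1): Sighting(sid=1, details="first pass"),
    (10, 21, 0, 2): Sighting(sid=2, details="second pass"),
}
objects = {7: SkyObject(type="star", sids=[1, 2])}


def sightings_of(obj_id):
    """Return the (x, y, z, t) coordinates of every sighting of one object."""
    wanted = set(objects[obj_id].sids)
    return sorted(c for c, s in sightings.items() if s.sid in wanted)
```

Calling `sightings_of(7)` follows the nested sid array from Objects back into Sightings.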

Page 17: Basic Arrays

Positive integer dimensions, no gaps

Bounded or unbounded

Page 18: Enhanced Arrays

“Shape” function

Supports irregular boundary
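
One way to read the shape function is as a predicate that says which cells of a bounding box actually exist, so the array boundary need not be rectangular. The sketch below assumes that reading. The function names `triangle_shape` and `valid_cells` are hypothetical and not part of any SciDB interface.

```python
def triangle_shape(i, j):
    """Hypothetical shape function: row i holds only cells 1..i (lower triangle)."""
    return 1 <= j <= i


def valid_cells(shape, n):
    """Enumerate the cells of an n-by-n bounding box that the shape admits."""
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(1, n + 1)
            if shape(i, j)]


cells = valid_cells(triangle_shape, 4)
```

Here `(3, 2)` is inside the array and `(2, 3)` is outside it, even though both fall within the same 4-by-4 bounding box.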

Page 19: Enhanced Arrays

Co-ordinate systems

User defined functions that map integers to something else

E.g. Mercator

Use dimension notation to access, e.g. A[17,36] or A{468.2, 917.6}
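
The two access forms can be sketched in Python as integer indexing plus a lookup that first passes the coordinates through a user-defined mapping. This is a sketch under stated assumptions: the `CoordArray` class, its `at` method, and the linear mapping `linear_grid` (with its origin and step values) are invented for illustration. A real Mercator mapping would replace `linear_grid`.

```python
class CoordArray:
    """Array with integer dimensions plus a user-defined coordinate mapping."""

    def __init__(self, cells, to_index):
        self.cells = cells          # {(i, j): value}, integer dimensions
        self.to_index = to_index    # maps coordinates to integer indices

    def __getitem__(self, ij):
        # Integer access, like A[17,36] on the slide.
        return self.cells[ij]

    def at(self, x, y):
        # Coordinate access, like A{468.2, 917.6} on the slide.
        return self.cells[self.to_index(x, y)]


def linear_grid(x, y, origin=(400.0, 800.0), step=(4.0, 3.25)):
    """Assumed mapping: index = (coordinate - origin) / step, rounded to nearest."""
    return (round((x - origin[0]) / step[0]),
            round((y - origin[1]) / step[1]))


A = CoordArray({(17, 36): 42.0}, linear_grid)
```

With this mapping, `A[17, 36]` and `A.at(468.2, 917.6)` address the same cell.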

Page 20: SciDB Query Language

“Parse-tree” representation of array operations

With a “binding” to:

MatLab

C++

Python

IDL

There may be more….

User extendable operations (Postgres-style)

Page 21: Operations

Standard relational ones (filter, join)

Plus whatever you want (regrid, interpolate, Fourier transform, eigenvalues, …)

Plus add your own (Postgres-style)

We need science input here!!!
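
The slides name regrid without defining it. One common meaning is coarsening a grid by aggregating blocks of neighbouring cells, which is awkward to express as a relational join. The sketch below assumes that meaning and uses averaging as the aggregate. The function `regrid` and its `factor` parameter are illustrative, not SciDB's operator signature.

```python
def regrid(grid, factor):
    """Coarsen a 2-D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            # Collect the cells of one block, clipped at the grid edge.
            block = [grid[i][j]
                     for i in range(r, min(r + factor, rows))
                     for j in range(c, min(c + factor, cols))]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


fine = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
coarse = regrid(fine, 2)  # [[3.5, 5.5], [11.5, 13.5]]
```

The operation depends on cell adjacency, which an array model exposes directly through its dimensions.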

Page 22: Environment and Storage

Extendable grid (cloud) of Linux machines

With built-in high availability and failover

And built-in disaster recovery

Page 23: In Situ Processing

Operate on data without loading it

Supported by a SciDB self-describing file format

And some number of adaptors, e.g. HDF-5, NetCDF

Or write your own

Page 24: Storage Model

Arrays are “chunked” in storage

Chunk size can vary

Chunks are partitioned across the grid

Go for scalability to petabytes
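
A rough sketch of chunking and partitioning, under assumptions the slides do not state: chunks have a fixed shape (the slides say chunk size can vary), chunk coordinates are zero-based, and chunks are assigned to nodes by a CRC32 hash of their coordinates. The names `chunk_of` and `node_of` are hypothetical.

```python
import zlib


def chunk_of(cell, chunk_shape):
    """Map a cell's integer coordinates to the coordinates of its chunk."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))


def node_of(chunk, n_nodes):
    """Assign a chunk to a node by hashing its coordinates (assumed policy)."""
    key = ",".join(map(str, chunk)).encode()
    return zlib.crc32(key) % n_nodes


chunk = chunk_of((1050, 37), chunk_shape=(1000, 10))  # (1, 3)
node = node_of(chunk, n_nodes=4)
```

Every cell in the same chunk lands on the same node, so operations on neighbouring cells mostly stay local to one machine.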

Page 25: Other Features Which Science Guys Want (These could be in RDBMS, but aren’t)

Uncertainty

Data has error bars

Which must be carried along in the computation (interval arithmetic)

Will look at more sophisticated error models later
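
Interval arithmetic carries error bars through a computation by tracking a lower and an upper bound for every value. Below is a minimal sketch. The `Interval` class and the sample values are illustrative assumptions, and only addition, subtraction, and multiplication are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Smallest result pairs our low with their high, and vice versa.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # Signs may differ, so take the extremes of all four corner products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


flux = Interval(9.0, 11.0)   # 10 +/- 1
gain = Interval(1.5, 2.5)    # 2 +/- 0.5
calibrated = flux * gain     # Interval(lo=13.5, hi=27.5)
total = flux + gain          # Interval(lo=10.5, hi=13.5)
```

The bounds widen as operations compound, which is one reason the slide defers more sophisticated error models to later work.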

Page 26: Other Features

Provenance (lineage)

What calibration generated the data

What was the “cooking” algorithm

In general – repeatability of data derivation

Supported by a command log with query facilities (interesting research problem)

And redo
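
A sketch of how a command log could support both lineage queries and redo, assuming each command is a pure function applied to named input arrays. The `CommandLog` class, its methods, and the `calibrate` and `cook` functions are hypothetical. The slides describe this as an open research problem, so this shows only the basic idea.

```python
class CommandLog:
    """Append-only log of the commands that derived each array."""

    def __init__(self):
        self.entries = []   # (output name, function, input names)

    def run(self, store, output, func, *inputs):
        # Execute the command and record how the output was produced.
        store[output] = func(*(store[name] for name in inputs))
        self.entries.append((output, func, inputs))

    def lineage(self, name):
        """Return the names of every array that `name` was derived from."""
        found = []
        for output, _, inputs in self.entries:
            if output == name:
                for src in inputs:
                    found.append(src)
                    found.extend(self.lineage(src))
        return found

    def redo(self, store):
        """Replay every logged command, in order, against `store`."""
        for output, func, inputs in self.entries:
            store[output] = func(*(store[name] for name in inputs))


def calibrate(raw):
    return [v * 2 for v in raw]


def cook(cal):
    return sum(cal)


store = {"raw": [1, 2, 3]}
log = CommandLog()
log.run(store, "calibrated", calibrate, "raw")
log.run(store, "cooked", cook, "calibrated")
```

`log.lineage("cooked")` answers which calibration and raw data produced a result, and `log.redo(...)` repeats the same derivation on another copy of the raw data.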

Page 27: Other Features

Time travel

Don’t fix errors by overwrite

I.e. keep all of the data

Supported by an extra array dimension (history)

Spatial support

Named versions

Recalibration usually handled this way

Supported by allocating an array for the new version and “diffing” against its parent
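
The named-version idea can be sketched as a child array that stores only the cells that differ from its parent and falls back to the parent for everything else. The `Version` class and the sample cells are illustrative assumptions, not SciDB's implementation.

```python
class Version:
    """A named version stored as a delta against its parent version."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.delta = {}   # only the cells that differ from the parent

    def write(self, cell, value):
        self.delta[cell] = value

    def read(self, cell):
        # Walk up the parent chain until some version holds the cell.
        version = self
        while version is not None:
            if cell in version.delta:
                return version.delta[cell]
            version = version.parent
        raise KeyError(cell)


base = Version("raw")
base.write((1, 1), 10.0)
base.write((1, 2), 20.0)

recal = Version("recalibrated", parent=base)
recal.write((1, 2), 21.5)   # only the changed cell is stored
```

Reading through `recal` sees the recalibrated cell, while `base` still returns the original value, so nothing is overwritten.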

Page 28: Other Features

(Optionally) integration of the real time data capture system

“cooking” inside DBMS

Makes provenance capture easier

Sometimes important

Page 29: Time Line

Q4/08 start company, begin research activities

Late 2009

Demoware available

Late 2010

V1 ships

Page 30: Project Organization (Build-it for real)

CEO (Andy Palmer -- Vertica)

Project management (Bobbi Heath -- Vertica)

CTO (Stonebraker)

Page 31: Project Organization (Design and Research)

Overall co-ordination (Stonebraker, DeWitt)

Storage and execution (Madden, Cetintemel)

Query layer and semantics (Zdonik, Maier)

Provenance (Widom, Patel)

Resource management (Balazinska)

Language bindings (Carey)

Page 32: SciDB Has a Good Chance at Success

Community realizes shared infrastructure is good

“Lighthouse” customers

Strong team

Computation goes inside the DBMS

Easier to share

And reuse

Page 33: How Can You Help?

Get involved!!!!