Rise of the scientific database

Slides from the talk, "Rise of the Scientific Database" at Strata 2012 (Santa Clara).

Rise of the Scientific Database

John A. De Goes, @jdegoes

Agenda

• Scientific Computing & Databases

• Blessing / Curse of the RDBMS

• Power of the Array

• Scientific Databases

• Hadoop

• Summary & Conclusions

What is Scientific Computing?

"Scientific computing is concerned with constructing mathematical models and quantitative analysis techniques and using computers to analyze and solve scientific problems."

—Wikipedia

[Timeline slide. Axis: 1940's, 1950's, 1960's, 1970's, 1980's, 1990's, 2000's, 2010's, The Future]

Milestones along the timeline, in slide order:

• Finite element methods
• Numeric linear algebra
• Linear programming
• Monte Carlo
• Finite differences
• Fortran
• Modern numerical linear algebra
• Gradient methods
• Finite difference for PDEs
• Stable SVD algorithms
• Iterative methods
• Stable pseudoinverses
• FFT
• APL invented
• SAS released
• LINPACK
• MATLAB
• Conjugate gradient
• Poisson solvers
• Large-scale eigenvalue solvers
• GNU Octave
• Python
• SPSS
• J
• LAPACK
• Mathematica
• SciLab
• SciPy
• PDL
• Rasdaman
• NumPy
• Hadoop
• Mahout
• HPCC
• CUDA
• OpenCL
• BrookGPU
• Julia
• Spark
• MLBase
• SciDB
• MonetDB / SciQL
• ???

What is a Database?

"A technology that combines the ability to store data with a high-level, high-performance means of storing, retrieving, and manipulating that data without having to write code or have knowledge of the mechanisms of implementation."

[Timeline slide. Axis: 1960's, 1970's, 1980's, 1990's, 2000's, 2010's, The Future]

Milestones along the timeline, in slide order:

• CODASYL
• IMS
• SABRE
• Relational Model
• Ingres (QUEL)
• System R (SEQUEL)
• SQL/DBS
• DBS2
• Oracle
• "RDBMS"
• SQL wins
• DB2
• DBase
• SQL Server
• Other solutions
• ODBMS
• MySQL
• PostgreSQL
• MongoDB
• CouchDB
• Riak
• Neo4j
• Julia
• Spark
• MLBase
• SciDB
• MonetDB / SciQL
• ???

The Relationship between Scientific Computing & Databases

Scientific Computing

Data Analysis

Scientific Databases

The Database Landscape

[Figure: grid of database categories]

Columns: Operational, Analytical, Scientific

Rows: Structured, Semi-structured, Unstructured

Workload labels: gets & puts, sums & counts, data analysis

Cell labels: 1970, 1980, 2000, 2000, 2005, ?, ?, ?, ?

Relational Algebra

• Projection: π
• Selection: σ
• Rename: ρ
• Natural join: R ⋈ S
• Theta join: R ⋈θ S
• Semijoin: R ⋉ S
• Antijoin: R ▷ S
• Division: R ÷ S
• Left outer join: R ⟕ S
• Right outer join: R ⟖ S
• Full outer join: R ⟗ S
• Aggregation: $G_1, G_2, \ldots, G_m \; g_{f_1(A_1'), f_2(A_2'), \ldots, f_k(A_k')}(r)$
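A few of these operators can be sketched in plain Python to show that they work on sets of rows, with no notion of position or order. This is only an illustration: the list-of-dicts representation, the function names, and the sample data are all assumptions, not something from the talk.

```python
# Relations are modeled as lists of dicts. This representation and the
# sample data below are assumptions made for illustration only.

def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Projection: keep only the named attributes, dropping duplicate rows."""
    seen = []
    for row in relation:
        reduced = {a: row[a] for a in attributes}
        if reduced not in seen:
            seen.append(reduced)
    return seen

def natural_join(r, s):
    """Natural join: combine rows that agree on all shared attributes."""
    result = []
    for left in r:
        for right in s:
            shared = set(left) & set(right)
            if all(left[a] == right[a] for a in shared):
                result.append({**left, **right})
    return result

def aggregate(relation, group_by, attribute, func):
    """Aggregation: apply func to one attribute within each group."""
    groups = {}
    for row in relation:
        key = tuple(row[g] for g in group_by)
        groups.setdefault(key, []).append(row[attribute])
    return {key: func(values) for key, values in groups.items()}

# Hypothetical sample relations.
employees = [
    {"name": "Ada", "dept": "math", "salary": 120},
    {"name": "Bob", "dept": "math", "salary": 100},
    {"name": "Cy", "dept": "bio", "salary": 90},
]
departments = [
    {"dept": "math", "building": "A"},
    {"dept": "bio", "building": "B"},
]

joined = natural_join(employees, departments)
math_names = project(select(joined, lambda row: row["dept"] == "math"), ["name"])
totals = aggregate(employees, ["dept"], "salary", sum)
```

Every operator takes relations in and hands a relation (or a grouped summary) back, which is what makes them composable. None of them can express "the element at position i", which is the gap the following slides turn to.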

The Curse of RDBMS

Sets (rows)

Tuples (columns)

???

The Curse of RDBMS

Sets (rows)

Tuples (columns)

Arrays

The Power of the Array

• Linear Algebra

• Transforms (Fourier, wavelet, etc.)

• Spatial Analysis

• Temporal Analysis

• Etc.

Poor Man’s Arrays

-- Matrix product of X and Y, each stored as (row, col, value) triples
SELECT X.row AS row, Y.col AS col,
       SUM(X.value * Y.value) AS value
FROM X, Y
WHERE X.col = Y.row
GROUP BY X.row, Y.col
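To see the query at work, here is a minimal sketch that runs it in an in-memory SQLite database from Python. The helper names (`load_matrix`, `sql_matmul`), the output aliases `r` and `c`, the quoting of `"row"`, and the added `ORDER BY` are assumptions for this example. The join condition used is `X.col = Y.row`, which is what matrix multiplication requires.

```python
import sqlite3

# The slide's query, with "row" quoted because ROW is a keyword in some
# SQL dialects, and an ORDER BY added so the output order is stable.
MATMUL_SQL = '''
SELECT X."row" AS r, Y.col AS c, SUM(X.value * Y.value) AS value
FROM X, Y
WHERE X.col = Y."row"
GROUP BY X."row", Y.col
ORDER BY X."row", Y.col
'''

def load_matrix(conn, table, matrix):
    """Store a dense matrix (list of lists) as (row, col, value) triples."""
    conn.execute(f'CREATE TABLE {table} ("row" INTEGER, col INTEGER, value REAL)')
    triples = [(i, j, v) for i, line in enumerate(matrix) for j, v in enumerate(line)]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", triples)

def sql_matmul(x, y):
    """Multiply two matrices by running the join-and-aggregate query."""
    conn = sqlite3.connect(":memory:")
    load_matrix(conn, "X", x)
    load_matrix(conn, "Y", y)
    product = [[0.0] * len(y[0]) for _ in x]
    for r, c, value in conn.execute(MATMUL_SQL):
        product[r][c] = value
    conn.close()
    return product

result = sql_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The matrices have to be flattened into triples on the way in and reassembled on the way out, because the relational engine has no native notion of a two-dimensional array.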

Poor Man’s Arrays

-- Running total of sales, accumulated from the highest seller down, via a self-join
SELECT A.name, A.sales, SUM(B.sales) AS running_total
FROM Sales AS A, Sales AS B
WHERE A.sales < B.sales
   OR (A.sales = B.sales AND A.name = B.name)
GROUP BY A.name, A.sales
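For comparison, the sketch below runs the self-join in SQLite and then computes the same running total the way an array-oriented tool would: sort once, then take a cumulative sum. The sample rows, the added `ORDER BY`, and the variable names are assumptions for this example, and the data deliberately has no ties in `sales`.

```python
import sqlite3
from itertools import accumulate

# The slide's self-join, with an ORDER BY added so the rows come back
# from the highest seller down.
RUNNING_TOTAL_SQL = """
SELECT A.name, A.sales, SUM(B.sales) AS running_total
FROM Sales AS A, Sales AS B
WHERE A.sales < B.sales
   OR (A.sales = B.sales AND A.name = B.name)
GROUP BY A.name, A.sales
ORDER BY A.sales DESC
"""

# Hypothetical sample data with distinct sales figures.
rows = [("east", 30), ("north", 10), ("south", 20), ("west", 40)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (name TEXT, sales INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?)", rows)
via_sql = conn.execute(RUNNING_TOTAL_SQL).fetchall()
conn.close()

# Array view: sort once, then take a cumulative sum over the ordered values.
ordered = sorted(rows, key=lambda r: r[1], reverse=True)
totals = accumulate(sales for _, sales in ordered)
via_array = [(name, sales, total) for (name, sales), total in zip(ordered, totals)]
```

Both routes give the same answer here. The self-join pairs every row with every other row to recover an ordering that an array has built in, while the cumulative sum is a single pass over ordered data.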


What is a Scientific Database?

• First-class support for multidimensional arrays

  • Creation

  • Manipulation

  • Composition

• Capable of expressing whole analyses, not just snippets

• Tremendous benefits across multiple dimensions

  • Scalability & Performance

  • Expressiveness & Usability

  • Robustness & Accuracy

Array Algebra

• Many different approaches (NRCA, SciQL, AFL, ODMG, etc.)

• Possible to define as extensions to relational core (but not necessary)

• Most approaches share common core

  • Array deconstruction

  • Array construction

  • Array reduction
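One possible reading of the three core operations, sketched on nested Python lists. The function names and the exact semantics are assumptions for illustration. They are not the definitions used by NRCA, SciQL, AFL, or ODMG.

```python
def construct(shape, cell):
    """Construction: tabulate a 2-D array from a function of its indices."""
    rows, cols = shape
    return [[cell(i, j) for j in range(cols)] for i in range(rows)]

def deconstruct(array):
    """Deconstruction: expose an array as ((i, j), value) pairs."""
    return [((i, j), v) for i, line in enumerate(array) for j, v in enumerate(line)]

def reduce_axis(array, axis, func):
    """Reduction: collapse one axis with func (0 = down columns, 1 = across rows)."""
    if axis == 0:
        return [func(column) for column in zip(*array)]
    return [func(line) for line in array]

grid = construct((2, 3), lambda i, j: 10 * i + j)
cells = deconstruct(grid)
row_sums = reduce_axis(grid, 1, sum)
col_max = reduce_axis(grid, 0, max)

# Composition: a transpose is a deconstruction followed by a construction.
lookup = dict(cells)
transposed = construct((3, 2), lambda i, j: lookup[(j, i)])
```

With these three pieces, operations such as transposition, slicing, and aggregation along an axis can be written as compositions rather than as special cases.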

Scientific Databases

Rasdaman, SciDB, MonetDB (+SciQL)

What About Hadoop?

• Commonly used in scientific computing

• No scientific database technology

• But many useful programming libraries

  • Hama

  • Mahout

  • Cascading

• Hadoop doesn’t make it easy

  • YARN should help (Tez?)

  • Balancing needs help

• Not the only game in town anymore (BDAS, MPI-2, HPCC, etc.)

Conclusions

• Scientific computing can benefit from a scientific database

• Success of RDBMS was also a curse

• NoSQL and big data are catalysts for disruption

• Still early for scientific databases

• Hadoop loves/hates science

Resources

John A. De Goes, @jdegoes

SciDB / Array Functional Language: http://bit.ly/VdXJkA

Rasdaman / rasql: http://en.wikipedia.org/wiki/Rasdaman

MonetDB / SciQL: http://monetdb.org

Precog / Quirrel: http://precog.com

Query Language for Multidimensional Arrays: Design, Implementation, & Optimization Techniques
