In-Database Predictive Analytics


John A. De Goes

@jdegoes, [email protected]


Agenda

• Introduction
• Abusing SQL
• Painful by Design
• Database Extensions
• MADlib
• Other Approaches
• Summary

Introduction

In-Database Predictive Analytics

In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.

Traditional Predictive Analytics

[Diagram: database, R, SAS]

Data bottleneck: painful, slow

[Diagram: database, R, SAS]

What’s the answer?

“MapReduce”

Move the Code, not the Data!

Advanced Analytics

Let’s Do K-Means in SQL!

Abusing SQL

General Approach in RDBMS

[Diagram: Driver and Database, linked by arrows labeled "SQL" and "Feedback"]
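One way to read this slide: a driver program outside the database issues SQL, reads back a small piece of feedback, and decides whether to run another pass. Below is a minimal sketch of that loop in Python, using the standard-library `sqlite3` module as a stand-in database. The `state` table, the `run_driver` helper, and the toy square-root update are hypothetical illustrations and are not taken from the talk.

```python
import sqlite3

def run_driver(conn, step_sql, feedback_sql, max_iterations=20, tolerance=1e-9):
    """Repeat step_sql until the feedback value stops changing (hypothetical helper)."""
    previous = None
    for iteration in range(1, max_iterations + 1):
        conn.execute(step_sql)                               # the work runs inside the database
        (feedback,) = conn.execute(feedback_sql).fetchone()  # only one number travels back
        if previous is not None and abs(feedback - previous) <= tolerance:
            return iteration, feedback
        previous = feedback
    return max_iterations, previous

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (x REAL)")
conn.execute("INSERT INTO state VALUES (16.0)")

# Toy iterative computation: Newton's method for the square root of 2.
iterations, value = run_driver(
    conn,
    step_sql="UPDATE state SET x = 0.5 * (x + 2.0 / x)",
    feedback_sql="SELECT x FROM state",
)
print(iterations, value)
```

The point of the pattern is that the data never leaves the database; the driver only sees the feedback value it needs to decide when to stop.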

Our Initial Model

Table: model

d          number of dimensions
k          number of clusters
n          number of points
iteration  number of iterations
avg_q      variance

Our Initial Data Set

Y (n rows): Y1, Y2, Y3, ...

Projection & Numbering

Y (n rows): Y1, Y2, Y3, ...
YH (n rows): i, Y1, ..., Yd (i = 1, 2, 3, 4, ..., n)

INSERT INTO YH
SELECT sum(1) over(rows unbounded preceding) AS i, Y1, Y2, ..., Yd
FROM Y;

Flattening

YH (n rows): i, Y1, ..., Yd
YV (n x d rows): i, l, val (i = 1, ..., n; l = 1, ..., d)

INSERT INTO YV SELECT i, 1, Y1 FROM YH;
...
INSERT INTO YV SELECT i, d, Yd FROM YH;
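The numbering and flattening steps can be hard to picture from the SQL alone. Here is a small sketch in plain Python of the same reshaping: each point gets a row number i, then every coordinate becomes its own (i, l, val) row. The function names and the three sample points are made up for illustration.

```python
def number_rows(points):
    """Y -> YH: prepend a 1-based row number i to every point."""
    return [(i, *point) for i, point in enumerate(points, start=1)]

def flatten(yh):
    """YH -> YV: one (i, l, val) row per point i and dimension l."""
    return [
        (row[0], l, val)
        for row in yh
        for l, val in enumerate(row[1:], start=1)
    ]

Y = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]  # n = 3 points, d = 2 dimensions
YH = number_rows(Y)
YV = flatten(YH)

for row in YV:
    print(row)
```

With n = 3 and d = 2 this yields n x d = 6 rows. The SQL version fills YV one dimension at a time, so its insertion order differs, but the resulting set of rows is the same.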

Initializing k Cluster Centers

YH (n rows): i, Y1, ..., Yd
CH (k rows): j, Y1, ..., Yd (j = 1, 2, 3, 4, ..., k)

INSERT INTO CH SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;
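As a rough sketch of the same initialization outside SQL, the snippet below picks k points at random and numbers them 1..k as starting centers. It is an approximation, not a translation: `random.sample` never returns the same row twice, whereas k independent `SAMPLE 1` statements may. The `init_centers` name and the `seed` parameter are assumptions added for reproducibility.

```python
import random

def init_centers(yh, k, seed=0):
    """YH -> CH: choose k distinct rows at random and relabel them j = 1..k."""
    rng = random.Random(seed)
    chosen = rng.sample(yh, k)
    # Drop the point number i and replace it with the cluster number j.
    return [(j, *row[1:]) for j, row in enumerate(chosen, start=1)]

YH = [(1, 0.0, 0.0), (2, 2.0, 0.0), (3, 10.0, 10.0), (4, 11.0, 9.0)]
CH = init_centers(YH, k=2)
print(CH)
```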

Flattening

CH (k rows): j, Y1, ..., Yd
C (d x k rows): l, j, val (l = 1, ..., d; j = 1, ..., k)

INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;

Computing Distances to Clusters

INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C
WHERE YV.l = C.l
GROUP BY i, j;

YD (n x k rows): i, j, dist (i = 1, ..., n; j = 1, ..., k)
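To see the distance query run end to end, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for the database on the slides. Two things are assumptions: the sample data, and the use of `(a - b) * (a - b)` in place of `**2`, because SQLite has no `**` operator. The result is the squared Euclidean distance; no square root is taken.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YV (i INTEGER, l INTEGER, val REAL);  -- flattened points
CREATE TABLE C  (l INTEGER, j INTEGER, val REAL);  -- flattened centroids
CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL); -- point-to-centroid distances
""")

# Two points: (0, 0) and (3, 4). Two centroids: (0, 0) and (3, 0).
conn.executemany("INSERT INTO YV VALUES (?, ?, ?)",
                 [(1, 1, 0.0), (1, 2, 0.0), (2, 1, 3.0), (2, 2, 4.0)])
conn.executemany("INSERT INTO C VALUES (?, ?, ?)",
                 [(1, 1, 0.0), (2, 1, 0.0), (1, 2, 3.0), (2, 2, 0.0)])

# Join on the dimension index l, then sum the squared differences per (i, j).
conn.execute("""
INSERT INTO YD
SELECT YV.i, C.j, sum((YV.val - C.val) * (YV.val - C.val))
FROM YV, C
WHERE YV.l = C.l
GROUP BY YV.i, C.j
""")

distances = {(i, j): dist for i, j, dist in conn.execute("SELECT i, j, dist FROM YD")}
print(distances)
```

Joining only on l pairs every point with every centroid, which is why YD has n x k rows.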

Computing Nearest Neighbors

INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist;

YNN (n rows): i, j (j = nearest cluster; i = 1, 2, 3, 4, 5, ..., n)
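The nearest-neighbor step is an "argmin by join": compute each point's minimum distance, then join back to recover which cluster produced it. A minimal runnable sketch with `sqlite3` follows; the distance values are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YD  (i INTEGER, j INTEGER, dist REAL);
CREATE TABLE YNN (i INTEGER, j INTEGER);
""")
conn.executemany("INSERT INTO YD VALUES (?, ?, ?)",
                 [(1, 1, 0.0), (1, 2, 9.0), (2, 1, 25.0), (2, 2, 16.0)])

# The subquery finds the smallest distance per point; the join recovers its cluster j.
# Note: a point exactly equidistant from two centroids would produce two rows here.
conn.execute("""
INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist
""")

nearest = dict(conn.execute("SELECT i, j FROM YNN ORDER BY i"))
print(nearest)
```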

Count Points Per Cluster

INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j;
UPDATE W SET w = w/model.n;

Compute New Centroids

INSERT INTO C
SELECT l, j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j;
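The two aggregation steps, cluster weights and new centroids, can be tried out with `sqlite3` as a stand-in database. This sketch makes a few assumptions that are not on the slides: the sample data, a scalar subquery in place of the bare `model.n` reference, `w * 1.0` to avoid integer division, and a `DELETE FROM C` so that old centroids do not linger next to the new ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model (n INTEGER);
CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);
CREATE TABLE YNN (i INTEGER, j INTEGER);
CREATE TABLE W   (j INTEGER, w REAL);
CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);
""")
conn.execute("INSERT INTO model VALUES (3)")
# Points (0, 0), (2, 0), (10, 10); the first two belong to cluster 1, the third to cluster 2.
conn.executemany("INSERT INTO YV VALUES (?, ?, ?)",
                 [(1, 1, 0.0), (1, 2, 0.0), (2, 1, 2.0), (2, 2, 0.0),
                  (3, 1, 10.0), (3, 2, 10.0)])
conn.executemany("INSERT INTO YNN VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])

# Weights: fraction of all points assigned to each cluster.
conn.execute("INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j")
conn.execute("UPDATE W SET w = w * 1.0 / (SELECT n FROM model)")

# New centroids: per-dimension average of the points assigned to each cluster.
conn.execute("DELETE FROM C")  # assumption: discard the previous centroids first
conn.execute("""
INSERT INTO C
SELECT YV.l, YNN.j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY YV.l, YNN.j
""")

weights = dict(conn.execute("SELECT j, w FROM W"))
centroids = {(l, j): val for l, j, val in conn.execute("SELECT l, j, val FROM C")}
print(weights, centroids)
```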

Compute Variances

INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j;
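For the variance step, a short `sqlite3` sketch shows the three-way join at work: each coordinate of each point is matched to the same coordinate of the centroid of its assigned cluster. The sample data is invented, and `**2` is again written as a product because SQLite lacks that operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);
CREATE TABLE YNN (i INTEGER, j INTEGER);
CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);
CREATE TABLE R   (l INTEGER, j INTEGER, val REAL);
""")
# Points (0, 0) and (2, 0) sit in cluster 1 (centroid (1, 0));
# point (10, 10) sits alone in cluster 2 (centroid (10, 10)).
conn.executemany("INSERT INTO YV VALUES (?, ?, ?)",
                 [(1, 1, 0.0), (1, 2, 0.0), (2, 1, 2.0), (2, 2, 0.0),
                  (3, 1, 10.0), (3, 2, 10.0)])
conn.executemany("INSERT INTO YNN VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])
conn.executemany("INSERT INTO C VALUES (?, ?, ?)",
                 [(1, 1, 1.0), (2, 1, 0.0), (1, 2, 10.0), (2, 2, 10.0)])

# Per cluster j and dimension l: mean squared deviation from the centroid.
conn.execute("""
INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val) * (YV.val - C.val))
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j
""")

variances = {(l, j): val for l, j, val in conn.execute("SELECT l, j, val FROM R")}
print(variances)
```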

Update Model

INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j;

Let’s not do that again!


Why are predictive analytics so hard to express in SQL?

Painful by Design

#1: No Arrays

Sets: rows
Tuples: columns
Arrays

#2: Relational Algebra Sucks

Projection (π), Selection (σ), Rename (ρ)
Natural join: R ⋈ S
Theta join: R ⋈θ S
Semijoin: R ⋉ S, R ⋊ S
Antijoin: R ▷ S
Division: R ÷ S
Left outer join: R ⟕ S
Right outer join: R ⟖ S
Full outer join: R ⟗ S
Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)

• Iteration
• Recursion
• Multiple Dimensions

There’s GOT to be a better way!

Database Extensions

C Extension

UDF: User-Defined Function
UDA: User-Defined Aggregate

Map: map(a)
Reduce: op2(a, b)

init(a), accum(a, b), merge(a, b), final(a)
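The four aggregate callbacks named on this slide (init, accum, merge, final) are easier to follow with a concrete aggregate. The sketch below implements an average in plain Python. It illustrates the pattern only and does not use any particular database's extension API; `init` here takes no argument, unlike the `init(a)` on the slide, and `run_aggregate` with its list of partitions is a made-up stand-in for data spread across segments.

```python
from functools import reduce

def init():
    """Empty state: (running total, row count)."""
    return (0.0, 0)

def accum(state, value):
    """Fold one input value into the state."""
    total, count = state
    return (total + value, count + 1)

def merge(left, right):
    """Combine two partial states computed on different partitions."""
    return (left[0] + right[0], left[1] + right[1])

def final(state):
    """Turn the state into the aggregate's result."""
    total, count = state
    return total / count if count else None

def run_aggregate(partitions):
    # Each partition is aggregated independently, then the partial states are merged.
    partials = [reduce(accum, part, init()) for part in partitions]
    return final(reduce(merge, partials, init()))

segments = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
average = run_aggregate(segments)
print(average)
```

Keeping the state as (total, count) rather than a running mean is what makes `merge` possible: partial averages cannot be combined correctly without knowing how many rows each one covers.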

MADlib is an open-source library for scalable in-database analytics. It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.

MADlib

1. Download the binary

Mac OS X
http://www.madlib.net/files/madlib-0.6-Darwin.dmg

Linux
http://www.madlib.net/files/madlib-0.6-Linux.rpm





2. Start the Installation

Mac OS X
Double-click on installer

Linux
yum install $MADLIB_PACKAGE --nogpgcheck

3. Verify Locatability

Greenplum
source /path/to/greenplum/greenplum_path.sh

PostgreSQL
Make sure psql is in PATH

4. Register MADlib

Greenplum
/usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install

PostgreSQL
/usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install

5. Test Installation

Greenplum
/usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check

PostgreSQL
/usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check

Clustering in MADlib

SELECT * FROM kmeans_random(
  'rel_source', 'expr_point', k,
  [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ]
);

Ahhhhhh......


Our Way or the Highway

Composability


RDBMS Isn’t the Only Game in Town!

Other Approaches

1. Embrace Coding

• Hadoop Ecosystem
  • Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce
• BDAS Ecosystem
  • Spark

2. Reject RDBMS

• Datalog + variants
  • In theory, ideal for many kinds of predictive analytics
  • Suffers from a lack of distributed, feature-complete implementations

2. Reject RDBMS

• Rasdaman / RASQL
  • Arrays but not analytics

Community Editions
http://www.rasdaman.org
2. Reject RDBMS

• MonetDB / SciQL
  • Array extension of SQL
  • Poor analytics

Community Editions
http://www.monetdb.org



2. Reject RDBMS

• SciDB / AFL (AQL)
  • Excellent analytics
  • Limited composability

Community Editions
http://www.scidb.org/forum/viewtopic.php?f=16&t=364/



2. Reject RDBMS

• Precog / Quirrel (simple “R for big data”)
  • Multidimensional, arrays + functions
  • Still immature

Community Editions
http://www.precog.com/editions/precog-for-mongodb (MongoDB)
http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)

Summary

• Increase performance, reduce friction by doing more inside the database
• Not a panacea
  • Hard to do in SQL
  • Hard to do in C (but you may not have to: MADlib)
  • Pre-canned & brittle in most databases
• Ultimately what’s needed is tech designed for advanced analytics

Q&A

John A. De Goes

@jdegoes, [email protected]



References

• Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)
