Predictive analytics have long lived in the domain of statistical tools like R. Increasingly, however, as companies struggle to deal with exploding volumes of data not easily analyzed by small data tools, they are looking at ways of doing predictive analytics directly inside the primary data store. This approach, called in-database predictive analytics, eliminates the need to sample data and perform a separate ETL process into a statistical tool, which can decrease total cost, improve the quality of predictive models, and dramatically shorten development time. In this class, you will learn the pros and cons of in-database predictive analytics, see its main limitations, and survey the tools and technologies necessary to head down the path.
In-Database Predictive Analytics
John A. De Goes
@jdegoes, [email protected]
Agenda
• Introduction
• Abusing SQL
• Painful by Design
• Database Extensions
• MADlib
• Other Approaches
• Summary
Introduction
In-Database Predictive Analytics
In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.
Traditional Predictive Analytics
[Diagram: database, R, SAS]
Data Bottleneck: Painful, Slow
[Diagram: database, R, SAS]
What’s the answer?
“MapReduce”
Move the Code, not the Data!
Advanced Analytics
Let’s Do K-Means in SQL!
Abusing SQL
General Approach in RDBMS
[Diagram: SQL, Feedback, Database, Driver]
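The driver pattern on this slide can be sketched as a loop that sends one iteration's worth of SQL and then reads back feedback to decide whether to continue. What follows is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for the RDBMS and reuses the deck's table names (YV, C, YD, YNN). The function name run_kmeans, the sample points, the tie-break via min(j), and the rule of stopping when assignments stop changing are all assumptions. SQLite has no ** operator, so squares are written as products.

```python
import sqlite3

# One k-means iteration, expressed as the SQL the driver sends each round.
ITERATION_SQL = """
DELETE FROM YD;
INSERT INTO YD
SELECT YV.i, C.j, sum((YV.val - C.val) * (YV.val - C.val))
FROM YV, C WHERE YV.l = C.l GROUP BY YV.i, C.j;

DELETE FROM YNN;
INSERT INTO YNN
SELECT YD.i, min(YD.j)  -- min(j) breaks ties between equidistant clusters
FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist
GROUP BY YD.i;

DELETE FROM C;
INSERT INTO C
SELECT YV.l, YNN.j, avg(YV.val)
FROM YV, YNN WHERE YV.i = YNN.i GROUP BY YV.l, YNN.j;
"""


def run_kmeans(conn, max_iterations=10):
    """Hypothetical driver: send SQL, read feedback, repeat until stable."""
    previous = None
    for iteration in range(1, max_iterations + 1):
        conn.executescript(ITERATION_SQL)
        # Feedback: the current point-to-cluster assignments.
        assignments = conn.execute(
            "SELECT i, j FROM YNN ORDER BY i").fetchall()
        if assignments == previous:
            break
        previous = assignments
    return iteration, assignments


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);
CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);
CREATE TABLE YD  (i INTEGER, j INTEGER, dist REAL);
CREATE TABLE YNN (i INTEGER, j INTEGER);
""")

# Four made-up 2-D points, stored vertically as (i, l, val).
points = {1: (0.0, 0.0), 2: (0.0, 2.0), 3: (10.0, 10.0), 4: (10.0, 12.0)}
conn.executemany(
    "INSERT INTO YV VALUES (?, ?, ?)",
    [(i, l, v) for i, p in points.items() for l, v in enumerate(p, start=1)])

# Made-up starting centroids: cluster 1 at (0, 0), cluster 2 at (0, 2).
conn.executemany(
    "INSERT INTO C VALUES (?, ?, ?)",
    [(1, 1, 0.0), (2, 1, 0.0), (1, 2, 0.0), (2, 2, 2.0)])

iterations, assignments = run_kmeans(conn)
centroids = conn.execute("SELECT l, j, val FROM C ORDER BY l, j").fetchall()
print(iterations, assignments, centroids)
```

The sketch does not handle a cluster that loses all of its points; such a cluster simply disappears from C.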
Our Initial Model
Table: model
d: number of dimensions
k: number of clusters
n: number of points
iteration: number of iterations
avg_q: variance
Our Initial Data Set
Table: Y (n rows)
Columns: Y1 Y2 Y3 ...
Projection & Numbering
Y: Y1 Y2 Y3 ...
YH: i Y1 ... Yd (i = 1, 2, 3, 4, ..., n)
INSERT INTO YH
SELECT sum(1) over(rows unbounded preceding) AS i, Y1, Y2, ..., Yd
FROM Y;
Flattening
YH: i Y1 ... Yd (i = 1, 2, 3, 4, ..., n)
YV: i l val (n x d rows; l = 1, 2, ..., d for each i = 1, 2, ..., n)
INSERT INTO YV SELECT i, 1, Y1 FROM YH;
...
INSERT INTO YV SELECT i, d, Yd FROM YH;
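Because the flattening step needs one INSERT per dimension, the statements are naturally generated by the driver rather than written by hand. A minimal sketch, assuming Python's sqlite3 module as a stand-in database, d = 2, and three made-up rows in YH:

```python
import sqlite3

d = 2  # number of dimensions (assumed small for illustration)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")
conn.executemany(
    "INSERT INTO YH VALUES (?, ?, ?)",
    [(1, 1.0, 2.0), (2, 1.5, 1.8), (3, 8.0, 8.0)])
conn.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")

# One INSERT per dimension. Column names cannot be bound as parameters,
# so the driver builds each statement as text.
for l in range(1, d + 1):
    conn.execute(f"INSERT INTO YV SELECT i, {l}, Y{l} FROM YH")

yv_rows = conn.execute("SELECT i, l, val FROM YV ORDER BY i, l").fetchall()
print(yv_rows)
```

Each of the n rows in YH becomes d rows in YV, which is what lets later steps treat the dimension index l as data.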
Initializing k Cluster Centers
YH: i Y1 ... Yd (i = 1, 2, 3, 4, ..., n)
CH: j Y1 ... Yd (j = 1, 2, 3, 4, ..., k)
INSERT INTO CH SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;
Flattening
CH: j Y1 ... Yd (j = 1, 2, 3, 4, ..., k)
C: l j val (d x k rows; j = 1, 2, ..., k for each l = 1, 2, ..., d)
INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;
Computing Distances to Clusters
INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C
WHERE YV.l = C.l
GROUP BY i, j;
YD: i j dist (n x k rows; j = 1, 2, ..., k for each i = 1, 2, ..., n)
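A small worked case makes the join on the dimension index l easier to follow. This sketch assumes Python's sqlite3 module as a stand-in database, with two made-up points and two made-up centroids. SQLite has no ** operator, so the square is written as a product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YV (i INTEGER, l INTEGER, val REAL);
CREATE TABLE C  (l INTEGER, j INTEGER, val REAL);
CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL);
""")

# Points: i=1 at (0, 0) and i=2 at (3, 4), stored one coordinate per row.
conn.executemany(
    "INSERT INTO YV VALUES (?, ?, ?)",
    [(1, 1, 0.0), (1, 2, 0.0), (2, 1, 3.0), (2, 2, 4.0)])

# Centroids: j=1 at (0, 0) and j=2 at (3, 0).
conn.executemany(
    "INSERT INTO C VALUES (?, ?, ?)",
    [(1, 1, 0.0), (2, 1, 0.0), (1, 2, 3.0), (2, 2, 0.0)])

# Pair every point coordinate with the matching centroid coordinate,
# then sum the squared differences per (point, cluster).
conn.execute("""
INSERT INTO YD
SELECT YV.i, C.j, sum((YV.val - C.val) * (YV.val - C.val))
FROM YV, C
WHERE YV.l = C.l
GROUP BY YV.i, C.j
""")

yd_rows = conn.execute("SELECT i, j, dist FROM YD ORDER BY i, j").fetchall()
print(yd_rows)
```

The values in dist are squared Euclidean distances; no square root is needed to find the nearest cluster.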
Computing Nearest Neighbors
INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i and YD.dist = YMIND.mindist;
YNN (nearest clusters): i j (n rows; i = 1, 2, 3, 4, 5, ..., n)
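The nearest-cluster query is an "argmin by self-join": compute the minimum distance per point, then join back to recover which cluster achieved it. A minimal sketch, assuming Python's sqlite3 module as a stand-in database and a made-up YD table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YD  (i INTEGER, j INTEGER, dist REAL);
CREATE TABLE YNN (i INTEGER, j INTEGER);
""")

# Made-up squared distances from two points to two clusters.
conn.executemany(
    "INSERT INTO YD VALUES (?, ?, ?)",
    [(1, 1, 0.0), (1, 2, 9.0), (2, 1, 25.0), (2, 2, 16.0)])

# YMIND holds each point's minimum distance; joining it back to YD
# selects the row (and therefore the cluster j) with that distance.
conn.execute("""
INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist
""")

ynn_rows = conn.execute("SELECT i, j FROM YNN ORDER BY i").fetchall()
print(ynn_rows)
```

If a point is exactly equidistant from two clusters, this query returns two rows for it, so a real driver would need a tie-break.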
Count Points Per Cluster
INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j;
UPDATE W SET w = w/model.n;
Compute New Centroids
INSERT INTO C
SELECT l, j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j;
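The weight and centroid updates can be checked on a tiny example. This sketch assumes Python's sqlite3 module as a stand-in database, with three made-up points and made-up assignments. It differs from the slide in one respect: the weight is computed in a single statement and multiplied by 1.0, because dividing two integers in SQLite truncates; the subquery that counts rows in YNN stands in for model.n.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);
CREATE TABLE YNN (i INTEGER, j INTEGER);
CREATE TABLE W   (j INTEGER, w REAL);
CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);
""")

# Points (0, 0), (2, 0), (10, 10); the first two belong to cluster 1.
conn.executemany(
    "INSERT INTO YV VALUES (?, ?, ?)",
    [(1, 1, 0.0), (1, 2, 0.0), (2, 1, 2.0), (2, 2, 0.0),
     (3, 1, 10.0), (3, 2, 10.0)])
conn.executemany("INSERT INTO YNN VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])

# Fraction of points per cluster (1.0 forces floating-point division).
conn.execute("""
INSERT INTO W
SELECT j, count(*) * 1.0 / (SELECT count(*) FROM YNN)
FROM YNN GROUP BY j
""")

# New centroid coordinate = mean of member coordinates, per (l, j).
conn.execute("""
INSERT INTO C
SELECT YV.l, YNN.j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY YV.l, YNN.j
""")

weights = conn.execute("SELECT j, w FROM W ORDER BY j").fetchall()
centroids = conn.execute("SELECT l, j, val FROM C ORDER BY l, j").fetchall()
print(weights, centroids)
```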
Compute Variances
INSERT INTO R
SELECT C.l, C.j, avg((YV.val-C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;
Update Model
INSERT INTO R
SELECT C.l, C.j, avg((YV.val-C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;
Let’s not do that again!
Why are predictive analytics so hard to express in SQL?
Painful by Design
#1: No Arrays
Sets: rows
Tuples: columns
Arrays: (none)
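For contrast, here is roughly what one k-means step looks like when points are arrays rather than rows. This is a plain-Python sketch with hypothetical function names (nearest_cluster, kmeans_step) and made-up data, not code from the deck; it keeps an empty cluster's old centroid, which is an assumption.

```python
def nearest_cluster(point, centroids):
    """Index of the centroid with the smallest squared distance."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
             for centroid in centroids]
    return dists.index(min(dists))


def kmeans_step(points, centroids):
    """Assign each point to its nearest centroid, then recompute means."""
    assignments = [nearest_cluster(p, centroids) for p in points]
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == j]
        if not members:
            new_centroids.append(old)  # assumption: keep an empty cluster
            continue
        # zip(*members) iterates over dimensions, i.e. over array columns.
        new_centroids.append([sum(col) / len(members)
                              for col in zip(*members)])
    return assignments, new_centroids


points = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
start = [[0.0, 0.0], [9.0, 9.0]]
assignments, centroids = kmeans_step(points, start)
print(assignments, centroids)
```

With arrays, the dimension index never has to be materialized as a column, so the flattening tables disappear.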
#2: Relational Algebra Sucks
Projection
Selection
Rename
Natural Join: R ⋈ S
Theta Join: R ⋈θ S
Semijoin: R ⋉ S
Antijoin: R ▷ S
Division: R ÷ S
Left outer join: R ⟕ S
Right outer join: R ⟖ S
Full outer join: R ⟗ S
Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)
Iteration, Recursion, Multiple Dimensions
There’s GOT to be a better way!
Database Extensions
C Extension
UDF: User-Defined Function
UDA: User-Defined Aggregate
Map, Reduce
map(a), op2(a, b)
init(a), accum(a, b), merge(a, b), final(a)
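The four UDA callbacks can be illustrated outside any database. The sketch below is plain Python, not a real extension API: the class name MeanAggregate, the helper run_partition, and the simplified signatures (init takes no argument here) are assumptions. It computes a mean by accumulating within each data segment and merging the partial states, which is the property that lets a parallel database run the aggregate on every segment independently.

```python
class MeanAggregate:
    """Hypothetical aggregate whose state is a (total, count) pair."""

    @staticmethod
    def init():
        return (0.0, 0)

    @staticmethod
    def accum(state, value):
        # Fold one input value into the running state.
        total, count = state
        return (total + value, count + 1)

    @staticmethod
    def merge(left, right):
        # Combine two partial states computed on different segments.
        return (left[0] + right[0], left[1] + right[1])

    @staticmethod
    def final(state):
        # Turn the state into the aggregate's result.
        total, count = state
        return total / count if count else None


def run_partition(values):
    """Run init and accum over one segment's values."""
    state = MeanAggregate.init()
    for value in values:
        state = MeanAggregate.accum(state, value)
    return state


# Made-up data spread over three segments.
segments = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
partials = [run_partition(segment) for segment in segments]

state = MeanAggregate.init()
for partial in partials:
    state = MeanAggregate.merge(state, partial)

mean = MeanAggregate.final(state)
print(mean)
```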
MADlib is an open-source library for scalable in-database analytics. It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.
MADlib
1. Download the binary
Mac OS X: http://www.madlib.net/files/madlib-0.6-Darwin.dmg
Linux: http://www.madlib.net/files/madlib-0.6-Linux.rpm
2. Start the Installation
Mac OS X: Double-click on installer
Linux: yum install $MADLIB_PACKAGE --nogpgcheck
3. Verify Locatability
Greenplum: source /path/to/greenplum/greenplum_path.sh
PostgreSQL: Make sure psql is in PATH
4. Register MADlib
Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install
PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install
5. Test Installation
Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check
PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check
Clustering in MADlib
SELECT * FROM kmeans_random(
  'rel_source', 'expr_point', k,
  [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ]
);
Ahhhhhh......
Our Way or the Highway
Composability
RDBMS Isn’t the Only Game in Town!
Other Approaches
1. Embrace Coding
• Hadoop Ecosystem
  • Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce
• BDAS Ecosystem
  • Spark
2. Reject RDBMS
• Datalog + variants
  • In theory, ideal for many kinds of predictive analytics
  • Suffers from a lack of distributed, feature-complete implementations
2. Reject RDBMS
• Rasdaman / RASQL
  • Arrays but not analytics
Community Editions: http://www.rasdaman.org
2. Reject RDBMS
• MonetDB / SciQL
  • Array extension of SQL
  • Poor analytics
Community Editions: http://www.monetdb.org
2. Reject RDBMS
• SciDB / AFL (AQL)
  • Excellent analytics
  • Limited composability
Community Editions: http://www.scidb.org/forum/viewtopic.php?f=16&t=364/
2. Reject RDBMS
• Precog / Quirrel (simple “R for big data”)
  • Multidimensional, arrays + functions
  • Still immature
Community Editions:
http://www.precog.com/editions/precog-for-mongodb (MongoDB)
http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)
Summary
• Increase performance, reduce friction by doing more inside the database
• Not a panacea
  • Hard to do in SQL
  • Hard to do in C (but you may not have to: MADlib)
  • Pre-canned & brittle in most databases
• Ultimately what’s needed is tech designed for advanced analytics
References
• Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)