Upload
pivotalopensourcehub
View
472
Download
0
Embed Size (px)
Citation preview
1Pivotal Confidential–Internal Use Only
BUILT FOR THE SPEED OF BUSINESS
2Pivotal Confidential–Internal Use Only 2Pivotal Confidential–Internal Use Only
MADlib Architecture
3Pivotal Confidential–Internal Use Only
MPP (Massively Parallel Processing)
NetworkInterconnect
... ...
......MasterServers
Query planning & dispatch
SegmentServers
Query processing & data storage
SQLMapReduce
ExternalSourcesLoading,
streaming, etc.
Shared-Nothing Database Architecture
4Pivotal Confidential–Internal Use Only
Architecture
C API(HAWQ, GPDB, PostgreSQL)
Low-level Abstraction Layer(array operations,
C++ to DB type-bridge, …)
RDBMSBuilt-in
Functions
User Interface
Functions for Inner Loops(implements ML logic)
SQL, generated per specification
C++
Eigen
5Pivotal Confidential–Internal Use Only
How do we implement scalability? Example: Linear Regression
• Finding linear dependencies between variables
y ≈ c0 + c1 · x1 + c2 · x2 + …?
y | x1 | …-------+------------- 10.14 | 0 | … 11.93 | 0.69 | … 13.57 | 1.1 | … 14.17 | 1.39 | … 15.25 | 1.61 | … 16.15 | 1.79 | … Design
matrix XVector of dependent variables y
Predictor (x1)
Reg
ress
or (y
)
7Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a bc de fg h
Segment 1
Segment 2
8Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a bc de fg h
Segment 1
Segment 2
a c e gb d f h
Segm
ent 1
Segm
ent 2
9Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a bc de fg h
a c e gb d f h
a2+c2+e2+g2
=Data across nodes are multiplied
10Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a bc de fg h
a c e gb d f h
a2+c2+e2+g2
=Data across nodes are multiplied!
ab+cd+ef+gh
11Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a bc de fg h
a c e gb d f h
a2+c2+e2+g2
=Looks like the result can be decomposed
ab+cd+ef+gh
b2+d2+f2+h2ab+cd+ef+gh
12Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a bc de fg h
a c e gb d f h
a2+c2+e2+g2
=Data across nodes are multiplied!
ab+cd+ef+gh
b2+d2+f2+h2ab+cd+ef+gh
= +a b ef
e fab +c d g
hg hc
d +
13Pivotal Confidential–Internal Use Only
Linear Regression: Streaming AlgorithmHow to compute with a single table scan?
XT
XXT
y
-1
XTyXTX
+ +-1
14Pivotal Confidential–Internal Use Only
Problem solved? … Not Yet Many ML solutions are iterative without analytical
formulationsInitialize problem
Perform single step
Has converged?
Return results
false
true
15Pivotal Confidential–Internal Use Only
In general, use a convex optimization framework
Each step has an analytical formulation that can be performed in parallel
Gradient Descent
Start at a random pointRepeat
Determine a descent direction
Choose a step sizeUpdate the model
Until stopping criterion is satisfied
16Pivotal Confidential–Internal Use Only
Architecture
C API(HAWQ, GPDB, PostgreSQL)
Low-level Abstraction Layer(array operations,
C++ to DB type-bridge, …)
RDBMSBuilt-in
Functions
User Interface
Functions for Inner Loops(implements ML logic)
SQL, generated per specification
C++
Eigen
17Pivotal Confidential–Internal Use Only
Architecture
C API(Greenplum, PostgreSQL, HAWQ)
Low-level Abstraction Layer(array operations,
C++ to DB type-bridge, …)
RDBMSBuilt-in
Functions
User Interface
High-level Iteration Layer(iteration controller, …)
Functions for Inner Loops(implements ML logic)
Python
SQL, generated per specification
C++ Eigen
18Pivotal Confidential–Internal Use Only 18Pivotal Confidential–Internal Use Only
But not all data scientists speak SQL …Accessing scalability through R
19Pivotal Confidential–Internal Use Only
Why R?
O’Reilly: Strata 2013 Data Science Salary Survey
“The preponderance of R and Python usage is more surprising … two most commonly used individual tools, even above Excel. R and Python are likely popular because they are easily accessible and effective open source tools.”
20Pivotal Confidential–Internal Use Only
PivotalR: Bringing MADlib and HAWQ to a familiar R interface
ChallengeWant to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics
d <- db.data.frame(”houses")houses_linregr <-
madlib.lm(price ~ tax +
bath +
size,
data=d)
Pivotal R
SELECT madlib.linregr_train( 'houses’,'houses_linregr’,
'price’,'ARRAY[1, tax, bath, size]’);
SQL Code
21Pivotal Confidential–Internal Use Only
PivotalR Design Overview
2. SQL to execute
3. Computation results
1. R SQL
RPostgreSQL
PivotalR
Data lives hereNo data here
Database/HAWQ w/ MADlib
• Syntax is analogous to native R function
• Data doesn’t need to leave the database• All heavy lifting, including model estimation
& computation, are done in the database
22Pivotal Confidential–Internal Use Only 22Pivotal Confidential–Internal Use Only
Demo
23Pivotal Confidential–Internal Use Only
library(PivotalR)
db.connect(port = 14526, dbname = "madlib")
db.objects()
x <- db.data.frame("madlibtestdata.dt_abalone")
dim(x)
names(x)
x$rings
lookat(x, 10) # look at a sample of table
mean(x$rings)
lookat(mean(x$rings))
fit <- madlib.lm(rings ~ . - id | sex, data = y)
predict(fit, x)
mean((x$rings - predict(fit, x))^2)
x$sex <- as.factor(v$sex)
m0 <- madlib.glm(resp ~ age,
family="binomial", data=dbbank)
mstep <- step(m0, scope=list( lower=~age, upper=~age + factor(marital) + factor(education) + factor(housing) + factor(loan) + factor(job)))
Load the Library
Connect to the database “madlib” on port 14526
List all the tables in the active connection
Create an R object that references a table in the database
Report #/rows and #/columns in the table
Column names within the table
Database query object representing “select rings from madlibtestdata.dt_abalone”
Pull 10 rows of data from the table back into the R environment
query object representing “select avg(rings) from madlibtestdata.dt_abalone”
execute the query and report back the result
Run a linear regression within the database and return a model object
Create a query object representing scoring the model in the database
Query object calculating the mean square error of the model
Add a calculated factor column to the database query object
Calculate a logistic regression model
Perform stepwise feature selection
Demonstration
26Pivotal Confidential–Internal Use Only
Class hierarchy
db.obj
db.data.frame db.Rquery
db.table db.view
Wrapper of objects in databasex = db.data.frame("table")
Resides in R onlyx[,1:2], merge(x, y, by="column")
Operations/ MADlib
functions
lookat
as.db.data.frame
operation
27Pivotal Confidential–Internal Use Only
Some of current features
A wrapper of MADlib
• Generalized linear models
(lm, glm)
• Elastic Net (elnet)
• Cross validation (generic.cv)
• ARIMA
• Tree methods
(rpart, randomforest)
• Table summary
• $ [ [[ $<- [<- [[<-
• is.na
+ - * / %% %/% ^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview• sort
• c mean sum sd var min max length colMeans colSums
• db.connect db.disconnect db.list db.objects
db.existsObject delete• dim • names• as.factor()
• content
And more ... (SQL wrapper)
• predict
28Pivotal Confidential–Internal Use Only
We’re looking for contributors
• Browse our help pages– Start page: madlib.net– Github pages
• github.com/apache/incubator-madlib (SQL)• github.com/pivotalsoftware/pivotalr (R)• github.com/pivotalsoftware/pymadlib (Python)
• Use our product and report issues: • https://issues.apache.org/jira/browse/MADLIB (Issue tracker)• [email protected] (User forum)• [email protected] (Developer forum)
29Pivotal Confidential–Internal Use Only
Credits
Leaders and contributors:
Gavin SherryCaleb WeltonJoseph HellersteinChristopher RéZhe Wang
Florian Schoppmann
Hai QianShengwen YangXixuan Feng
and many others …
30Pivotal Confidential–Internal Use Only 30Pivotal Confidential–Internal Use Only
Thank you for your attention
Important links:
Product email: [email protected]
Product site: madlib.net
31Pivotal Confidential–Internal Use Only 31Pivotal Confidential–Internal Use Only
Backup slides
32Pivotal Confidential–Internal Use Only
Performing a linear regression on 10 million rows in seconds
Hellerstein et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
33Pivotal Confidential–Internal Use Only
Reminder: Linear-Regression Model
• • If residuals i.i.d. Gaussians with standard deviation σ:
– max likelihood ⇔ min sum of squared residuals
• First-order conditions for the following quadratic objective (in c)
yield the minimizer
34Pivotal Confidential–Internal Use Only
Linear Regression: Streaming Algorithm
How to compute with a single table scan?
XT
XXT
y
-1
XTX XTy
35Pivotal Confidential–Internal Use Only
PivotalR Architecture
36Pivotal Confidential–Internal Use Only
37Pivotal Confidential–Internal Use Only 37Pivotal Confidential–Internal Use Only
PL/X Procedural Languages
38Pivotal Confidential–Internal Use Only
PivotalR vs PL/R
PivotalR• Interface is R client• Execution is in database• Parallelism handled by
PivotalR• Supports a portion of R
R> x = db.data.frame(“t1”)
R> l = madlib.lm(interlocks ~ assets + nation, data = t)
PL/R• Interface is SQL client• Execution is in R• Parallelism via SQL
function invocation• Supports all of R
psql> CREATE FUNCTION lregr() …
LANGUAGE PLR;
psql> SELECT lregr( array_agg(interlocks),
array_agg(assets),
array_agg(nation) )
FROM t1;
39Pivotal Confidential–Internal Use Only
Parallelized R in Pivotal via PL/R: An Example
SQL & R
R piggy-backs on Pivotal’s parallel architecture Minimize data movement Build predictive model for each state in parallel
TN Data
CA Data
NY Data
PA Data
TX Data
CT Data
NJ Data
IL Data
MA Data
WA Data
TN Model
CA Model
NY Model
PA Model
TX Model
CT Model
NJ Model
IL Model
MA Model
WA Model