With the explosion of big data, the need for fast and inexpensive analytics solutions has become a key basis of competition in many industries. Extracting the value of big data with analytics can be complex and requires advanced skills. At Pivotal, we are building open-source solutions (MADlib, PivotalR, PyMadlib) to simplify this process for the user while maintaining the efficiency necessary for big data analysis. This talk provides information about MADlib, an open-source library of SQL-based algorithms for machine learning, data mining, and statistics that run at large scale within a database engine, with no need for data import/export to other tools. It gives an overview of the library’s architecture and compares various statistical methods with those available in Apache Mahout. We also introduce PivotalR, an R-based wrapper for MADlib that gives data scientists and programmers access to the power of MADlib along with the ease of use of R.
1 Pivotal Confidential–Internal Use Only
BUILT FOR THE SPEED OF BUSINESS
Big Data Analytics MADlib and PivotalR: Scalable Machine Learning for Massively Parallel Databases
Rahul Iyer, Senior Software Developer, Predictive Analytics
March 4th, 2014
Pivotal OSS Meetups
Agenda for the talk
• Introduce MADlib, a distributed machine learning library for SQL users
• How is scalability achieved by distributing the computation?
• Performance metrics + comparisons with Mahout
• A new R interface to access all of MADlib’s features
• How does it get big-data results with small-data efforts?
• Demo to showcase PivotalR
What is Big data?
• Volumes of data …
• In various formats …
• From multiple sources …
and Analytics?
• Generate insights …
• for informed decision-making
Data → Information → Insights: Traditional analytics pipeline
[Figure: traditional pipeline with stages Data Prep, DB Extract, DB Import; intermediate files sample.csv, spec.docx, scores.csv; horizontal axis: Time to Insights]
Data → Information → Insights: The MAD approach
[Figure: Enterprise Data held in several RDBMS instances; stages Data Prep, Model, Score; horizontal axis: Time-to-Insights]
• Reduced data movement
• Billions of rows in minutes
What is MADlib?
The MADlib project was initiated in 2011 by Greenplum architects and Joe Hellerstein of the University of California, Berkeley.
• MAD stands for:
• lib stands for SQL library of:
  • advanced (mathematical, statistical, machine learning)
  • parallel & scalable in-database functions
UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.
Which platforms does it run on?
HDFS
HAWQ Impala
GPDB PostgreSQL
(Partly ported)
MPP (Massively Parallel Processing)
Shared-Nothing Database Architecture
• Master Servers: query planning & dispatch
• Network Interconnect
• Segment Servers: query processing & data storage
• Interfaces: SQL, MapReduce
• External Sources: loading, streaming, etc.
Analytics Pipeline
Data Prep: Aggregation, Normalizing, Pivoting, Filtering
Data Exploration: Summary function, Sketch estimators, Percentiles, Correlation matrix
Predictive Modeling / Data mining
• Supervised Learning
  – Generalized Linear models: Linear Regression, Logistic Regression, Multinomial logit …
  – Decision Trees and Random Forest
  – Naive Bayes Classification
  – Support Vector Machines
  – Cox-Prop Hazards
  – and more …
• Unsupervised Learning: Association Rules, k-Means Clustering, Low-rank Matrix Factorization, PCA, SVD Matrix Factorization
• Text analytics: CRF, LDA
• Sampling methods: Cross Validation
Scoring: Linear Regression, Logistic Regression, Naïve Bayes …
Model fitness
• Statistical metrics: Descriptive statistics, Goodness of fit, Inferential statistics, ROC
Support modules: Array operations, Sparse Vectors, Probability functions
Example usage
Train a model
Predict for new data
How do we implement scalability? Example: Linear Regression
• Finding linear dependencies between variables
y ≈ c0 + c1 · x1 + c2 · x2 ?

   y   |  x1  | x2
-------+------+-----
 10.14 | 0    | 0.3
 11.93 | 0.69 | 0.6
 13.57 | 1.1  | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

Design matrix X: columns x1, x2
Vector of dependent variables y: column y
[Figure: scatter plot, horizontal axis Predictor (x1), vertical axis Regressor (y)]
Challenges in computing the OLS solution
The rows of X are stored on different segments: (a, b) on Segment 1 and (c, d) on Segment 2.
$$X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix}$$
Data across nodes are multiplied!
Looks like the result can be decomposed. Let’s change perspective:
$$X^T X = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix} \begin{pmatrix} c & d \end{pmatrix}$$
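The change of perspective above can be checked numerically. The following is a minimal Python sketch, not MADlib code: the helper names (`outer`, `mat_add`, `gram_direct`, `gram_by_rows`) and the sample values are made up for illustration. It compares the textbook product with the sum of per-row outer products, which is the form each segment can compute on its own rows.

```python
def outer(u, v):
    """Outer product u v^T of two vectors, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def mat_add(p, q):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rp, rq)] for rp, rq in zip(p, q)]


def gram_direct(rows):
    """X^T X computed entry by entry; every entry touches every row."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)]
            for i in range(n)]


def gram_by_rows(rows):
    """X^T X accumulated as a sum of per-row outer products x x^T."""
    n = len(rows[0])
    total = [[0.0] * n for _ in range(n)]
    for x in rows:
        total = mat_add(total, outer(x, x))
    return total


# Hypothetical values: row (a, b) on segment 1, row (c, d) on segment 2.
X = [[1.0, 2.0],
     [3.0, 4.0]]

print(gram_direct(X))
print(gram_by_rows(X))
```

Because each outer product uses a single row, no row ever has to leave the segment that stores it; only the small partial sums are exchanged.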
Linear Regression: Streaming Algorithm
How to compute with a single table scan?
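$$\hat{\beta} = (X^T X)^{-1} X^T y, \qquad X^T X = \sum_i x_i x_i^T, \qquad X^T y = \sum_i x_i y_i$$
Here $x_i^T$ is the i-th row of X and $y_i$ is the i-th entry of y, so both sums can be accumulated row by row in one scan.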
Linear Regression: Parallel Computation
$$X^T y = X_1^T y_1 + X_2^T y_2$$
Segment 1 computes $X_1^T y_1$, Segment 2 computes $X_2^T y_2$, and the Master adds the partial results to obtain $X^T y$.
Basic Building Block: User-Defined Aggregates

Input rows (x, y):
  x = (1,0,3,…,5), y = 3
  x = (-2,4,5,…,2), y = 2
  …

Aggregation phase 1 on each node (map):
1. Initialize: (A,b) = (0,0)
2. Transition for all rows: (A,b) = (A,b) + (x ⋅ xT, x ⋅ y)
3. Send (A,b)

Aggregation phase 2 on master node (reduce):
1. Merge: (A,b) = (A,b) + (A,b) received from each node
2. Finalize: β̂ = solve(A,b) = A−1 ⋅ b
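The four aggregate steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not MADlib's implementation: the function names `init`, `transition`, `merge`, `finalize`, and `aggregate` are hypothetical, the split into two segments is arbitrary, an intercept column of ones is added to each row, and the data are the six rows from the linear regression example earlier in the talk.

```python
def init(n):
    """Initialize: (A, b) = (0, 0)."""
    return [[0.0] * n for _ in range(n)], [0.0] * n


def transition(state, x, y):
    """Transition for one row: (A, b) = (A, b) + (x x^T, x y)."""
    A, b = state
    for i, xi in enumerate(x):
        b[i] += xi * y
        for j, xj in enumerate(x):
            A[i][j] += xi * xj
    return A, b


def merge(s1, s2):
    """Merge two partial states by adding them component-wise."""
    (A1, b1), (A2, b2) = s1, s2
    A = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    b = [p + q for p, q in zip(b1, b2)]
    return A, b


def finalize(state):
    """Finalize: solve A beta = b (Gaussian elimination, partial pivoting)."""
    A, b = state
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (M[i][n] - tail) / M[i][i]
    return beta


def aggregate(rows):
    """Phase 1 on one node: a single pass over that node's rows."""
    state = init(len(rows[0][0]))
    for x, y in rows:
        state = transition(state, x, y)
    return state


# Rows ((1, x1, x2), y); the leading 1 models the intercept c0.
rows = [((1.0, 0.00, 0.3), 10.14), ((1.0, 0.69, 0.6), 11.93),
        ((1.0, 1.10, 0.9), 13.57), ((1.0, 1.39, 1.2), 14.17),
        ((1.0, 1.61, 1.5), 15.25), ((1.0, 1.79, 1.8), 16.15)]
segment1, segment2 = rows[:3], rows[3:]  # arbitrary split across two nodes

# Phase 2 on the master: merge the partial states, then solve.
beta = finalize(merge(aggregate(segment1), aggregate(segment2)))
print(beta)  # estimates of c0, c1, c2
```

The state sent by each node has a fixed size that depends only on the number of variables, not on the number of rows, which is what keeps the data movement small.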
Problem solved? … Not Yet
• Many ML solutions are iterative without analytical formulations
Initialize problem → Perform optimization step → Has converged? (false: perform another optimization step; true: Return results)
Use a convex optimization framework
 # segments | # variables | # rows (million) | v0.3 (s) | v0.2.1beta (s) | v0.1alpha (s)
------------+-------------+------------------+----------+----------------+---------------
          6 |          10 |               10 |    4.447 |          9.501 |         1.337
          6 |          20 |               10 |    4.688 |          11.60 |         1.874
          6 |          40 |               10 |    6.843 |          17.96 |         3.828
          6 |          80 |               10 |    13.28 |          52.94 |         12.98
          6 |         160 |               10 |    35.66 |          181.4 |         51.20
          6 |         320 |               10 |    186.2 |          683.8 |         333.4
         12 |          10 |               10 |    2.115 |          4.756 |        0.9600
         12 |          20 |               10 |    2.432 |          5.760 |         1.212
         12 |          40 |               10 |    3.420 |          9.010 |         2.046
         12 |          80 |               10 |    6.797 |          26.48 |         6.469
         12 |         160 |               10 |    17.71 |          90.95 |         25.67
         12 |         320 |               10 |    92.41 |          341.5 |         166.6
         18 |          10 |               10 |    1.418 |          3.206 |        0.6197
         18 |          20 |               10 |    1.648 |          3.805 |         1.003
         18 |          40 |               10 |    2.335 |          5.994 |         1.183
         18 |          80 |               10 |    4.461 |          17.73 |         4.314
         18 |         160 |               10 |    11.90 |          60.58 |         17.14
         18 |         320 |               10 |    61.66 |          227.7 |         111.4
         24 |          10 |               10 |    1.197 |          2.383 |        0.3904
         24 |          20 |               10 |    1.276 |          2.869 |        0.4769
         24 |          40 |               10 |    1.698 |          4.475 |         1.151
         24 |          80 |               10 |    3.363 |          13.35 |         3.263
         24 |         160 |               10 |    8.840 |          45.48 |         13.10
         24 |         320 |               10 |    46.18 |          171.7 |         84.59

Figure 4: Linear-regression execution times
search. In our prototype implementation in MADlib, we picked up one such simple greedy algorithm, called stochastic (or sometimes, “incremental”) gradient descent (SGD) [33, 6], that goes back to the 1960s. SGD is an approximation of gradient methods that is useful when the convex function we are considering, f(x), has the form:

$$f(x) = \sum_{i=1}^{N} f_i(x)$$

If each of the $f_i$ is convex, then so is f [8, pg. 38]. Notice that all problems in Table 2 are of this form: intuitively each of these models is finding some model (i.e., a vector w) that is scored on many different training examples. SGD leverages the above form to construct a rough estimate of the gradient of f using the gradient of a single term: for example, the estimate if we select i is the gradient of $f_i$ (that we denote $G_i(x)$). The resulting algorithm is then described as:

$$x \leftarrow x - \alpha N \cdot G_i(x) \qquad (1)$$

This approximation is guaranteed to converge to an optimal solution [26].

Using the MADlib framework. In our setting, each tuple in the input table for an analysis task encodes a single $f_i$. We use the micro-programming interfaces of Sections 3.2 and 3.3 to perform the mapping from the tuples to the vector representation that is used in Eq. 1. Then, we observe Eq. 1 is simply an expression over each tuple (to compute $G_i(x)$) which is then averaged together. Instead of averaging a single number, we average a vector of numbers. Here, we use the macro-programming provided by MADlib to handle all data access, spills to disk, parallelized scans, etc. Finally, the macro programming layer helps us test for convergence (which is implemented with either a python combination or C driver). Using this approach, we were able to add in implementations of all the models in Table 2 in a matter of days.

In an upcoming paper we report initial experiments showing that our SGD based approach achieves higher performance than prior data mining tools for some datasets [13].

Figure 6: The Archetypical Convex Function $f(x) = x^2$.

Table 2: Models currently implemented in MADlib using the SGD-based approach.

 Application          | Objective
----------------------+-----------
 Least Squares        | $\sum_{(u,y) \in \Omega} (x^T u - y)^2$
 Lasso [38]           | $\sum_{(u,y) \in \Omega} (x^T u - y)^2 + \mu \|x\|_1$
 Logistic Regression  | $\sum_{(u,y) \in \Omega} \log(1 + \exp(-y x^T u))$
 Classification (SVM) | $\sum_{(u,y) \in \Omega} (1 - y x^T u)_+$
 Recommendation       | $\sum_{(i,j) \in \Omega} (L_i^T R_j - M_{ij})^2 + \mu \|L, R\|_F^2$
 Labeling (CRF) [40]  | $\sum_k \left[ \sum_j x_j F_j(y_k, z_k) - \log Z(z_k) \right]$
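Update rule (1) can be illustrated on the least-squares objective from Table 2. The sketch below is a plain Python illustration, not MADlib code: the function name `sgd_least_squares`, the step size, the number of steps, and the noise-free data generated from y = 2 + 3t are all assumptions chosen so that the iteration is easy to follow.

```python
import random


def sgd_least_squares(examples, alpha, steps, seed=0):
    """Minimize sum_i (x^T u_i - y_i)^2 using x <- x - alpha * N * G_i(x)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(examples)
    x = [0.0] * len(examples[0][0])
    for _ in range(steps):
        u, y = examples[rng.randrange(n)]  # select a single term f_i
        residual = sum(xj * uj for xj, uj in zip(x, u)) - y
        grad = [2.0 * residual * uj for uj in u]  # G_i(x), gradient of f_i
        x = [xj - alpha * n * gj for xj, gj in zip(x, grad)]
    return x


# Hypothetical noise-free training examples (u, y) with u = (1, t).
examples = [((1.0, t), 2.0 + 3.0 * t) for t in (0.0, 0.5, 1.0, 1.5, 2.0)]

model = sgd_least_squares(examples, alpha=0.01, steps=5000)
print(model)  # should approach the generating coefficients (2, 3)
```

Each update touches one tuple only, which matches the observation in the excerpt that Eq. 1 is an expression over each tuple. With noisy data a constant step size would only reach a neighborhood of the optimum, so a decaying step size is the usual choice there.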
- Each step has an analytical formulation that can be performed in parallel
1. Lack of portable multi-pass iterations
• WITH RECURSIVE is not a reliable basis for portability
• User-defined driver functions in Python
  – Outer loops are not performance-critical
• Compromise: Different user interface

Driver loop:
CREATE TEMP TABLE temp
INSERT INTO temp SELECT step(...) FROM ...
SELECT converged(...) FROM temp, ...   (false: repeat the INSERT step; true: continue)
SELECT result(...) FROM temp
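The driver pattern above (step, converged, result) can be mimicked outside the database. The following is a hedged Python sketch, not the MADlib driver: `run_driver`, `kmeans_step`, and `kmeans_converged` are hypothetical names, a Python list stands in for the temp table, and one-dimensional k-means on made-up points stands in for the per-iteration aggregate.

```python
def run_driver(step, converged, result, initial_state, max_iter=100):
    """Outer loop: call step until converged reports True, then finalize."""
    temp = [initial_state]  # plays the role of the temp table of states
    for _ in range(max_iter):
        temp.append(step(temp[-1]))  # INSERT INTO temp SELECT step(...)
        if converged(temp[-2], temp[-1]):  # SELECT converged(...) FROM temp
            break
    return result(temp[-1])  # SELECT result(...) FROM temp


# Hypothetical one-dimensional data with two obvious clusters.
points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]


def kmeans_step(centroids):
    """One k-means iteration: assign points, then recompute centroids."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda k: abs(p - centroids[k]))
        groups[nearest].append(p)
    # An empty cluster keeps its previous centroid.
    return [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]


def kmeans_converged(old, new, tol=1e-9):
    """Converged when no centroid moved by more than tol."""
    return max(abs(a - b) for a, b in zip(old, new)) < tol


centroids = run_driver(kmeans_step, kmeans_converged, sorted, [0.0, 6.0])
print(centroids)
```

The outer loop runs only a handful of times, so writing it in an interpreted language costs little; the work that scales with the data sits inside `step`, which in the database is an aggregate executed in parallel.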
Architecture
Layers:
• User Interface
• High-level Abstraction Layer (iteration controller, ...)
• Functions for Inner Loops (implements convex optimization)
• Low-level Abstraction Layer (matrix operations, C++ to DB type bridge, …)
• RDBMS Built-in Functions
• RDBMS Query Processing (Greenplum, PostgreSQL, Hadoop with SQL)
Implementation languages: Python; SQL, generated per specification; C++
3. Lack of language support for linear algebra
• C++ Abstraction Layer uses Eigen
• (Dense) Vectors and matrices: DOUBLE PRECISION[]
• Example:
    AnyType solve::run(AnyType& args) {
        MappedMatrix A = args[0].getAs<MappedMatrix>();
        MappedColumnVector b = args[1].getAs<MappedColumnVector>();

        MutableMappedColumnVector x = allocateArray<double>(A.cols());
        x = A.colPivHouseholderQr().solve(b);
        return x;
    }
Performance:
• No unnecessary copying
• No internal type conversion
The MADlib Vision
• Academic and industry contributions
• Think of “CRAN for databases”
  – Repository of open-source ML algorithms
  – This time with data parallelism in mind
• Open-Source Framework
  – BSD License
Performance trends
• Disk I/O is not always the bottleneck
• Performance tuning is essential
• Overhead for a single query is very low (fraction of a second)
• Greenplum achieves nearly perfect (close to linear) speedup
[Figure: OLS on 10 million rows (in seconds), vertical axis 0 to 40, plotted against # segments (6, 12, 18, 24) for # variables: 20, 40, 80, 160]
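The speedup claim can be checked against the v0.3 timings reported in Figure 4. The short Python sketch below only does arithmetic on numbers copied from that figure; the helper names `speedup` and `efficiency` are made up for illustration, and the 6-segment run is taken as the baseline.

```python
# v0.3 execution times in seconds for 10 million rows, copied from Figure 4.
# Outer key: number of variables; inner key: number of segments.
times_v03 = {
    10: {6: 4.447, 12: 2.115, 18: 1.418, 24: 1.197},
    320: {6: 186.2, 12: 92.41, 18: 61.66, 24: 46.18},
}


def speedup(times, base=6):
    """Speedup of each configuration relative to the baseline segment count."""
    return {seg: times[base] / t for seg, t in times.items()}


def efficiency(times, base=6):
    """Measured speedup divided by the ideal (linear) speedup seg / base."""
    return {seg: s / (seg / base) for seg, s in speedup(times, base).items()}


for n_vars, times in times_v03.items():
    for seg in sorted(times):
        print(n_vars, seg,
              round(speedup(times)[seg], 2),
              round(efficiency(times)[seg], 2))
```

Going from 6 to 24 segments is a fourfold increase in resources. For 320 variables the measured speedup is about 4.03, so the efficiency is roughly 1.0; for 10 variables it is about 3.72, where fixed per-query overhead weighs more on a run that takes only about a second.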
Performance Comparison with Apache Mahout
• Analytics WorkBench (http://www.gopivotal.com/big-data/analytics-workbench)
  – 1000-node cluster located in Las Vegas
  – Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
  – 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
  – Infrastructure: Pivotal HD 1.1
• Mahout v0.7
• Test matrix*
  – Data size
    ▪ KDD Cup 2009 Orange marketing churn data (16.5 GB)
    ▪ Enron data (1.9 GB)
    ▪ Census data 2000 (1.7 GB)
  – Algorithms: Logistic Regression and K-means
  – Algorithm parameters (e.g. convergence threshold, # iterations)
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
* Reporting a subset of results from whitepaper.
Logistic Regression
[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Horizontal axis: log(Number of Rows), 1,000,000 to 1E+09; vertical axis: Time in Minutes, 0 to 700. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib].]
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
Logistic Regression
[Figure: horizontal axis: log(Number of Rows), 1,000,000 to 1E+09; vertical axis: Time in Minutes, 0 to 9.]
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
K-Means
[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Horizontal axis: log(Number of Rows), 1,000,000 to 1E+09; vertical axis: Time in Minutes, 0 to 350. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib].]
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
Random Forest
[Figure: horizontal axis: log(Number of Rows), 1,000,000 to 1E+09; vertical axis: Time in Minutes, 0 to 1600. Series: Census data, 46 attributes [Mahout]; Census data, 46 attributes [MADlib].]
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
Part 1 Summary
MADlib is an easy-to-use library that provides a SQL interface to fast, scalable machine learning algorithms …
But not all Data Scientists speak SQL … Accessing Scalability through R
Why R?
O’Reilly: 2013 Data Science Salary Survey
From the report: “The preponderance of R and Python usage is more surprising … two most commonly used individual tools, even above Excel. R and Python are likely popular because they are easily accessible and effective open source tools.”
Execution in Database
• All data stays in DB: R objects merely point to DB objects
• All model estimation and heavy lifting done in DB by MADlib
• R → SQL translation done in the R client
• Only strings of SQL and model output transferred across DBI
SQL to execute MADlib
Model output
36 © Copyright 2014 Pivotal. All rights reserved.
PivotalR Design Overview
[Diagram: PivotalR (R → SQL; no data here) sends SQL to execute through RPostgreSQL to the Database w/ MADlib (data lives here), which returns computation results]
• Call MADlib’s in-DB machine learning functions directly from R
• Syntax is analogous to native R functions
• Data doesn’t need to leave the database
• All heavy lifting, including model estimation & computation, is done in the database
Woo Jung
http://gopivotal.github.io/PivotalR/
Courtesy Woo Jung and Hai Qian
Some of the current features
A wrapper of MADlib
• Linear regression
• Logistic regression
• Elastic Net
• ARIMA
• Table summary
• Categorical variable: as.factor()
• $ [ [[ $<- [<- [[<-
• is.na
• + - * / %% %/% ^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview
• sort
• c mean sum sd var min max length colMeans colSums
• db.connect db.disconnect db.list db.objects db.existsObject delete
• dim
• names
• content
And more ... (SQL wrapper)
• predict
# Load the library
library(PivotalR)
# Connect to the database "madlib" on port 14526
db.connect(port = 14526, dbname = "madlib")
# List all the tables in the active connection
db.objects()
# Create an R object that references a table in the database
x <- db.data.frame("madlibtestdata.dt_abalone")
# Report the number of rows and columns in the table
dim(x)
# Column names within the table
names(x)
# Database query object representing "select rings from madlibtestdata.dt_abalone"
x$rings
# Pull 10 rows of data from the table back into the R environment
lookat(x, 10)
# Query object representing "select avg(rings) from madlibtestdata.dt_abalone"
mean(x$rings)
# Execute the query and report back the result
lookat(mean(x$rings))
# Run a linear regression within the database and return a model object
# (the slide passes data = y; x is assumed here, since it is the only table object defined above)
fit <- madlib.lm(rings ~ . - id | sex, data = x)
# Create a query object representing scoring the model in the database
predict(fit, x)
# Query object calculating the mean square error of the model
mean((x$rings - predict(fit, x))^2)
# Add a calculated factor column to the database query object
# (the slide reads as.factor(v$sex); x$sex is assumed here)
x$sex <- as.factor(x$sex)
# Calculate a logistic regression model
# (dbbank is assumed to be a db.data.frame created elsewhere in the demo)
m0 <- madlib.glm(resp ~ age,
                 family = "binomial", data = dbbank)
# Perform stepwise feature selection
mstep <- step(m0, scope = list(lower = ~age,
                               upper = ~age + factor(marital) + factor(education) +
                                        factor(housing) + factor(loan) + factor(job)))
Demonstration
We’re looking for contributors
• Browse our help pages
  – Start page: madlib.net
  – GitHub pages
    • github.com/madlib/madlib (SQL)
    • github.com/gopivotal/pivotalr (R)
    • github.com/gopivotal/pymadlib (Python)
  – Use our product and report issues:
    • jira.madlib.net (Issue tracker)
    • [email protected] (User forum)
• Can use PostgreSQL or Greenplum Database Community Edition for installations on multiple platforms
Credits
Leaders and contributors:
Gavin Sherry, Caleb Welton, Joseph Hellerstein, Christopher Ré, Zhe Wang, Florian Schoppmann, Hai Qian, Shengwen Yang, Aaron Feng, and many others …
Thank you for your attention
Important links:
Product email: [email protected]
Product site: madlib.net
Speaker email: [email protected]