Pivotal OSS meetup - MADlib and PivotalR

1 Pivotal Confidential–Internal Use Only

BUILT FOR THE SPEED OF BUSINESS


Big Data Analytics MADlib and PivotalR: Scalable Machine Learning for Massively Parallel Databases

Rahul Iyer, Senior Software Developer, Predictive Analytics. March 4, 2014

Pivotal OSS Meetups


Agenda for the talk

•  Introduce MADlib, a distributed machine learning library for SQL users

•  How is scalability achieved by distributing the computation?

•  Performance metrics + comparisons with Mahout

•  A new R interface to access all of MADlib’s features

•  How does it get big-data results with small-data efforts?

•  Demo to showcase PivotalR


What is Big data?

•  Volumes of data …

•  In various formats …

•  From multiple sources …

and Analytics?

•  Generate insights …

•  for informed decision-making


Data → Information → Insights: Traditional analytics pipeline

sample.csv

Time-to-Insights

Data Prep    DB Extract    DB Import    spec.docx    scores.csv


Data → Information → Insights: The MAD approach

Enterprise Data

RDBMS    RDBMS    RDBMS    RDBMS

Time-to-Insights

Data Prep    Model    Score

Reduced Data Movement

Billions of rows in minutes


What is MADlib?

The MADlib project was initiated in 2011 by Greenplum architects and Joe Hellerstein of the University of California, Berkeley.

•  MAD stands for:

•  lib stands for SQL library of:
   •  advanced (mathematical, statistical, machine learning)
   •  parallel & scalable in-database functions


UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.

1- dude, you got skills. 2- dude, you got mad skills.


Which platforms does it run on?

HDFS

HAWQ Impala

GPDB PostgreSQL

(Partly ported)


MPP (Massively Parallel Processing)

Network Interconnect

... ...

... ... Master Servers

Query planning & dispatch

Segment Servers

Query processing & data storage

SQL MapReduce

External Sources: Loading, streaming, etc.

Shared-Nothing Database Architecture


Supervised Learning •  Generalized Linear models

•  Linear Regression •  Logistic Regression •  Multinomial logit …

•  Decision Trees and Random Forest •  Naive Bayes Classification •  Support Vector Machines •  Cox-Prop Hazards

and more …

Analytics Pipeline

Predictive Modeling Data Exploration

Summary function Sketch estimators Percentiles Correlation matrix

Data Prep

Aggregation Normalizing Pivoting Filtering

Text analytics •  CRF •  LDA

Unsupervised Learning • Association Rules •  k-Means Clustering • Low-rank Matrix Factorization • PCA • SVD Matrix Factorization

Data mining

Sampling methods •  Cross Validation

Scoring

Scoring •  Linear Regression •  Logistic Regression •  Naïve Bayes …

Statistical metrics • Descriptive statistics • Goodness of fit •  Inferential statistics • ROC

Model fitness

Support modules •  Array operations •  Sparse Vectors •  Probability functions


Example usage

Train a model

Predict for new data


How do we implement scalability? Example: Linear Regression

•  Finding linear dependencies between variables

$$y \approx c_0 + c_1 \cdot x_1 + c_2 \cdot x_2 \;?$$

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

Design matrix X: columns x1, x2

Vector of dependent variables y: column y

[Figure: scatter plot; x-axis: Predictor (x1); y-axis: Regressor (y)]


Challenges in computing OLS solution

$$X = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad X^T = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$$

Row (a b) of X is stored on Segment 1, row (c d) on Segment 2.

$$X^T X = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix}$$

Data across nodes are multiplied!

Looks like the result can be decomposed. Let’s change perspective:

$$X^T X = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix} \begin{pmatrix} c & d \end{pmatrix}$$
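The two-row example generalizes to any number of rows. As a sketch in the notation of the slides, where $x_i$ denotes the i-th row of X written as a column vector and n is the number of rows (both symbols are introduced here for illustration), each row contributes one term that can be computed where the row is stored:

```latex
X^T X = \sum_{i=1}^{n} x_i x_i^T,
\qquad
X^T y = \sum_{i=1}^{n} x_i \, y_i
```

Because both sums run over rows, each segment can add up the terms for its own rows, and only the partial sums need to be combined.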


Linear Regression: Streaming Algorithm

How to compute with a single table scan?

$$\hat{\beta} = (X^T X)^{-1} \, X^T y$$

Quantities to accumulate during the scan: $X^T X$ and $X^T y$


Linear Regression: Parallel Computation

Segment 1: $X_1^T y_1$

Segment 2: $X_2^T y_2$

Master: $X^T y = X_1^T y_1 + X_2^T y_2$


Basic Building Block: User-Defined Aggregates

Input rows:

        x        | y
-----------------+---
 (1,0,3,…,5)     | 3
 (-2,4,5,…,2)    | 2
 …               | …

Aggregation phase 1 on each node (map):

1.  Initialize: $(A, b) = (0, 0)$

2.  Transition for all rows: $(A, b) = (A, b) + (x \cdot x^T, \; x \cdot y)$

3.  Send $(A, b)$

Aggregation phase 2 on master node (reduce):

1.  Merge: $(A, b) = (A, b) + (A, b)$

2.  Finalize: $\hat{\beta} = \mathrm{solve}(A, b) = A^{-1} \cdot b$
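The four aggregate steps (initialize, transition, merge, finalize) can be sketched in plain Python. This is only an illustration of the idea, not MADlib's implementation: the function names `init`, `transition`, `merge`, `finalize`, and `aggregate` are hypothetical, the solver is a textbook Gaussian elimination, and the rows are the six rows of the linear regression example table with a leading 1 added for the intercept c0.

```python
# Hypothetical sketch of a two-phase aggregate for OLS (not MADlib's actual API).

def init(k):
    """Initialize: (A, b) = (0, 0) for k coefficients."""
    return [[0.0] * k for _ in range(k)], [0.0] * k

def transition(state, x, y):
    """Transition: (A, b) += (x x^T, x y) for a single row."""
    A, b = state
    for i in range(len(x)):
        b[i] += x[i] * y
        for j in range(len(x)):
            A[i][j] += x[i] * x[j]
    return A, b

def merge(s1, s2):
    """Merge: add the partial states coming from two segments."""
    (A1, b1), (A2, b2) = s1, s2
    k = len(b1)
    A = [[A1[i][j] + A2[i][j] for j in range(k)] for i in range(k)]
    b = [b1[i] + b2[i] for i in range(k)]
    return A, b

def finalize(state):
    """Finalize: solve A beta = b (Gaussian elimination, partial pivoting)."""
    A, b = state
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= f * M[col][c]
    beta = [0.0] * k
    for i in reversed(range(k)):
        s = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - s) / M[i][i]
    return beta

def aggregate(rows, k):
    """Phase 1 on one segment: a single scan over that segment's rows."""
    state = init(k)
    for x, y in rows:
        state = transition(state, x, y)
    return state

# (x, y) rows from the example table; x = (1, x1, x2).
rows = [
    ([1.0, 0.00, 0.3], 10.14), ([1.0, 0.69, 0.6], 11.93),
    ([1.0, 1.10, 0.9], 13.57), ([1.0, 1.39, 1.2], 14.17),
    ([1.0, 1.61, 1.5], 15.25), ([1.0, 1.79, 1.8], 16.15),
]
segment1, segment2 = rows[:3], rows[3:]  # pretend the table lives on two segments

beta_parallel = finalize(merge(aggregate(segment1, 3), aggregate(segment2, 3)))
beta_single = finalize(aggregate(rows, 3))
```

Splitting the rows across two segments and merging gives the same coefficients as a single scan, up to floating-point rounding, which is the property that makes the aggregate parallelizable.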


Problem solved? … Not Yet

•  Many ML solutions are iterative without analytical formulations

Initialize problem → Perform optimization step → Has converged?

    false: perform another optimization step

    true: Return results


Use a convex optimization framework

 # segments | # variables | # rows (million) | v0.3 (s) | v0.2.1beta (s) | v0.1alpha (s)
------------+-------------+------------------+----------+----------------+---------------
          6 |          10 |               10 |    4.447 |          9.501 |         1.337
          6 |          20 |               10 |    4.688 |          11.60 |         1.874
          6 |          40 |               10 |    6.843 |          17.96 |         3.828
          6 |          80 |               10 |    13.28 |          52.94 |         12.98
          6 |         160 |               10 |    35.66 |          181.4 |         51.20
          6 |         320 |               10 |    186.2 |          683.8 |         333.4
         12 |          10 |               10 |    2.115 |          4.756 |        0.9600
         12 |          20 |               10 |    2.432 |          5.760 |         1.212
         12 |          40 |               10 |    3.420 |          9.010 |         2.046
         12 |          80 |               10 |    6.797 |          26.48 |         6.469
         12 |         160 |               10 |    17.71 |          90.95 |         25.67
         12 |         320 |               10 |    92.41 |          341.5 |         166.6
         18 |          10 |               10 |    1.418 |          3.206 |        0.6197
         18 |          20 |               10 |    1.648 |          3.805 |         1.003
         18 |          40 |               10 |    2.335 |          5.994 |         1.183
         18 |          80 |               10 |    4.461 |          17.73 |         4.314
         18 |         160 |               10 |    11.90 |          60.58 |         17.14
         18 |         320 |               10 |    61.66 |          227.7 |         111.4
         24 |          10 |               10 |    1.197 |          2.383 |        0.3904
         24 |          20 |               10 |    1.276 |          2.869 |        0.4769
         24 |          40 |               10 |    1.698 |          4.475 |         1.151
         24 |          80 |               10 |    3.363 |          13.35 |         3.263
         24 |         160 |               10 |    8.840 |          45.48 |         13.10
         24 |         320 |               10 |    46.18 |          171.7 |         84.59

Figure 4: Linear-regression execution times

… search. In our prototype implementation in MADlib, we picked up one such simple greedy algorithm, called stochastic (or sometimes, “incremental”) gradient descent (SGD) [33, 6], that goes back to the 1960s. SGD is an approximation of gradient methods that is useful when the convex function we are considering, f(x), has the form:

$$f(x) = \sum_{i=1}^{N} f_i(x)$$

If each of the $f_i$ is convex, then so is f [8, pg. 38]. Notice that all problems in Table 2 are of this form: intuitively each of these models is finding some model (i.e., a vector w) that is scored on many different training examples. SGD leverages the above form to construct a rough estimate of the gradient of f using the gradient of a single term: for example, the estimate if we select i is the gradient of $f_i$ (that we denote $G_i(x)$). The resulting algorithm is then described as:

$$x \leftarrow x - \alpha N \cdot G_i(x) \qquad (1)$$

This approximation is guaranteed to converge to an optimal solution [26].

Using the MADlib framework. In our setting, each tuple in the input table for an analysis task encodes a single $f_i$. We use the micro-programming interfaces of Sections 3.2 and 3.3 to perform the mapping from the tuples to the vector representation that is used in Eq. 1. Then, we observe Eq. 1 is simply an expression over each tuple (to compute $G_i(x)$) which is then averaged together. Instead of averaging a single number, we average a vector of numbers. Here, we use the macro-programming provided by MADlib to handle all data access, spills to disk, parallelized scans, etc. Finally, the macro programming layer helps us test for convergence (which is implemented with either a python combination or C driver.) Using this approach, we were able to add in implementations of all the models in Table 2 in a matter of days.

In an upcoming paper we report initial experiments showing that our SGD based approach achieves higher performance than prior data mining tools for some datasets [13].

Figure 6: The Archetypical Convex Function $f(x) = x^2$.

 Application          | Objective
----------------------+----------------------------------------------------------------
 Least Squares        | $\sum_{(u,y) \in \Omega} (x^T u - y)^2$
 Lasso [38]           | $\sum_{(u,y) \in \Omega} (x^T u - y)^2 + \mu \|x\|_1$
 Logistic Regression  | $\sum_{(u,y) \in \Omega} \log(1 + \exp(-y x^T u))$
 Classification (SVM) | $\sum_{(u,y) \in \Omega} (1 - y x^T u)_+$
 Recommendation       | $\sum_{(i,j) \in \Omega} (L_i^T R_j - M_{ij})^2 + \mu \|L, R\|_F^2$
 Labeling (CRF) [40]  | $\sum_k \left[ \sum_j x_j F_j(y_k, z_k) - \log Z(z_k) \right]$

Table 2: Models currently implemented in MADlib using the SGD-based approach.

-  Each step has an analytical formulation that can be performed in parallel
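The SGD update in Eq. (1) can be sketched for the least-squares objective of Table 2, where each tuple (u, y) contributes $f_i(x) = (x^T u - y)^2$ with gradient $G_i(x) = 2 (x^T u - y) u$. This is a minimal single-process sketch, not MADlib code: the names `grad_i` and `sgd` are hypothetical, the parameter `step` stands in for the product αN of Eq. (1), and the data are synthetic and noise-free (generated from an assumed true vector) so that the iteration has an exact target.

```python
import random

def grad_i(x, u, y):
    """Gradient G_i(x) of the single term f_i(x) = (x^T u - y)^2."""
    r = sum(xj * uj for xj, uj in zip(x, u)) - y
    return [2.0 * r * uj for uj in u]

def sgd(data, dim, step=0.1, epochs=200, seed=0):
    """Repeatedly apply x <- x - step * G_i(x), one tuple at a time."""
    rng = random.Random(seed)
    x = [0.0] * dim
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # visit the tuples in random order
        for i in order:
            u, y = data[i]
            g = grad_i(x, u, y)
            x = [xj - step * gj for xj, gj in zip(x, g)]
    return x

# Synthetic, noise-free tuples generated from an assumed true model.
rng = random.Random(1)
true_x = [2.0, -1.0]
data = []
for _ in range(20):
    u = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((u, sum(a * b for a, b in zip(true_x, u))))

x_hat = sgd(data, dim=2)
```

With consistent data and a small enough step, every update shrinks the error along the direction of the current tuple, so `x_hat` ends up close to `true_x`. On noisy data a decaying step size would normally be used instead of a constant one.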

1. Lack of portable multi-pass iterations

•  WITH RECURSIVE not reliable basis for portability

•  User-defined driver functions in Python
   –  Outer loops not performance-critical

•  Compromise: Different user interface

CREATE TEMP TABLE temp

INSERT INTO temp SELECT step(...) FROM ...

SELECT converged(...) FROM temp, ...

    false: run the INSERT step again

    true: SELECT result(...) FROM temp
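The step / converged / result flow can be sketched as a small Python driver. This is an in-memory illustration under stated assumptions, not MADlib's iteration controller: `drive`, `gd_step`, and `small_change` are hypothetical names, the "table" is a Python list, and the optimization step is a full-gradient update for the one-parameter model y ≈ w · x.

```python
def drive(rows, step, converged, state, max_iter=100):
    """Outer loop of the driver; each call to step() scans all rows once."""
    for iteration in range(1, max_iter + 1):
        new_state = step(state, rows)       # plays the role of SELECT step(...)
        if converged(state, new_state):     # plays the role of SELECT converged(...)
            return new_state, iteration     # plays the role of SELECT result(...)
        state = new_state
    return state, max_iter

def gd_step(w, rows, lr=0.05):
    """One optimization step: an aggregate (a sum) over all rows."""
    grad = sum(2.0 * (w * x - y) * x for x, y in rows) / len(rows)
    return w - lr * grad

def small_change(old, new, tol=1e-10):
    """Convergence test on the change between two consecutive states."""
    return abs(new - old) < tol

rows = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # synthetic data with slope 3
w, n_iter = drive(rows, gd_step, small_change, state=0.0)
```

The outer loop runs only a handful of times, which is why it can live in a scripting language, while the per-row work inside each step is what has to run in parallel inside the database.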


Architecture

RDBMS Query Processing (Greenplum, PostgreSQL, Hadoop with SQL)

Low-level Abstraction Layer (matrix operations, C++ to DB type bridge, …)

RDBMS Built-in Functions

User Interface

High-level Abstraction Layer (iteration controller, ...)

Functions for Inner Loops (implements convex optimization)

Python

SQL, generated per specification

C++

3. Lack of language support for linear algebra

•  C++ Abstraction Layer uses Eigen

•  (Dense) Vectors and matrices: DOUBLE PRECISION[]

•  Example:

    AnyType
    solve::run(AnyType& args) {
        MappedMatrix A = args[0].getAs<MappedMatrix>();
        MappedColumnVector b = args[1].getAs<MappedColumnVector>();

        MutableMappedColumnVector x = allocateArray<double>(A.cols());
        x = A.colPivHouseholderQr().solve(b);
        return x;
    }

Performance:

•  No unnecessary copying

•  No internal type conversion

The MADlib Vision

•  Academic and industry contributions

•  Think of “CRAN for databases”
   –  Repository of open-source ML algorithms
   –  This time with data parallelism in mind

•  Open-Source Framework

Eigen, BSD License


Performance trends

•  Disk I/O is not always the bottleneck

•  Performance tuning is essential

•  Overhead for single query very low (fraction of a second)

•  Greenplum achieves nearly perfect speedup

[Figure: OLS on 10 million rows (in seconds); y-axis: 0 to 40 seconds; x-axis: # segments (6, 12, 18, 24); one series per # variables (20, 40, 80, 160)]

•  Overhead for a single row is very low (fraction of a second)

•  Able to achieve close to linear speedup
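The "close to linear speedup" claim can be checked against the v0.3 timings reported in Figure 4 for 10 million rows and 320 variables. The sketch below only does the arithmetic; the variable names are illustrative and the 6-segment run is taken as the baseline.

```python
# v0.3 execution times in seconds (10 million rows, 320 variables), from Figure 4.
times_v03 = {6: 186.2, 12: 92.41, 18: 61.66, 24: 46.18}

base_segments = 6
for segments, seconds in sorted(times_v03.items()):
    speedup = times_v03[base_segments] / seconds   # measured speedup vs. 6 segments
    ideal = segments / base_segments               # perfectly linear speedup
    print(f"{segments:2d} segments: speedup {speedup:.2f} (ideal {ideal:.2f})")

speedup_24 = times_v03[base_segments] / times_v03[24]
```

Going from 6 to 12, 18, and 24 segments gives measured speedups of about 2.01, 3.02, and 4.03, against ideal values of 2, 3, and 4.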


Performance Comparison with Apache Mahout

•  Analytics WorkBench (http://www.gopivotal.com/big-data/analytics-workbench)
   –  1000-node cluster located in Las Vegas
   –  Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
   –  8000+ Map Task Capacity, 5000+ Reduce Task Capacity
   –  Infrastructure: Pivotal HD 1.1

•  Mahout v0.7

•  Test matrix*
   –  Data size
      ▪  KDD Cup 2009 Orange marketing churn data (16.5 GB)
      ▪  Enron data (1.9 GB)
      ▪  Census data 2000 (1.7 GB)
   –  Algorithms: Logistic Regression and K-means
   –  Algorithm parameters (e.g. convergence threshold, # iterations)

Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

* Reporting a subset of results from whitepaper.


Logistic Regression

[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. x-axis: log(Number of Rows), 1,000,000 to 1E+09; y-axis: Time in Minutes, 0 to 700. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]]



Logistic Regression

[Figure: x-axis: log(Number of Rows), 1,000,000 to 1E+09; y-axis: Time in Minutes, 0 to 9]



K-Means

[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. x-axis: log(Number of Rows), 1,000,000 to 1E+09; y-axis: Time in Minutes, 0 to 350]





Random Forest

[Figure: x-axis: log(Number of Rows), 1,000,000 to 1E+09; y-axis: Time in Minutes, 0 to 1600]





Part 1 Summary

MADlib is an easy-to-use library that provides a SQL interface to fast, scalable machine learning algorithms …


But not all Data Scientists speak SQL … Accessing Scalability through R


Why R?

O’Reilly: 2013 Data Science Salary Survey

From the report: “The preponderance of R and Python usage is more surprising … two most commonly used individual tools, even above Excel. R and Python are likely popular because they are easily accessible and effective open source tools.”


Execution in Database

•  All data stays in DB: R objects merely point to DB objects

•  All model estimation and heavy lifting done in DB by MADlib

•  R → SQL translation done in the R client

• Only strings of SQL and model output transferred across DBI

SQL to execute MADlib

Model output
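The idea that client-side objects "merely point to DB objects" and that only SQL strings cross the connection can be sketched as follows. This is a toy model of the concept, not PivotalR's implementation: the classes `DbTable` and `DbColumn` are hypothetical, nothing is sent to a database, and the table name is the one used in the demo script of this talk.

```python
class DbColumn:
    """Hypothetical query object: it stores SQL text, never the data itself."""

    def __init__(self, expr, table):
        self.expr = expr
        self.table = table

    def mean(self):
        # Wrapping the expression builds a new query; nothing is executed here.
        return DbColumn(f"avg({self.expr})", self.table)

    def sql(self):
        return f"select {self.expr} from {self.table}"


class DbTable:
    """Hypothetical client-side reference to a table that lives in the database."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, column):
        return DbColumn(column, self.name)


x = DbTable("madlibtestdata.dt_abalone")
rings = x["rings"]
print(rings.sql())         # select rings from madlibtestdata.dt_abalone
print(rings.mean().sql())  # select avg(rings) from madlibtestdata.dt_abalone
```

Only when a result is explicitly requested would such a string be sent to the database, which is why the client needs very little memory regardless of table size.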

36 © Copyright 2014 Pivotal. All rights reserved.

PivotalR Design Overview

SQL to execute

Computation results

RPostgreSQL

Data lives here

R → SQL

PivotalR

No data here

Database w/ MADlib

•  Call MADlib’s in-DB machine learning functions directly from R

•  Syntax is analogous to native R functions

•  Data doesn’t need to leave the database

•  All heavy lifting, including model estimation & computation, is done in the database

Woo Jung

http://gopivotal.github.io/PivotalR/

Courtesy Woo Jung and Hai Qian


Some of the current features

A wrapper of MADlib

•  Linear regression

•  Logistic regression

•  Elastic Net

•  ARIMA

•  Table summary

•  Categorical variable: as.factor()

•  $ [ [[ $<- [<- [[<-

•  is.na

+ - * / %% %/% ^

•  & | !

•  == != > < >= <=

•  merge

•  by

•  db.data.frame

•  as.db.data.frame

•  preview •  sort

•  c mean sum sd var min max length colMeans colSums

•  db.connect db.disconnect db.list db.objects

db.existsObject delete •  dim •  names

•  content

And more ... (SQL wrapper)

•  predict


# Load the library
library(PivotalR)

# Connect to the database "madlib" on port 14526
db.connect(port = 14526, dbname = "madlib")

# List all the tables in the active connection
db.objects()

# Create an R object that references a table in the database
x <- db.data.frame("madlibtestdata.dt_abalone")

# Report the number of rows and columns in the table
dim(x)

# Column names within the table
names(x)

# Database query object representing "select rings from madlibtestdata.dt_abalone"
x$rings

# Pull 10 rows of data from the table back into the R environment
lookat(x, 10)

# Query object representing "select avg(rings) from madlibtestdata.dt_abalone"
mean(x$rings)

# Execute the query and report back the result
lookat(mean(x$rings))

# Run a linear regression within the database and return a model object
fit <- madlib.lm(rings ~ . - id | sex, data = x)

# Create a query object representing scoring the model in the database
predict(fit, x)

# Query object calculating the mean square error of the model
mean((x$rings - predict(fit, x))^2)

# Add a calculated factor column to the database query object
x$sex <- as.factor(x$sex)

# Calculate a logistic regression model
# (dbbank is a table reference used in the demo; its definition is not shown on the slide)
m0 <- madlib.glm(resp ~ age, family = "binomial", data = dbbank)

# Perform stepwise feature selection
mstep <- step(m0, scope = list(lower = ~age, upper = ~age + factor(marital) + factor(education) + factor(housing) + factor(loan) + factor(job)))

Demonstration


We’re looking for contributors

•  Browse our help pages
   –  Start page: madlib.net
   –  Github pages
      •  github.com/madlib/madlib (SQL)
      •  github.com/gopivotal/pivotalr (R)
      •  github.com/gopivotal/pymadlib (Python)
   –  Use our product and report issues:
      •  jira.madlib.net (Issue tracker)
      •  [email protected] (User forum)

•  Can use PostgreSQL or Greenplum Database Community Edition for installations on multiple platforms


Credits

Leaders and contributors:

Gavin Sherry Caleb Welton Joseph Hellerstein Christopher Ré Zhe Wang Florian Schoppmann

Hai Qian Shengwen Yang Aaron Feng and many others …

Thank you for your attention

Important links:

Product email: [email protected]

Product site: madlib.net

Speaker email: [email protected]
