PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learning library

DESCRIPTION

These are slides from my talk at DataDay Texas in Austin on 30 Mar 2013 (http://2013.datadaytexas.com/schedule).
Favorite and fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net

1© Copyright 2011 EMC Corporation. All rights reserved.

Srivatsan Ramanujam
Senior Data Scientist

Greenplum

Agenda

• Greenplum UAP overview
  – Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
  – GPDB Architecture
• MADlib
  – Overview
  – Algorithms
  – Working Mechanism
  – Performance Comparison with Mahout
• PyMADlib
  – Overview
  – Demo in IPython Notebook
• Future Directions
  – GPHD and HAWQ

Greenplum Overview

Products

Greenplum Database - Architecture

MPP (Massively Parallel Processing) Shared-Nothing Architecture

[Diagram labels: SQL, MapReduce; Master Servers (query planning & dispatch); Network Interconnect; Segment Servers (query processing & data storage); External Sources (loading, streaming, etc.)]

MADlib

MADlib: The Origin

UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.
1 - dude, you got skills.
2 - dude, you got mad skills.

• First mention of MAD analytics was at VLDB’09
  – MAD Skills: New Analysis Practices for Big Data
  – Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
  – http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
• MADlib project initiated in late 2010
  – Maintained by Greenplum/EMC with significant contributions from UW Madison, UFlorida and UC Berkeley.

Current Modules

Data Modeling
  Supervised Learning
  • Naive Bayes Classification
  • Linear Regression
  • Logistic Regression
  • Multinomial Logistic Regression
  • Decision Tree
  • Random Forest
  • Support Vector Machines
  • Cox-Proportional Hazards Regression
  • Conditional Random Field
  Unsupervised Learning
  • Association Rules
  • k-Means Clustering
  • Low-rank Matrix Factorization
  • SVD Matrix Factorization
  • Parallel Latent Dirichlet Allocation

Descriptive Statistics
  Sketch-based Estimators
  • CountMin (Cormode-Muthukrishnan)
  • FM (Flajolet-Martin)
  • MFV (Most Frequent Values)
  Profile
  Quantile

Support
  Array Operations
  Conjugate Gradient
  Sparse Vectors
  Probability Functions
  Random Sampling

Inferential Statistics
  Hypothesis tests

MADlib – User Doc
• Check out the user guide with examples at: http://doc.madlib.net

How does it work? A Linear Regression Example
• Finding linear dependencies between variables
  – y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

(Annotations: columns x1, x2 form the design matrix X; column y is the vector of dependent variables y.)

Reminder: Linear-Regression Model
• If residuals i.i.d. Gaussians with standard deviation σ:
  – max likelihood ⇔ min sum of squared residuals
• First-order conditions for the following quadratic objective (in c), $\lVert y - Xc \rVert^2$, yield the minimizer $(X^T X)^{-1} X^T y$
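The step from the quadratic objective to the minimizer can be written out as below. This is the standard least-squares derivation, added as a sketch: it assumes $X^T X$ is invertible, and $f(c)$ is a label introduced only for this illustration.

```latex
\begin{aligned}
f(c) &= \lVert y - Xc \rVert^2 = (y - Xc)^T (y - Xc) \\
\nabla_c f(c) &= -2\, X^T (y - Xc) = 0 \\
X^T X\, c &= X^T y \\
c &= (X^T X)^{-1} X^T y
\end{aligned}
```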

Linear Regression: Streaming Algorithm
• How to compute $(X^T X)^{-1} X^T y$ with a single table scan?

[Diagram: the products $X^T X$ and $X^T y$ are formed, then combined as $(X^T X)^{-1} X^T y$]
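One way to picture the single-scan idea is the pure-Python sketch below. It is an illustration only, not MADlib's implementation: the function names (`new_state`, `transition`, `solve`, `linregr_stream`) are hypothetical, and it assumes each row contributes its outer product to a running X^T X and its scaled features to a running X^T y, with a constant 1 prepended for the intercept c0.

```python
# Sketch only: single-pass least squares in pure Python (not MADlib's code).

def new_state(k):
    """Empty accumulator: k x k matrix for X^T X and length-k vector for X^T y."""
    return [[0.0] * k for _ in range(k)], [0.0] * k

def transition(state, x, y):
    """Fold one row into the running sums; x already includes the leading 1."""
    xtx, xty = state
    for i, xi in enumerate(x):
        xty[i] += xi * y
        for j, xj in enumerate(x):
            xtx[i][j] += xi * xj
    return state

def solve(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    c_hat = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * c_hat[j] for j in range(i + 1, n))
        c_hat[i] = (m[i][n] - s) / m[i][i]
    return c_hat

def linregr_stream(rows):
    """rows yields (y, x1, x2, ...); returns [c0, c1, c2, ...] after one pass."""
    state = None
    for y, *xs in rows:
        x = [1.0, *xs]  # constant term for the intercept c0
        if state is None:
            state = new_state(len(x))
        transition(state, x, y)
    if state is None:
        raise ValueError("no rows to fit")
    return solve(*state)

# The six rows shown in the example query on table "unm".
rows = [
    (10.14, 0.00, 0.3), (11.93, 0.69, 0.6), (13.57, 1.10, 0.9),
    (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8),
]
coef = linregr_stream(rows)
print(coef)  # [c0, c1, c2]
```

The accumulator size depends only on the number of features, not on the number of rows, which is what makes one scan sufficient.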

Linear Regression: Parallel Computation

$X^T y = X_1^T y_1 + X_2^T y_2$

[Diagram: Segment 1 computes $X_1^T y_1$, Segment 2 computes $X_2^T y_2$; the Master combines them into $X^T y$]
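The per-segment decomposition can be sketched in pure Python as follows. This is an illustration under assumptions, not MADlib's code: `partial_sums` and `merge` are hypothetical names, the six example rows are split arbitrarily across two "segments", and the same additive merge is assumed for X^T X as for X^T y.

```python
# Sketch only: per-segment partial sums merged on the master.

def partial_sums(rows):
    """One segment's local X^T X and X^T y over its own rows (y, x1, x2, ...)."""
    xtx, xty = None, None
    for y, *xs in rows:
        x = [1.0, *xs]  # constant term for the intercept
        if xtx is None:
            xtx = [[0.0] * len(x) for _ in x]
            xty = [0.0] * len(x)
        for i, xi in enumerate(x):
            xty[i] += xi * y
            for j, xj in enumerate(x):
                xtx[i][j] += xi * xj
    return xtx, xty

def merge(a, b):
    """Master-side merge: element-wise addition of two partial states."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty

segment_1 = [(10.14, 0.00, 0.3), (11.93, 0.69, 0.6), (13.57, 1.10, 0.9)]
segment_2 = [(14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8)]

merged = merge(partial_sums(segment_1), partial_sums(segment_2))
single_scan = partial_sums(segment_1 + segment_2)
# merged and single_scan agree up to floating-point rounding.
```

Because the merge is plain addition, segments never need to exchange rows; only the small accumulated matrices travel to the master.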

Performance Comparison: Test Setup on AWB

• AWB
  – 1000-node cluster located in Las Vegas
  – Over 24,000 processors, 48 TB of Memory, and 24 PB of raw disk storage
  – 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
  – GPHD 1.1, GPDB 4.2.3
• Mahout v0.7
• MADlib v0.5
  – With small LMF change to allow 4-byte integer values
• Test matrix
  – Data size (# rows/records, # columns/features)
  – Algorithms
  – Algorithm parameters (e.g. convergence threshold, # iterations)
  – GPDB segment / MR (Map-Reduce) task configurations

Performance & Scalability Results (summary)

• Whitepaper coming out shortly!

Logistic Regression
• Mahout only has sequential (i.e. single node) IGD implementation

[Figure: "MADlib & Mahout Logistic Regression Scalability Across Number of Attributes". Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes.]

Logistic Regression

[Figure: "MADlib Scalability Across Number of GPDB Segments". x-axis: Number of GPDB Segments; y-axis: Time in Minutes.]

K-Means Clustering

[Figure: "MADlib & Mahout K-means Scalability Across Number of Rows". Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes.]

K-Means Clustering

[Figure: "MADlib K-means Scalability Across Number of GPDB Segments". x-axis: Number of GPDB Segments; y-axis: Time in Minutes.]

PyMADlib : Python + MADlib = Awesome!

Motivation

• SQL is undeniably the most straightforward way to query data

• But not necessarily designed for data science

• SQL is great for many things, but it’s not nearly enough

MADlib is a godsend!

• So why do we need anything else?
  – UI is still all in SQL
  – Need to tap into rich visualization libraries

• Empowers data scientists to run canned machine learning routines – focus less on coding, more on science

• In-database, explicitly parallel.

Then which interface is favored by and familiar to data scientists?

• Depends on who you ask

• Left survey is for “higher level languages,” and right survey is for “lower level languages”

Wait, don’t we already have this (PL/R, PL/Python, SAS HPA)?

• PL/X’s are wonderful, but:
  – It still requires non-trivial knowledge of SQL to use effectively
  – Mostly limited to explicitly parallel jobs
  – Primarily a SQL interface to the end user

• Need an interface that is:
  – Less SQL, more R/Python/SAS
  – Implicitly parallelized
  – More scalable

• SAS HPA = $$$$$

The challenge

• MADlib
  – Open source
  – Extremely powerful/scalable
  – Growing algorithm breadth
  – SQL

• Python/R
  – Open source
  – Memory limited
  – High algorithm breadth
  – Language/interface purpose-designed for data science

• SAS
  – High user loyalty
  – Non-HPA is memory limited, HPA requires investment
  – High algorithm breadth
  – Language/interface purpose-designed for data science

• Want to leverage both the performance benefits of MADlib and the usability of languages like Python, SAS, and R

Simple solution: Translate Python code into SQL

• All data stays in DB and all model estimation and heavy lifting done in DB by MADlib

• Only strings of SQL and model output transferred across ODBC/JDBC

• Best of both worlds: number crunching power of MADlib along with rich set of visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy-lifting on your Greenplum/PostgreSQL database, while you program in your favorite language – Python.

[Diagram: Python sends SQL to execute MADlib to the SQL database over ODBC/JDBC; model output is returned to Python]
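The translation idea can be sketched as a small function that turns a Python call into a SQL string. This is not PyMADlib's actual API: `linregr_sql` is a hypothetical helper, and the aggregate-style call `madlib.linregr(y, ARRAY[...])` is an assumption about the MADlib SQL interface of that era. Only the returned string (and later the model output) would cross the database connection.

```python
def linregr_sql(table, dep_col, indep_cols, schema="madlib"):
    """Build the SQL text for a linear-regression call (hypothetical helper)."""
    # Reject anything that is not a plain identifier, since names are
    # interpolated directly into the SQL string.
    for name in (table, dep_col, schema, *indep_cols):
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unsafe identifier: {name!r}")
    features = ", ".join(["1", *indep_cols])  # leading 1 is the intercept term
    return (
        f"SELECT ({schema}.linregr({dep_col}, ARRAY[{features}])).* "
        f"FROM {table};"
    )

sql = linregr_sql("unm", "y", ["x1", "x2"])
print(sql)
# SELECT (madlib.linregr(y, ARRAY[1, x1, x2])).* FROM unm;
```

In practice the string would be handed to a database cursor, and the single result row (coefficients and fit statistics) would come back to Python for plotting.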

Demo

PyMADlib Tutorial – IPython Notebook Viewer Link

http://nbviewer.ipython.org/5275846

Where do I get it?

$ pip install pymadlib

I don’t have GPDB or MADlib – What do I do?

• Greenplum Database Community Edition is freely available for single node installations on multiple platforms
  – Written permission may be requested from EMC/Greenplum for research use for multi-node installations
• MADlib is free and open-source
  – Downloadable for multiple platforms from https://github.com/madlib/madlib
• PyMADlib is also free and open-source
  – Downloadable from https://github.com/vatsan/pymadlib

Future Directions

Greenplum HD

• HAWQ – Parallel SQL query engine that combines the key technological advantages of industry-leading Greenplum Database with scalability and convenience of Hadoop

• SQL Standards Compliant
  – Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes + range of scalar and aggregate functions

• ACID Compliant

HAWQ – Architecture

Performance: HAWQ [1] vs. Hive vs. Impala [2]

All experiments were run on a 60 node deployment with Analytics Workbench [3]

[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/

HAWQ: Deep Scalable Analytics
What’s inside the box?

• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation

• Users can connect to HAWQ via popular programming languages and it also supports JDBC and ODBC.
• Most tools will work out of the box with HAWQ, including PyMADlib

Questions?

@being_bayesian
vatsan.cs@utexas.edu

https://github.com/vatsan/pymadlib

Appendix

Datasets

The following datasets were used in comparing the performance of MADlib with Mahout:

– KDD Cup 2009 Orange marketing churn data (16.5 MB)
  • About 500,000 records and 15,000 numerical and categorical attributes
– Census 2000 data (1.7 GB)
  • About 14 million records and 48 numerical and categorical attributes
– Enron data (1.9 GB)
  • About 700,000 documents with a vocabulary size of 200,000
– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
  • About 1 million users, 600,000 songs, and 250 million ratings
– Netflix Prize 2009 data (52.7 MB)
  • About 400,000 users, 900 movies, and 4.5 million ratings
