© Copyright 2011 EMC Corporation. All rights reserved. Srivatsan Ramanujam, Senior Data Scientist, Greenplum

PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learning library

DESCRIPTION

These are slides from my talk at DataDay Texas in Austin on 30 Mar 2013 (http://2013.datadaytexas.com/schedule)
Favorite and fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net

Agenda

• Greenplum UAP overview
– Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
– GPDB Architecture

• MADlib
– Overview
– Algorithms
– Working Mechanism
– Performance Comparison with Mahout

• PyMADlib
– Overview
– Demo in IPython Notebook

• Future Directions
– GPHD and HAWQ

Greenplum Overview

Products

Greenplum Database - Architecture

MPP (Massively Parallel Processing) Shared-Nothing Architecture

• Master Servers: query planning & dispatch
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.
• Query interfaces: SQL, MapReduce

MADlib

MADlib: The Origin

UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.

• First mention of MAD analytics was at VLDB’09
– MAD Skills: New Analysis Practices for Big Data
– Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
– http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

• MADlib project initiated in late 2010
– Maintained by Greenplum/EMC with significant contributions from UW Madison, UFlorida and UC Berkeley.

Current Modules

Data Modeling
• Supervised Learning
– Naive Bayes Classification
– Linear Regression
– Logistic Regression
– Multinomial Logistic Regression
– Decision Tree
– Random Forest
– Support Vector Machines
– Cox-Proportional Hazards Regression
– Conditional Random Field
• Unsupervised Learning
– Association Rules
– k-Means Clustering
– Low-rank Matrix Factorization
– SVD Matrix Factorization
– Parallel Latent Dirichlet Allocation

Descriptive Statistics
• Sketch-based Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)
• Profile
• Quantile

Support
• Array Operations
• Conjugate Gradient
• Sparse Vectors
• Probability Functions
• Random Sampling

Inferential Statistics
• Hypothesis tests

MADlib – User Doc

• Check out the user guide with examples at: http://doc.madlib.net

How does it work? A Linear Regression Example

• Finding linear dependencies between variables
– y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

• Columns x1, x2: design matrix X
• Column y: vector of dependent variables y

Reminder: Linear-Regression Model

• If residuals i.i.d. Gaussians with standard deviation σ:
– max likelihood ⇔ min sum of squared residuals

• First-order conditions for the following quadratic objective (in c) yield the minimizer
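
The equations on this slide did not survive extraction. A reconstruction under the standard ordinary-least-squares setup, consistent with the XᵀX and Xᵀy quantities used on the next slides, might read as follows. The symbols ε, f and ĉ are labels chosen here and are not taken from the slide.

```latex
% Model: design matrix X, coefficient vector c, residuals \varepsilon
y = X c + \varepsilon, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.}

% Quadratic objective in c (sum of squared residuals)
f(c) = \lVert y - X c \rVert_2^2 = \sum_i \bigl( y_i - x_i^T c \bigr)^2

% First-order condition and resulting minimizer (X^T X assumed invertible)
\nabla f(c) = -2 X^T (y - X c) = 0
\quad\Longrightarrow\quad
\hat{c} = (X^T X)^{-1} X^T y
```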

Linear Regression: Streaming Algorithm

• How to compute (XᵀX)⁻¹ Xᵀy with a single table scan?

[Figure: XᵀX is formed from Xᵀ and X, Xᵀy from Xᵀ and y; the result is (XᵀX)⁻¹ Xᵀy]
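
A minimal sketch of the single-scan idea, assuming NumPy and using the six rows from the "How does it work?" slide. It illustrates the technique only and is not MADlib's implementation; the function name `stream_fit` is made up for this example.

```python
import numpy as np

# The six rows shown in the earlier example: (y, x1, x2).
rows = [
    (10.14, 0.00, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.10, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]

def stream_fit(rows):
    """Accumulate X^T X and X^T y one row at a time, then solve for c."""
    k = 3  # intercept + x1 + x2
    xtx = np.zeros((k, k))
    xty = np.zeros(k)
    for y, x1, x2 in rows:           # a single pass over the table
        x = np.array([1.0, x1, x2])  # leading 1.0 carries the intercept c0
        xtx += np.outer(x, x)        # X^T X is the sum of x_i x_i^T
        xty += x * y                 # X^T y is the sum of x_i y_i
    # Solve (X^T X) c = X^T y rather than forming the inverse explicitly.
    return np.linalg.solve(xtx, xty)

coef = stream_fit(rows)
```

Only the k×k matrix and the length-k vector are kept in memory, so the state does not grow with the number of rows.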

Linear Regression: Parallel Computation

[Figure: Segment 1 computes X₁ᵀy₁ and Segment 2 computes X₂ᵀy₂; the master adds them to obtain Xᵀy = X₁ᵀy₁ + X₂ᵀy₂]
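
The segment/master split can be sketched in a few lines, assuming NumPy and a hypothetical assignment of the six example rows to two segments. The names `segment_partial` and `master_merge` are invented for this illustration; the real work happens inside the database.

```python
import numpy as np

def segment_partial(rows):
    """Per-segment work: local X_j^T X_j and X_j^T y_j from local rows only."""
    xtx = np.zeros((3, 3))
    xty = np.zeros(3)
    for y, x1, x2 in rows:
        x = np.array([1.0, x1, x2])  # leading 1.0 is the intercept term
        xtx += np.outer(x, x)
        xty += x * y
    return xtx, xty

def master_merge(partials):
    """Master work: add the per-segment partials, then solve for c."""
    xtx = sum(p[0] for p in partials)
    xty = sum(p[1] for p in partials)
    return np.linalg.solve(xtx, xty)

# Hypothetical distribution of the example rows across two segments.
segment_1 = [(10.14, 0.00, 0.3), (11.93, 0.69, 0.6), (13.57, 1.10, 0.9)]
segment_2 = [(14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8)]

coef = master_merge([segment_partial(segment_1), segment_partial(segment_2)])
```

Because the partial results are plain sums, the answer does not depend on how rows are distributed, and only small matrices travel from the segments to the master.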

Performance Comparison: Test Setup on AWB

• AWB
– 1000-node cluster located in Las Vegas
– Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
– 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
– GPHD 1.1, GPDB 4.2.3

• Mahout v0.7

• MADlib v0.5
– With small LMF change to allow 4-byte integer values

• Test matrix
– Data size (# rows/records, # columns/features)
– Algorithms
– Algorithm parameters (e.g. convergence threshold, # iterations)
– GPDB segment / MR (Map-Reduce) task configurations

Performance & Scalability Results (summary)

• Whitepaper coming out shortly!

Logistic Regression

• Mahout only has sequential (i.e. single node) IGD implementation

[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes]
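
IGD here presumably stands for incremental gradient descent, where the coefficients are updated after every row. A minimal single-node sketch in plain Python on made-up toy data shows why a straightforward implementation is sequential: each update depends on the previous one. This is neither Mahout's nor MADlib's code, and the step size and number of passes are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def igd_logistic(rows, passes=200, step=0.1):
    """Incremental gradient descent: update the coefficients after every row."""
    coef = [0.0] * len(rows[0][0])
    for _ in range(passes):
        for x, label in rows:  # rows are visited strictly in sequence
            p = sigmoid(sum(c * xi for c, xi in zip(coef, x)))
            # Gradient of the log-loss for one row is (p - label) * x.
            coef = [c - step * (p - label) * xi for c, xi in zip(coef, x)]
    return coef

# Toy data: (features with a leading 1.0 for the intercept, label in {0, 1}).
toy = [([1.0, 0.5], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 3.5], 1)]

coef = igd_logistic(toy)
predictions = [
    int(sigmoid(sum(c * xi for c, xi in zip(coef, x))) > 0.5) for x, _ in toy
]
```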

Logistic Regression

[Figure: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes]

K-Means Clustering

[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes]

K-Means Clustering

[Figure: MADlib K-means Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes]

PyMADlib : Python + MADlib = Awesome!

Motivation

• SQL is undeniably the most straightforward way to query data

• But not necessarily designed for data science

• SQL is great for many things, but it’s not nearly enough

MADlib is a godsend!

• So why do we need anything else?
– UI is still all in SQL
– Need to tap into rich visualization libraries

• Empowers data scientists to run canned machine learning routines – focus less on coding, more on science

• In-database, explicitly parallel.

Then which interface is favored by and familiar to data scientists?

• Depends on who you ask

• Left survey is for “higher level languages,” and right survey is for “lower level languages”

Wait, don’t we already have this (PL/R, PL/Python, SAS HPA)?

• PL/X’s are wonderful, but:
– It still requires non-trivial knowledge of SQL to use effectively
– Mostly limited to explicitly parallel jobs
– Primarily a SQL interface to the end user

• Need an interface that is:
– Less SQL, more R/Python/SAS
– Implicitly parallelized
– More scalable

• SAS HPA = $$$$$

The challenge

• MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL

• Python/R
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science

• SAS
– High user loyalty
– Non-HPA is memory limited, HPA requires investment
– High algorithm breadth
– Language/interface purpose-designed for data science

• Want to leverage both the performance benefits of MADlib and the usability of languages like Python, SAS, and R

Simple solution: Translate Python code into SQL

• All data stays in DB and all model estimation and heavy lifting done in DB by MADlib

• Only strings of SQL and model output transferred across ODBC/JDBC

• Best of both worlds: the number-crunching power of MADlib along with the rich set of visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language – Python.

[Figure: Python sends SQL to execute MADlib over ODBC/JDBC; the database returns the model output]
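
The translation step can be sketched with a small helper that turns Python arguments into a SQL string. The helper `linregr_sql` is hypothetical and is not PyMADlib's API. The `madlib.linregr(y, ARRAY[...])` aggregate form is an assumption based on MADlib releases of that era, so check the MADlib documentation for the version you run.

```python
def linregr_sql(table, dep_var, indep_vars, schema="madlib"):
    """Build the SQL string that asks the database to fit the model.

    Hypothetical helper for illustration; not PyMADlib's API.
    """
    # A constant 1 is prepended so the first coefficient is the intercept.
    features = ", ".join(["1"] + list(indep_vars))
    return (
        f"SELECT ({schema}.linregr({dep_var}, ARRAY[{features}])).* "
        f"FROM {table};"
    )

sql = linregr_sql("unm", "y", ["x1", "x2"])
print(sql)  # SELECT (madlib.linregr(y, ARRAY[1, x1, x2])).* FROM unm;
```

Only this string goes to the database and only the fitted coefficients and statistics come back, so the table itself never leaves the database. Table and column names are interpolated directly into the string, so a sketch like this should only be fed trusted identifiers.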

Demo

PyMADlib Tutorial – IPython Notebook Viewer Link

http://nbviewer.ipython.org/5275846

Where do I get it ?

$ pip install pymadlib

I don’t have GPDB or MADlib – What do I do ?

• Greenplum Database Community Edition is freely available for single node installations on multiple platforms
– Written permission may be requested from EMC/Greenplum for research use for multi-node installations

• MADlib is free and open-source
– Downloadable for multiple platforms from https://github.com/madlib/madlib

• PyMADlib is also free and open-source
– Downloadable from https://github.com/vatsan/pymadlib

Future Directions

Greenplum HD

• HAWQ – Parallel SQL query engine that combines the key technological advantages of industry-leading Greenplum Database with scalability and convenience of Hadoop

• SQL Standards Compliant
– Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes + range of scalar and aggregate functions

• ACID Compliant

HAWQ – Architecture

Performance: HAWQ¹ vs. Hive vs. Impala²

All experiments were run on a 60-node deployment with Analytics Workbench³

¹ http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
² https://github.com/cloudera/impala/
³ http://www.analyticsworkbench.com/

HAWQ: Deep Scalable Analytics

What’s inside the box?

• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation

• Users can connect to HAWQ via popular programming languages, and it also supports JDBC and ODBC.

• Most tools will work out of the box with HAWQ, including PyMADlib

Questions?

@[email protected]

https://github.com/vatsan/pymadlib

Appendix

Datasets

The following datasets were used in comparing the performance of MADlib with Mahout:

– KDD Cup 2009 Orange marketing churn data (16.5 MB)
• About 500,000 records and 15,000 numerical and categorical attributes

– Census 2000 data (1.7 GB)
• About 14 million records and 48 numerical and categorical attributes

– Enron data (1.9 GB)
• About 700,000 documents with a vocabulary size of 200,000

– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
• About 1 million users, 600,000 songs, and 250 million ratings

– Netflix Prize 2009 data (52.7 MB)
• About 400,000 users, 900 movies, and 4.5 million ratings