PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learning library

Srivatsan Ramanujam, Senior Data Scientist

Greenplum

Agenda

• Greenplum UAP overview
– Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
– GPDB Architecture

• MADlib
– Overview
– Algorithms
– Working Mechanism
– Performance Comparison with Mahout

• PyMADlib
– Overview
– Demo in IPython Notebook

• Future Directions
– GPHD and HAWQ

Greenplum Overview

Products

MPP (Massively Parallel Processing) Shared-Nothing Architecture

[Figure: architecture diagram]

• Master Servers: query planning & dispatch

• Network Interconnect

• Segment Servers: query processing & data storage

• External Sources: loading, streaming, etc.

• MapReduce

Greenplum Database - Architecture

MADlib

MADlib: The Origin

UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.

• First mention of MAD analytics was at VLDB’09
– MAD Skills: New Analysis Practices for Big Data
– Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
– http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

• MADlib project initiated in late 2010
– Maintained by Greenplum/EMC with significant contributions from UW Madison, UFlorida and UC Berkeley.

Current Modules

Data Modeling

Supervised Learning
• Naive Bayes Classification
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Decision Tree
• Random Forest
• Support Vector Machines
• Cox-Proportional Hazards Regression
• Conditional Random Field

Unsupervised Learning
• Association Rules
• k-Means Clustering
• Low-rank Matrix Factorization
• SVD Matrix Factorization
• Parallel Latent Dirichlet Allocation

Descriptive Statistics

Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)

Profile

Quantile

Support

Array Operations

Conjugate Gradient

Sparse Vectors

Probability Functions

Random Sampling

Inferential Statistics

Hypothesis tests

MADlib – User Doc

• Check out the user guide with examples at: http://doc.madlib.net

How does it work? A Linear Regression Example

• Finding linear dependencies between variables
– y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

Column y: vector of dependent variables y. Columns x1, x2: design matrix X.

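Reminder: Linear-Regression Model

• Model: $y_i = c^\top x_i + \varepsilon_i$, with $x_i = (1, x_{i1}, x_{i2})^\top$ and $c = (c_0, c_1, c_2)^\top$

• If residuals i.i.d. Gaussians with standard deviation σ:
– max likelihood ⇔ min sum of squared residuals

• First-order conditions for the following quadratic objective (in c)

$$\sum_i (y_i - c^\top x_i)^2 = \lVert y - Xc \rVert_2^2$$

yield the minimizer

$$\hat{c} = (X^\top X)^{-1} X^\top y$$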

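Linear Regression: Streaming Algorithm

• How to compute with a single table scan?

$$X^\top X = \sum_i x_i x_i^\top, \qquad X^\top y = \sum_i x_i y_i$$

(x_i: i-th row of the design matrix X, y_i: i-th entry of y)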
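The single-table-scan idea can be sketched in plain Python: each row updates running sums for X^T X and X^T y, and the coefficients are solved for once at the end. This is only an illustration under stated assumptions, not MADlib's implementation: the rows are the six shown in the example query output, a constant 1 is prepended to each row for the intercept c0, and `accumulate` and `solve` are hypothetical helper names.

```python
# One-pass (streaming) least squares: a sketch, not MADlib's actual code.
# Rows are the six (y, x1, x2) tuples shown in the example query output.
ROWS = [
    (10.14, 0.00, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.10, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]


def accumulate(rows):
    """Scan rows once, accumulating X^T X and X^T y (x_i = [1, x1, x2])."""
    k = 3
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for y, x1, x2 in rows:
        x = (1.0, x1, x2)  # leading 1 carries the intercept c0
        for a in range(k):
            xty[a] += x[a] * y
            for b in range(k):
                xtx[a][b] += x[a] * x[b]
    return xtx, xty


def solve(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for j in range(col, n + 1):
                m[r][j] -= factor * m[col][j]
    c = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (m[i][n] - tail) / m[i][i]
    return c


xtx, xty = accumulate(ROWS)
coef = solve(xtx, xty)  # [c0, c1, c2]
print(coef)
```

The memory needed during the scan depends only on the number of columns (a 3×3 matrix and a 3-vector here), not on the number of rows, which is what makes the one-pass formulation attractive for large tables.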

Linear Regression: Parallel Computation

[Figure: Segment 1 and Segment 2 feed into the Master, which holds the combined $X^\top y$]
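Because X^T X and X^T y are plain sums over rows, partial sums computed on separate segments can be added together on the master. The sketch below illustrates that merge step in plain Python; it assumes each segment holds a disjoint slice of the example table, and `partial_sums` and `merge` are hypothetical names, not MADlib functions.

```python
# Partial sums per segment, merged on the master: a sketch under the
# assumption that each segment holds a disjoint slice of the table's rows.

def partial_sums(rows):
    """Segment-side step: X^T X and X^T y over this segment's rows only."""
    k = 3
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for y, x1, x2 in rows:
        x = (1.0, x1, x2)  # leading 1 carries the intercept
        for a in range(k):
            xty[a] += x[a] * y
            for b in range(k):
                xtx[a][b] += x[a] * x[b]
    return xtx, xty


def merge(left, right):
    """Master-side step: element-wise sum of two segments' partial results."""
    (lxx, lxy), (rxx, rxy) = left, right
    xtx = [[p + q for p, q in zip(lr, rr)] for lr, rr in zip(lxx, rxx)]
    xty = [p + q for p, q in zip(lxy, rxy)]
    return xtx, xty


segment_1 = [(10.14, 0.00, 0.3), (11.93, 0.69, 0.6), (13.57, 1.10, 0.9)]
segment_2 = [(14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8)]

# Merging the two segments' results matches a scan over the whole table
# (up to floating-point rounding, since the addition order differs).
merged = merge(partial_sums(segment_1), partial_sums(segment_2))
whole = partial_sums(segment_1 + segment_2)
```

Since the merge is just addition, it does not matter how rows are distributed across segments or in which order the partial results arrive at the master.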

Performance Comparison: Test Setup on AWB

• AWB (Analytics Workbench)
– 1000-node cluster located in Las Vegas
– Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
– 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
– GPHD 1.1, GPDB 4.2.3

• Mahout v0.7

• MADlib v0.5
– With small LMF change to allow 4-byte integer values

• Test matrix
– Data size (# rows/records, # columns/features)
– Algorithms
– Algorithm parameters (e.g. convergence threshold, # iterations)
– GPDB segment / MR (Map-Reduce) task configurations

Performance & Scalability Results (summary)

• Whitepaper coming out shortly!

Logistic Regression

• Mahout only has sequential (i.e. single node) IGD implementation

[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows)]

Logistic Regression

[Figure: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments]

K-Means Clustering

[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows)]

K-Means Clustering

[Figure: MADlib K-means Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments]

PyMADlib : Python + MADlib = Awesome!

Motivation

• Undeniably the most straightforward way to query data

• But not necessarily designed for data science

• SQL is great for many things, but it’s not nearly enough

MADlib is a godsend!

• So why do we need anything else?
– UI is still all in SQL
– Need to tap into rich visualization libraries

• Empowers data scientists to run canned machine learning routines – focus less on coding, more on science

• In-database, explicitly parallel.

Then which interface is favored by and familiar to data scientists?

• Depends on who you ask

• Left survey is for “higher level languages,” and right survey is for “lower level languages”

Wait, don’t we already have this (PL/R, PL/Python, SAS HPA)?

• PL/X’s are wonderful, but:
– It still requires non-trivial knowledge of SQL to use effectively
– Mostly limited to explicitly parallel jobs
– Primarily a SQL interface to the end user

• Need an interface that is:
– Less SQL, more R/Python/SAS
– Implicitly parallelized
– More scalable

• SAS HPA = $$$$$

The challenge

• MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL

• Python/R
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science

• SAS
– High user loyalty
– Non-HPA is memory limited, HPA requires investment
– High algorithm breadth
– Language/interface purpose-designed for data science

• Want to leverage both the performance benefits of MADlib and the usability of languages like Python, SAS, and R

Simple solution: Translate Python code into SQL

• All data stays in DB and all model estimation and heavy lifting done in DB by MADlib

• Only strings of SQL and model output transferred across ODBC/JDBC

• Best of both worlds: the number-crunching power of MADlib along with the rich set of visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language – Python.

[Figure: Python sends SQL to execute MADlib to the database over ODBC/JDBC; model output is returned to Python]
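The translate-to-SQL idea can be illustrated with a small string-building helper: the Python side only assembles a SQL statement, and the database does the fitting. This is a minimal sketch, not PyMADlib's actual API. `build_linregr_sql` is a hypothetical name, and the `madlib.linregr(y, ARRAY[...])` aggregate call shape is an assumption based on MADlib releases of that period that should be checked against the MADlib user documentation.

```python
def build_linregr_sql(table, dependent, independents, schema="madlib"):
    """Return a SQL string that asks MADlib to fit a linear regression.

    Hypothetical helper for illustration; not PyMADlib's actual API.
    The linregr aggregate call shape is an assumption (see lead-in).
    """
    # The constant 1 adds the intercept term c0 to the design matrix.
    features = ", ".join(["1"] + list(independents))
    return (
        f"SELECT ({schema}.linregr({dependent}, ARRAY[{features}])).* "
        f"FROM {table};"
    )


# Only this string would travel to the database; the data never leaves it.
sql = build_linregr_sql("unm", "y", ["x1", "x2"])
print(sql)
```

For the example table this prints `SELECT (madlib.linregr(y, ARRAY[1, x1, x2])).* FROM unm;`. The helper interpolates identifiers directly into the statement, so real code would need to validate or quote table and column names before sending the string to the database.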

PyMADlib Tutorial – IPython Notebook Viewer Link

http://nbviewer.ipython.org/5275846

Where do I get it?

$ pip install pymadlib

I don’t have GPDB or MADlib – What do I do?

• Greenplum Database Community Edition is freely available for single node installations on multiple platforms
– Written permission may be requested from EMC/Greenplum for research use for multi-node installations

• MADlib is free and open-source
– Downloadable for multiple platforms from https://github.com/madlib/madlib

• PyMADlib is also free and open-source
– Downloadable from https://github.com/vatsan/pymadlib

Future Directions

Greenplum HD

• HAWQ – Parallel SQL query engine that combines the key technological advantages of industry-leading Greenplum Database with scalability and convenience of Hadoop

• SQL Standards Compliant
– Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes + range of scalar and aggregate functions

• ACID Compliant

HAWQ – Architecture

Performance: HAWQ [1] vs. Hive vs. Impala [2]

All experiments were run on a 60 node deployment with Analytics Workbench [3]

[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/

• Linear Regression

• Logistic Regression

• Multinomial Logistic Regression

• K-Means

• Association Rules

• Latent Dirichlet Allocation

HAWQ: Deep Scalable Analytics

What’s inside the box?

• Users can connect to HAWQ via popular programming languages and it also supports JDBC and ODBC.

• Most tools will work out of the box with HAWQ, including PyMADlib

Questions?

@being_bayesian

vatsan.cs@utexas.edu

https://github.com/vatsan/pymadlib

Appendix

Datasets

The following datasets were used in comparing the performance of MADlib with Mahout:

– KDD Cup 2009 Orange marketing churn data (16.5 MB)
• About 500,000 records and 15,000 numerical and categorical attributes

– Census 2000 data (1.7 GB)
• About 14 million records and 48 numerical and categorical attributes

– Enron data (1.9 GB)
• About 700,000 documents with a vocabulary size of 200,000

– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
• About 1 million users, 600,000 songs, and 250 million ratings

– Netflix Prize 2009 data (52.7 MB)
• About 400,000 users, 900 movies, and 4.5 million ratings
