Bring Your Code to Your Data Ian Huston @ianhuston

Bring Your Code to Your Data - datapark.io – data ...datapark.io/docs/bringing_code_to_data.pdf · BIG DATA! In-Database Computing. In-Database Distributed Computing. PostgreSQL



Bring Your Code to Your Data

Ian Huston @ianhuston


High Performance Computing


HPC BIG DATA  


In-Database Computing


In-Database Distributed Computing


PostgreSQL

[Diagram: four PostgreSQL instances]


1.  SQL for analytics
2.  Packaged libraries
3.  In-Database Python


1.  SQL for analytics


SQL is more than SELECT
• Window Functions
• WITH queries (Common Table Expressions)
• User Defined Aggregations
• User Defined (SQL) Functions

Examples:
• Time Series using Windowing: http://blog.pivotal.io/author/caleb-welton
• Heroku's Postgres bits: http://postgres-bits.herokuapp.com


Get rows in 6th decile of scores

    WITH decile AS (                              -- CTE
        SELECT *,
               ntile(10) OVER (ORDER BY score)    -- Window Function
        FROM mytable
    )
    SELECT *                                      -- Normal Query
    FROM decile
    WHERE ntile = 6;

http://postgres-bits.herokuapp.com/#60
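For readers more at home in Python than SQL, the bucketing that ntile(10) performs can be sketched in plain Python. This is an illustration only, not how PostgreSQL implements it: the helper name ntile and the sample scores are made up here, and the sketch assumes that buckets are kept as equal in size as possible, with any leftover rows going to the earliest buckets.

```python
def ntile(values, buckets):
    """Return (value, bucket) pairs ordered by value; buckets are numbered from 1."""
    ordered = sorted(values)
    # Every bucket gets `base` rows; the first `extra` buckets get one more.
    base, extra = divmod(len(ordered), buckets)
    result = []
    index = 0
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        for value in ordered[index:index + size]:
            result.append((value, bucket))
        index += size
    return result


scores = list(range(1, 21))  # hypothetical scores 1..20
sixth_decile = [value for value, bucket in ntile(scores, 10) if bucket == 6]
print(sixth_decile)  # [11, 12]
```

The SQL version does the same work inside the database, so the rows never have to leave it.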


User Defined (SQL) Functions

    -- Declaration
    CREATE FUNCTION times2(INT) RETURNS INT AS $$
        SELECT 2 * $1      -- SQL Statements
    $$ LANGUAGE sql;       -- Language

    -- Execution
    SELECT times2(1);
     times2
    --------
          2


2.  Packaged libraries


IN-DATABASE MACHINE LEARNING

http://madlib.net


IN-DATABASE GEOGRAPHIC QUERIES

http://postgis.net


3.  In-Database Python (and R, Java, C, etc.)


PostgreSQL

PL/X : X in {Python, R, Java, C, JavaScript, etc.}


Data Parallelism
• Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.
• Examples:
  – Measure the height of each student in a classroom (explicitly parallelizable by student)
  – MapReduce
  – map() function in Python
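The map() example can be made concrete with a short sketch. The student records and the measure_height helper are invented for illustration. Because each call depends only on its own input, the same function can be handed to a worker pool without any changes.

```python
from concurrent.futures import ThreadPoolExecutor


def measure_height(student):
    # Each measurement uses only its own input: no shared state and
    # no communication between tasks.
    name, height_cm = student
    return name, height_cm / 100.0


students = [("Ada", 162), ("Ben", 175), ("Cy", 181)]  # hypothetical data

# Serial version using the built-in map().
serial = list(map(measure_height, students))

# The same function spread over a pool of workers; results keep input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(measure_height, students))

print(serial == parallel)  # True
```

Nothing in measure_height had to change to go from the serial to the pooled version, which is what makes a task data parallel.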


    CREATE FUNCTION pymax (a integer, b integer)   -- SQL wrapper
    RETURNS integer
    AS $$
        # Normal Python
        if a > b:
            return a
        return b
    $$ LANGUAGE plpythonu;                         -- Language
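Because the body between the $$ markers is ordinary Python, the same logic can be written and tested as a plain function before it is wrapped in SQL. A minimal sketch with made-up sample rows follows; inside the database the function would instead be called from a query, once per row.

```python
def pymax(a, b):
    # Same body as the PL/Python function above.
    if a > b:
        return a
    return b


rows = [(1, 2), (5, 3), (4, 4)]  # hypothetical (a, b) column pairs
print([pymax(a, b) for a, b in rows])  # [2, 5, 4]
```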


BENEFITS:
• Reuse Python & R code
• Access Python & R libraries
• Implicit parallelism


Financial Examples

Insurance Risk Analysis Stress Testing Asset Management and Churn


MORE DETAILS

http://tinyurl.com/ih-plpython


@ianhuston