Split Query Processing in Polybase - Harvard SEAS, daslab.seas.harvard.edu/classes/cs265/files/...

Split Query Processing in Polybase

Varun Sriram, Frederick Widjaja

Problem: Querying Data in Multiple Formats

Relational: “Structured”
Distributed File System: “Unstructured”

When do we use each?

In what situations (if ever) do we need both?

Problem: Querying Data in Multiple Formats: “SQL-on-Hadoop”

Native Hadoop systems
Database-Hadoop hybrids

Why do we need SQL to query each?

Existing Solution: EXTERNAL TABLES

Existing Solution: Hadapt

Hadapt: 2 selects and 1 join

[Figure: Hadapt query plan. Inputs: HDFS and DB, each followed by a Filter. Join options: join via MapReduce, or join in PostgreSQL.]

Polybase: PDW Architecture

Polybase: EXTERNAL TABLES

Polybase: Communicating With HDFS

Polybase: USE CASES

QUERY OPTIMIZATION

SELECT count(*) FROM Customer
WHERE acctbal < 0
GROUP BY nationkey

Table Customer is stored on HDFS
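
One way to read this example: because Customer lives on HDFS, the optimizer can either import every row into PDW and evaluate the query there, or run the filter and aggregation on the Hadoop side and import only the per-nation counts. The Python sketch below is a toy illustration of that trade-off, not Polybase code; the sample rows and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows standing in for the HDFS-resident Customer table.
CUSTOMER = [
    {"nationkey": 1, "acctbal": -10.0},
    {"nationkey": 1, "acctbal": 250.0},
    {"nationkey": 2, "acctbal": -3.5},
    {"nationkey": 2, "acctbal": -7.0},
    {"nationkey": 3, "acctbal": 42.0},
]


def import_then_aggregate(rows):
    """Plan A: ship every row to the database, then filter and group there."""
    shipped = list(rows)  # all rows cross the HDFS -> database boundary
    counts = Counter(r["nationkey"] for r in shipped if r["acctbal"] < 0)
    return dict(counts), len(shipped)


def aggregate_then_import(rows):
    """Plan B: filter and group next to the data, then ship only the groups."""
    counts = Counter(r["nationkey"] for r in rows if r["acctbal"] < 0)
    shipped = list(counts.items())  # one (nationkey, count) pair per group
    return dict(shipped), len(shipped)


result_a, moved_a = import_then_aggregate(CUSTOMER)
result_b, moved_b = aggregate_then_import(CUSTOMER)
print(result_a == result_b, moved_a, moved_b)
```

Both plans return the same counts; on this toy input plan A moves 5 rows and plan B moves 2. Rows moved is only one factor: a cost-based optimizer would presumably also weigh the cost of running the job on the Hadoop side.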

JOIN ON PDW/HDFS

Perform Join with MapReduce
Perform Join in PDW

JOIN ON HDFS/HDFS

Perform Join with MapReduce
Perform Join in PDW

EXPERIMENT GOALS

Is this the right approach?

EXPERIMENT QUERY 1

SELECT TOP 10 unique1, unique2, unique4, stringu1, stringu2, string4
FROM T1
WHERE (unique1 % 100) < T1-SF
ORDER BY unique1

Table T1 is stored on HDFS
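
The predicate `(unique1 % 100) < T1-SF` acts as a selectivity knob. Assuming unique1 takes each value from 0 to N-1 exactly once (as the column name suggests; the slides do not state this), a factor of SF selects SF percent of the rows. A short Python check of that arithmetic, with a hypothetical helper name:

```python
def selected_fraction(n_rows, sf):
    """Fraction of rows passing (unique1 % 100) < sf.

    Assumes unique1 takes each value in 0..n_rows-1 exactly once.
    """
    matches = sum(1 for unique1 in range(n_rows) if unique1 % 100 < sf)
    return matches / n_rows


for sf in (1, 30, 60, 90):
    print(sf, selected_fraction(10_000, sf))
```

Varying T1-SF therefore changes how much of T1 survives the filter, which is the quantity that matters when choosing where to run the query.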

EXPERIMENT QUERY 1 - Results

16-node PDW cluster and 48-node Hadoop cluster (C-16/48)

30-node PDW cluster and 30-node Hadoop cluster (C-30/30)

60-node PDW cluster and 60-node Hadoop cluster, co-located on the same nodes (C60)

EXPERIMENT QUERY 2

SELECT TOP 10 T1.unique1, T1.unique2, T2.unique3, T2.stringu1, T2.stringu2
FROM T1 INNER JOIN T2 ON (T1.unique1 = T2.unique2)
WHERE T1.onePercent < T1-SF AND T2.onePercent < T2-SF
ORDER BY T1.unique2

“Independent” join of T1 and T2

EXPERIMENT QUERY 2 - Results

[Figure: results on clusters C-16/48, C-30/30, and C60]

EXPERIMENT QUERY 3

SELECT TOP 10 T1.unique1, T1.unique2, T2.unique3, T2.stringu1, T2.stringu2
FROM T1 INNER JOIN T2 ON (T1.unique1 = T2.unique1)
WHERE T1.onePercent < T1-SF AND T2.onePercent < T2-SF
ORDER BY T1.unique2

“Correlated” join of T1 and T2
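
The "independent" and "correlated" labels can be read in terms of how the two filters interact with the join key. The Python sketch below assumes that onePercent equals unique1 % 100 and that unique2 is unrelated to unique1, which is the Wisconsin benchmark convention these column names match; neither assumption is stated in the slides, and the table size, seeds, and function names are hypothetical.

```python
import random


def make_table(n_rows, seed):
    """Build rows with a shuffled unique1, a sequential unique2, and onePercent."""
    rng = random.Random(seed)
    unique1 = list(range(n_rows))
    rng.shuffle(unique1)
    return [
        {"unique1": u1, "unique2": u2, "onePercent": u1 % 100}
        for u2, u1 in enumerate(unique1)
    ]


def join_size(t1, t2, t2_key, sf1, sf2):
    """Count rows of T1 join T2 on T1.unique1 = T2.<t2_key> after both filters.

    Both join keys are unique, so each T1 row matches at most one T2 row.
    """
    kept = {r[t2_key] for r in t2 if r["onePercent"] < sf2}
    return sum(1 for r in t1 if r["onePercent"] < sf1 and r["unique1"] in kept)


N = 10_000
T1, T2 = make_table(N, seed=1), make_table(N, seed=2)

independent = join_size(T1, T2, "unique2", sf1=30, sf2=50)  # Query 2 join key
correlated = join_size(T1, T2, "unique1", sf1=30, sf2=50)   # Query 3 join key
print(independent, correlated)
```

Under these assumptions the correlated join keeps N * min(SF1, SF2) / 100 rows (3,000 here), because matching rows share the same onePercent value. The independent join keeps roughly N * SF1 * SF2 / 10,000 rows (about 1,500 here, varying with the shuffle), because the two filters select unrelated subsets of the join key.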

EXPERIMENT QUERY 3 - Results

[Figure: results on clusters C-16/48, C-30/30, and C60]

NEXT STEPS

● Realistic workload experiments comparing against other database/Hadoop hybrid systems

● More investigation into optimal cost-based query optimizers, and what factors should go into them
