



Page 1: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Warehouse
YUANYUAN TIAN ([email protected])
IBM RESEARCH -- ALMADEN

Publications: Tian et al., EDBT 2015; Tian et al., TODS 2016 (invited as Best of EDBT 2015)

Page 2: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Big Data in The Enterprise

[Slide diagram: a Hadoop + Spark stack (ETL/ELT, Graph, ML, Analytics, Streams, SQL) over HDFS holding social and other data, alongside an EDW with its SQL engine answering SQL queries.]

Page 3: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Example Scenario

SELECT L.url_prefix, COUNT(*)
FROM Transaction T, Logs L
WHERE T.category = 'Canon Camera'
  AND region(L.ip) = 'East Coast'
  AND T.uid = L.uid
  AND T.date >= L.date AND T.date <= L.date + 1
GROUP BY L.url_prefix

Find the number of views of each URL prefix visited by customers with East Coast IP addresses who bought a Canon camera within one day of their online visit.

[Slide diagram: Logs are stored as Table L on HDFS (Hadoop + Spark); Transactions are stored as Table T in the EDW; a single SQL query spans both systems.]

Correlate customers’ online behaviors with sales

Page 4: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Hybrid Warehouse

What is a hybrid warehouse? A special federation between Hadoop-like big data platforms and EDWs: two asymmetric, heterogeneous, and independent distributed systems.

Existing federation solutions are inadequate: they use a client-server model to access remote databases and move data, with a single connection for data transmission.

EDW vs. SQL-on-Hadoop:

Data ownership: EDW owns its data and controls data organization and partitioning; SQL-on-Hadoop works with existing files on HDFS and cannot dictate data layout.
Index support: EDW builds and exploits indexes; SQL-on-Hadoop is scan-based only, with no index support.
Update support: EDW supports update-in-place; SQL-on-Hadoop is append-only.
Capacity: EDW runs on high-end servers in smaller clusters; SQL-on-Hadoop runs on commodity machines in larger clusters (up to 10,000s of nodes).

Page 5: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Joins in the Hybrid Warehouse

Focus: an equi-join between two big tables in the hybrid warehouse.

Table T is in an EDW (a shared-nothing, full-fledged parallel database).
Table L is on HDFS, with a scan-based distributed data processing engine (HQP).
Both tables are large, but generally |L| >> |T|.
Data is not distributed/partitioned by the join key on either side.
Queries are issued, and results returned, at the EDW side.
The final result is relatively small due to aggregation.

SELECT L.url_prefix, COUNT(*)
FROM Transaction T, Logs L
WHERE T.category = 'Canon Camera'
  AND region(L.ip) = 'East Coast'
  AND T.uid = L.uid
  AND T.date >= L.date AND T.date <= L.date + 1
GROUP BY L.url_prefix

Page 6: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Existing Hybrid Solutions

Load the data of one system entirely into the other:
DB → HDFS: DB data gets updated frequently, and HDFS does not support updates properly.
HDFS → DB: HDFS data is often too big to be moved into the DB.

Dynamically ingest the needed data from HDFS into the DB (e.g., Microsoft Polybase, Pivotal HAWQ, Teradata SQL-H, Oracle Big Data SQL):
Selection and projection are pushed down to the HDFS side.
Joins are executed on the DB side only.
This puts a heavy burden on the DB side and assumes that SQL-on-Hadoop systems are not efficient at join processing.
NOT TRUE ANYMORE! (IBM Big SQL, Impala, Presto, etc.)

Split query processing between the DB and HDFS (Microsoft Polybase): joins are executed in Hadoop only when both tables are on HDFS.

Page 7: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Goals and Contributions

Goals:
Fully utilize the processing power and massive parallelism of both systems.
Minimize data movement across the network.
Exploit Bloom filters.
Consider performing joins at both the DB side and the HDFS side.

Contributions:
Adapt and extend well-known distributed join algorithms to work in the hybrid warehouse.
Propose a new zigzag join algorithm that is shown to work well in most cases.
Implement the algorithms in a prototype hybrid warehouse architecture with DB2 DPF and our join engine on HDFS.
Empirically compare all join algorithms in different selectivity settings.
Develop a sophisticated cost model for all the join algorithms.

Page 8: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

DB-Side Join

Move the HDFS data, after selection & projection, to the DB. This is the approach used in most existing hybrid systems (Polybase, Pivotal HAWQ, etc.), but the HDFS table after selection & projection can still be big.

Bloom filters exploit join selectivity, as sketched below.
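A minimal Python sketch of this filtering step, assuming toy in-memory tables; the bit-array filter and the table names below are illustrative, not the unfenced C UDFs actually used inside DB2:

import hashlib

class BloomFilter:
    """Simple bit-array Bloom filter (illustrative only)."""
    def __init__(self, num_bits, num_hashes):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from a salted hash of the key.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))

# DB side: build the filter on the join keys of T that survive local predicates.
t_uids_after_predicates = [101, 205, 333]          # stand-in for the DB scan
bf = BloomFilter(num_bits=1 << 20, num_hashes=2)
for uid in t_uids_after_predicates:
    bf.add(uid)

# HDFS side: after selection & projection on L, ship only rows that might
# join; Bloom filters give false positives but never false negatives.
l_rows = [(101, "/cameras"), (777, "/home"), (333, "/lenses")]
to_db = [(uid, url) for uid, url in l_rows if bf.might_contain(uid)]
print(to_db)   # (777, "/home") is dropped with high probability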

Page 9: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

HDFS-Side Broadcast Join

If the DB table after selection & projection is very small, broadcast the DB table to the HDFS side to avoid shuffling the HDFS table (see the sketch below).
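A Python sketch of the broadcast strategy with hypothetical tuples; every worker receives the full (small) T' and builds a local hash table, so L never crosses the network:

from collections import defaultdict

def broadcast_join(t_prime, local_l_partition):
    # Build a hash table on the broadcast DB table and probe it with the
    # worker's local slice of L; no shuffle of L is needed.
    by_uid = defaultdict(list)
    for t in t_prime:                              # t = (uid, date)
        by_uid[t[0]].append(t)
    return [(l, t) for l in local_l_partition      # l = (uid, url_prefix)
            for t in by_uid.get(l[0], [])]

t_prime = [(101, "2015-03-01"), (333, "2015-03-02")]   # broadcast to all workers
worker_partitions = [
    [(101, "/cameras"), (777, "/home")],               # worker 0's slice of L'
    [(333, "/lenses")],                                # worker 1's slice of L'
]
for part in worker_partitions:
    print(broadcast_join(t_prime, part))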

Page 10: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

HDFS-Side Repartition Join

When the DB table after selection & projection is still large:
Both sides agree on a hash function for data shuffling (see the sketch below).
A Bloom filter exploits join selectivity.
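A Python sketch of the shuffle on toy data; the essential point is the single partitioning function shared by the DB2 agents and the HDFS workers, so matching keys from both sides meet at the same worker:

from collections import defaultdict

NUM_WORKERS = 4

def partition_of(uid):
    # Agreed-upon hash function used by both sides.
    return uid % NUM_WORKERS

t_prime = [(101, "2015-03-01"), (333, "2015-03-02")]             # from the DB
l_prime = [(101, "/cameras"), (777, "/home"), (333, "/lenses")]  # from HDFS

buckets_t, buckets_l = defaultdict(list), defaultdict(list)
for t in t_prime:
    buckets_t[partition_of(t[0])].append(t)   # DB ships T' to HDFS workers
for l in l_prime:
    buckets_l[partition_of(l[0])].append(l)   # L' is shuffled among workers

# Each worker joins only its own bucket.
for w in range(NUM_WORKERS):
    by_uid = {t[0]: t for t in buckets_t[w]}
    for l in buckets_l[w]:
        if l[0] in by_uid:
            print(w, l, by_uid[l[0]])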

Page 11: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

HDFS-Side Zigzag Join

A 2-way Bloom filter exchange further reduces the DB data transferred to the HDFS side (sketched below).
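A condensed Python sketch of the zigzag flow on toy data; plain sets stand in for the Bloom filters here, so the false positives a real filter would admit are not modeled:

t_prime = {101: "2015-03-01", 205: "2015-03-03", 333: "2015-03-02"}
l_prime = [(101, "/cameras"), (777, "/home"), (333, "/lenses")]

# Step 1: the DB builds a filter on T' join keys and ships it to HDFS.
bf_t = set(t_prime)

# Step 2: the HDFS side drops L' rows that cannot join while scanning, and
# in the same pass builds a filter on the surviving keys for the return trip.
l_filtered = [l for l in l_prime if l[0] in bf_t]
bf_l = {uid for uid, _ in l_filtered}

# Step 3: the DB ships only the T' rows that pass the returned filter, so
# both the shuffled HDFS data and the transferred DB data shrink.
t_to_hdfs = {uid: d for uid, d in t_prime.items() if uid in bf_l}

# Step 4: the final join runs on the HDFS side (repartition-style).
print([(l, t_to_hdfs[l[0]]) for l in l_filtered if l[0] in t_to_hdfs])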

Page 12: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Implementation

EDW: DB2 DPF, extended with unfenced C UDFs for:
Computing and applying Bloom filters.
Different ways of transferring data between DB2 and JEN.

HQP: our own C++ join execution engine, called JEN:
A sophisticated HDFS-side join engine using multi-threading, pipelining, hash-based aggregations, etc.
Coordination between DB2 and the HDFS-side engine.
A parallel communication layer between DB2 and the HDFS-side engine.

HCatalog: stores the metadata of HDFS tables.

Each join algorithm is invoked by issuing a single query to DB2.

Page 13: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

JEN Overview

Built with a prototype of the IO layer and the scheduler from an early version of IBM Big SQL 3.0.

A JEN cluster consists of one coordinator and n workers.

Each JEN worker:
Is multi-threaded and runs on an HDFS DataNode.
Reads parts of HDFS tables (leveraging the IO layer of IBM Big SQL 3.0).
Executes local query plans.
Communicates in parallel with other JEN workers (MPI-based).
Communicates in parallel with DB2 agents through TCP/IP sockets.

The JEN coordinator:
Manages JEN workers.
Orchestrates the connection and communication between JEN workers and DB2 agents.
Retrieves metadata for HDFS tables.
Assigns HDFS blocks to JEN workers (leveraging the scheduler of IBM Big SQL 3.0).

Page 14: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Experimental Setup

HDFS cluster: 30 DataNodes, each running 1 JEN worker.
Each server: 8 cores, 32 GB RAM, 1 Gbit Ethernet, 4 disks for HDFS.

DB2 DPF: 5 servers, each running 6 DB2 agents.
Each server: 12 cores, 96 GB RAM, 10 Gbit Ethernet, 11 disks for DB2 data storage.

Interconnection: 20 Gbit switch.

Dataset:
Log table L on HDFS (15 billion records): 1 TB in text format, 421 GB in Parquet format (default).
Transaction table T in DB2 (1.6 billion records, 97 GB).

Bloom filter: 128 million bits with 2 hash functions.
# join keys: 16 million; false-positive rate: ~5%.
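These numbers are consistent with the standard Bloom filter false-positive approximation, as this quick Python check shows:

import math

m = 128_000_000   # bits in the filter
k = 2             # hash functions
n = 16_000_000    # distinct join keys inserted

# Standard approximation: p ≈ (1 - e^(-k*n/m))^k
p = (1 - math.exp(-k * n / m)) ** k
print(f"{p:.3f}")   # ≈ 0.049, i.e. roughly the 5% quoted above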

Page 15: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

DB-Side Joins vs. HDFS-Side Joins

(Transaction table selectivity = 0.1)

DB-side joins work well only when the selectivity on L is small (σL <= 0.01); they deteriorate fast as L' grows.

HDFS-side joins show very steady performance with increasing L'.

The HDFS-side join (especially the zigzag join) is a very reliable choice for joins in the hybrid warehouse!

Page 16: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Broadcast Join vs. Repartition Join

Broadcast join only works in very limited cases, e.g., when σT <= 0.001 (T' <= 25 MB).

The tradeoff: broadcasting T' (30 * T') over the interconnect vs. sending T' once over the interconnect plus shuffling L' within the HDFS cluster.

[Figure: two plots of time (sec) vs. log table selectivity (0.001 to 0.2), comparing broadcast and repartition joins, at transaction table selectivity = 0.001 (left) and 0.01 (right).]

Page 17: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Zigzag Join vs. Repartition Joins

(Transaction table selectivity = 0.1)

Algorithm         | HDFS tuples shuffled | DB tuples sent
Repartition       | 5,854 million        | 165 million
Repartition (BF)  | 591 million          | 165 million
Zigzag            | 591 million          | 30 million

The zigzag join is the most efficient: up to 2.1x faster than the repartition join, and up to 1.8x faster than the repartition join with BF.

The zigzag join significantly reduces data movement: 9.9x less HDFS data shuffled and 5.5x less DB data sent.

Zigzag join is the best HDFS-side join algorithm!

Page 18: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Cost Model of Join Algorithms

Goals:
Capture the relative performance of the join algorithms.
Enable a query optimizer in the hybrid warehouse to choose the right join strategy.

The model estimates total resource time (disk IO, network IO, CPU) in milliseconds.

Parameters used in the cost formulas:
System parameters, related only to the system environment (common to all queries): # DB nodes, # HDFS nodes, DB buffer pool size, disk IO speeds (DB, HDFS), network IO speeds (DB, HDFS, in between), etc. These are estimated through a learning suite that runs a number of test programs.
Query parameters, specific to each query: table cardinalities, table sizes, local predicate selectivities, join selectivity, Bloom filter size, Bloom filter false-positive rate, etc. For the DB table, leverage DB statistics; for the HDFS table, estimate through sampling or the Hive ANALYZE TABLE command if possible; for join selectivity, estimate through sampling. A toy sketch of how such a model drives the choice follows.
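As a toy illustration only (the formulas and constants below are placeholders, not the paper's actual cost model), an optimizer of this shape computes an estimated resource time per algorithm and picks the minimum:

def cost_db_side(p):
    # Ship the filtered L' across the interconnect, then join in the DB.
    moved = p["L_bytes"] * p["sel_L"] * p["bf_pass_L"]
    return moved / p["net_cross"] + moved / p["disk_db"] + p["cpu_join_db"]

def cost_zigzag(p):
    # Shuffle the filtered L' inside HDFS; ship the doubly filtered T' across.
    shuffled = p["L_bytes"] * p["sel_L"] * p["bf_pass_L"]
    sent = p["T_bytes"] * p["sel_T"] * p["bf_pass_T"]
    return shuffled / p["net_hdfs"] + sent / p["net_cross"] + p["cpu_join_hdfs"]

def choose(params, algos):
    # Pick the algorithm with the smallest estimated resource time (ms).
    return min(algos, key=lambda a: a[1](params))[0]

params = {   # query parameters from stats/sampling; system parameters learned
    "L_bytes": 421e9, "T_bytes": 97e9, "sel_L": 0.1, "sel_T": 0.1,
    "bf_pass_L": 0.1, "bf_pass_T": 0.2,
    "net_cross": 2.5e6, "net_hdfs": 3.75e6, "disk_db": 5e6,   # bytes/ms
    "cpu_join_db": 6e4, "cpu_join_hdfs": 3e4,                 # ms
}
print(choose(params, [("db(BF)", cost_db_side), ("zigzag", cost_zigzag)]))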

Page 19: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Validation of Cost Models

Sel. on T | Sel. on L | Join sel. on T | Join sel. on L | Best from cost model | Best from experiment | Intersection metric
0.05 | 0.001 | 0.0005 | 0.05 | db(BF) | db(BF) | 0
0.05 | 0.01  | 0.005  | 0.05 | db(BF) | db(BF) | 0.18
0.05 | 0.1   | 0.05   | 0.05 | zigzag | zigzag | 0.08
0.05 | 0.2   | 0.1    | 0.05 | zigzag | zigzag | 0
0.1  | 0.001 | 0.0005 | 0.1  | db(BF) | db(BF) | 0
0.1  | 0.01  | 0.005  | 0.1  | db(BF) | db(BF) | 0.18
0.1  | 0.1   | 0.05   | 0.1  | zigzag | zigzag | 0.14
0.1  | 0.2   | 0.1    | 0.1  | zigzag | zigzag | 0.06

The cost model correctly finds the best algorithm in every case!

Even the ranking of the algorithms is similar or identical to that of the empirical observations!

Page 20: Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)

Concluding Remarks

There is an emerging need for the hybrid warehouse: enterprise warehouses will co-exist with big data systems.

Bloom filters are a good way to filter data, and they can be used in both directions.

Powerful SQL processing capability now exists on the HDFS side (IBM Big SQL, Impala, Hive 0.14, ...); existing SQL-on-Hadoop systems can be augmented with the capabilities of JEN.

With more capacity and investment on the big data side, exploit that capacity without moving data: it is better to do the joins on the Hadoop side.

More complex usage patterns are emerging: EDW on premise, Hadoop in the cloud.