Apache Hive 2.0: SQL, Speed, Scale

Alan Gates
Hive PMC Member
Co-founder Hortonworks
May 2016

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Acknowledgements

The Apache Hive community for building all this awesome tech
Content of some of these slides based on earlier presentations by Sergey Shelukhin and Siddharth Seth
alias Hive=‘Apache Hive’
alias Hadoop=‘Apache Hadoop’
alias Spark=‘Apache Spark’
alias Tez=‘Apache Tez’
alias Parquet=‘Apache Parquet’
alias ORC=‘Apache ORC’
alias Omid=‘Apache Omid (incubating)’
alias Calcite=‘Apache Calcite’


Apache Hive History

Initially Hive provided SQL on Hadoop
– Provided a table view instead of file view of data
– Translated SQL to MapReduce
– Mostly used for ETL (Extract Transform Load)
– Big, batch, high start up time

Around 2012 it became clear users wanted to do all data warehousing on Hadoop, not just batch ETL

Hive has shifted over time to focus on traditional data warehousing problems
– Still does large ETL well
– Now also can be used for analytics, reporting
– Work being done to better support BI (Business Intelligence) tools

Not OLTP, very focused on backend analytics


Hive 1.x and 2.x

New feature development in Hive moving at a fast pace
– Stressful for those who use Hive for its original purpose (ETL type SQL on MapReduce)
– Realizing the full potential of Hive as data warehouse on Hadoop requires more changes

Compromise: follow Hadoop’s example, split into stable and new feature lines

1.x
– Stable
– Backwards compatible
– Ongoing bug fixes

2.x
– Major new features
– Backwards compatible where possible, but some things will be broken
– Hive 2.0 released February 15, 2016 (not considered production ready)
– Hive 2.1 released June 20, 2016 (getting closer, but still beta)


Hive 2.0 New Features Overview

1039 JIRAs resolved with 2.0 as fix version
– 666 bugs
– 140 improvements or new features
– 625 more issues resolved in 2.1, almost all bug fixes

HPLSQL
LLAP
HBase Metastore
Hive-On-Spark Improvements
Cost Based Optimizer Improvements
Many, many new features and bug fixes I will not have time to cover


Adding Procedural SQL: HPLSQL

Procedural SQL, akin to Oracle’s PL/SQL and Teradata’s stored procedures
– Adds cursors, loops (FOR, WHILE, LOOP), branches (IF), HPLSQL procedures, exceptions (SIGNAL)

Aims to be compatible with all major dialects of procedural SQL to maximize re-use of existing scripts

Currently external to Hive, communicates with Hive via JDBC
– User runs command using hplsql binary
– Goal is to tightly integrate it so that Hive’s parser can execute HPLSQL, store HPLSQL procedures, etc.


Sub-second Queries in Hive: LLAP (Live Long and Process)

Persistent daemons
– Saves time on process start up (eliminates container allocation and JVM start up time)
– All code JITed within a query or two

Data caching with an async I/O elevator
– Hot data cached in memory (columnar aware, so only hot columns cached)
– When possible, work is scheduled on a node that has the data cached; otherwise it runs on another node

Operators can be executed inside LLAP when it makes sense
– Large, ETL style queries usually don’t make sense
– User code not run in LLAP for security

Working on interface to allow other data engines to read securely in parallel
Beta in 2.0
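The cache-locality bullet above can be sketched in a few lines of Python. This is only an illustration of the idea, not LLAP's actual scheduler: the function `pick_node`, the `cache` and `free_slots` maps, and the node and split names are all hypothetical.

```python
from typing import Dict, Optional, Set


def pick_node(split: str,
              cache: Dict[str, Set[str]],
              free_slots: Dict[str, int]) -> Optional[str]:
    """Pick a node for `split`, preferring nodes that already cache it.

    Hypothetical model (not LLAP's real scheduler):
    `cache` maps node name -> set of splits held in memory,
    `free_slots` maps node name -> number of idle executors.
    """
    available = [n for n, slots in free_slots.items() if slots > 0]
    if not available:
        return None  # nothing is free; a caller would queue the work
    # First choice: a free node that already holds the data in its cache.
    local = [n for n in available if split in cache.get(n, set())]
    candidates = local or available
    # Tie-break on most free slots, then on name, to stay deterministic.
    return min(candidates, key=lambda n: (-free_slots[n], n))


cache = {"node1": {"part=2016-05-01"}, "node2": set()}
# node1 has the data cached and a free slot, so it wins despite being busier.
print(pick_node("part=2016-05-01", cache, {"node1": 1, "node2": 4}))
# node1 is full, so the work falls back to a node without the data.
print(pick_node("part=2016-05-01", cache, {"node1": 0, "node2": 4}))
```

The fallback branch is the point of the sketch: locality is a preference, not a requirement, so a busy cache-holding node never blocks the query.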


Hive With LLAP Execution Options

[Figure: three execution layouts shown side by side: Tez Only, LLAP + Tez, and LLAP only. Box labels in the diagram: AM, T, M, R.]


LLAP Performance

[Figure: bar chart "LLAP vs Hive 1.x 10TB Scale". Y-axis: Time (seconds), 0 to 50. Series: LLAP, Hive 1.x. Queries: query3, query12, query20, query21, query26, query27, query42, query52, query55, query73, query89, query91, query98.]


LLAP Performance Continued

[Figure: bar chart "Hive / LLAP, Hive 1.2.1 Query Times". Y-axis: Time (seconds), 0 to 500. Series: LLAP, Hive 1.2.1. Queries: query3, query12, query17, query21, query26, query28, query43, query48, query55, query65, query73, query82, query89, query98, query18, query25, query31, query34, query40, query49, query51, query56, query66, query71, query79, query85, query88, query92, query94, query96.]

38 out of 61 queries ran 50% faster
25 out of 61 queries ran 70% faster
12 out of 61 queries ran 80% faster
1 query ran 90% faster
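For readers who prefer shares to raw counts, the figures above convert as follows. The counts and the total of 61 come from the slide; the names `faster_counts` and `share` are just illustrative.

```python
TOTAL_QUERIES = 61

# Counts copied from the slide; the keys are the slide's own threshold labels.
faster_counts = {"50%": 38, "70%": 25, "80%": 12, "90%": 1}


def share(count: int, total: int = TOTAL_QUERIES) -> float:
    """Return `count` as a percentage of `total`, rounded to one decimal."""
    return round(100.0 * count / total, 1)


shares = {label: share(n) for label, n in faster_counts.items()}
for label, pct in shares.items():
    print(f"{label} faster: {faster_counts[label]}/{TOTAL_QUERIES} = {pct}% of queries")
```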


LLAP Limitations

Currently in Beta
Read only, no write path yet
Does not work with ACID yet (see previous bullet)
User must decide whether query runs fully in LLAP, mixed mode, or not at all
– Should be handled by CBO
Currently only reads ORC files
Currently only integrates with Tez as an engine


Speeding up Query Planning: HBase Metastore

Add option to use HBase to store Hive’s metadata

Why?
– Planning a query that reads several thousand partitions in Hive 1.2 takes 5+ seconds, mostly for metadata acquisition
– ORM layer produces complex, slow schema (40+ tables)
– The need to work across 5 different databases limits performance optimizations and maximizes test matrix for developers
– Limits caching opportunities as we cannot store too much data in a single node RDBMS
– The need to limit number of concurrent connections forces all metadata operations to be done during query planning
– HBase addresses each of these

Goal: cut metadata access time for query with thousands of partitions to 200 milliseconds
– Not there yet, currently at 1-1.5 seconds

Challenges
– HBase lacks transactions, addressing via Apache Omid (incubating)

Alpha in Hive 2.0


Improvements to Hive on Spark

Dynamic partition pruning
Make use of Spark persistence for self-join, self-union, and CTEs
Vectorized map-join and other map-join improvements
Parallel order by
Pre-warming of containers
Support for Spark 1.5
Many bug fixes
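As a rough picture of what dynamic partition pruning buys, the sketch below filters the partitions of a large table using join keys that only become known at run time. It models the idea in plain Python and is not Hive's or Spark's implementation; `prune_partitions` and the sample tables are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def prune_partitions(fact_partitions: Dict[str, List[dict]],
                     dim_rows: Iterable[dict],
                     join_key: str) -> Dict[str, List[dict]]:
    """Keep only partitions whose value appears among the dimension join keys."""
    # The set of keys is only known after the dimension side has been filtered,
    # which is why the pruning has to happen at run time rather than at plan time.
    wanted: Set[str] = {row[join_key] for row in dim_rows}
    return {part: rows for part, rows in fact_partitions.items() if part in wanted}


# Hypothetical data: a sales table partitioned by day, joined to a filtered day list.
sales = {
    "2016-05-01": [{"amount": 10}],
    "2016-05-02": [{"amount": 20}],
    "2016-05-03": [{"amount": 30}],
}
holidays = [{"day": "2016-05-02", "name": "example holiday"}]

pruned = prune_partitions(sales, holidays, "day")
print(sorted(pruned))  # only partitions that can match the join are scanned
```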


Cost Based Optimizer (CBO) Improvements

Hive’s CBO uses Calcite
– Not all optimization rules migrated yet, but 2.0 continues work towards that

CBO on by default in 2.0 (wasn’t in 1.x)
Main focus of CBO work has been BI queries (using TPC-DS as guide)
– Some work on machine generated queries, since tools generate some funky queries

Focus on improving stats collection and estimating stats more accurately between operators in the plan
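To show why per-operator statistics matter, here is the textbook equi-join cardinality estimate written out in Python. It is a generic illustration, assuming uniformly distributed keys; it is not taken from Calcite or Hive, and the function name and the numbers are hypothetical.

```python
def estimate_join_rows(left_rows: float, right_rows: float,
                       left_ndv: float, right_ndv: float) -> float:
    """Textbook equi-join estimate: |L| * |R| / max(ndv(L.key), ndv(R.key)).

    Assumes uniformly distributed keys; real optimizers refine this.
    """
    return left_rows * right_rows / max(left_ndv, right_ndv)


# Hypothetical stats: a 1,000,000-row fact table with 1,000 distinct keys,
# joined to a dimension table that a filter has cut from 1,000 rows to 100.
fact_rows, fact_ndv = 1_000_000, 1_000
dim_rows_after_filter, dim_ndv_after_filter = 100, 100

estimate = estimate_join_rows(fact_rows, dim_rows_after_filter,
                              fact_ndv, dim_ndv_after_filter)
print(estimate)

# If the filter's effect is not propagated, the estimate is 10x too high.
stale = estimate_join_rows(fact_rows, 1_000, fact_ndv, 1_000)
print(stale)
```

The gap between the two numbers is the kind of error that carrying accurate row counts and distinct-value counts from one operator to the next is meant to avoid.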


And Many, Many More

• SQL Standard Auth is the default authorization (actually works)
• CLI mode for beeline (WIP to replace and deprecate CLI in Hive 2.*)
• Codahale-based metrics (also in 1.3)
• HS2 Web UI
• Stability Improvements and bugfixes for ACID (almost production ready now)
• Native vectorized mapjoin, vectorized reducesink, improved vectorized GBY, etc.
• Improvements to Parquet performance (PPD, memory manager, etc.)
• ORC schema evolution (beta)
• Improvement to windowing functions, refactoring ORC before split, SIMD optimizations, new LIMIT syntax, parallel compilation in HS2, improvements to Tez session management, many more


Hive 2.0 Incompatibilities

Java 7 & 8 supported, 6 no longer supported
Requires Hadoop 2.x, Hadoop 1.x no longer supported
MapReduce deprecated, Tez or Spark recommended instead
– At some future date MR will be removed

Some configuration defaults changed, e.g.
– bucketing enforced by default
– metadata schema no longer created if it is missing
– SQL Standard authorization used by default

We plan to remove Hive CLI in the future and replace with beeline CLI
– Why?
  • Makes it easier for users to deploy secure clusters where all access is via [OJ]DBC
  • It is cleaner to maintain one code path
– Does not require HiveServer2, can run HS2 embedded in beeline


Thank You
