Apache Hive on ACID
Alan Gates, Hive PMC Member and Co-founder of Hortonworks
May 2016



History

Hive only updated partitions
– INSERT...OVERWRITE rewrote an entire partition
– Forced daily or even hourly partitions
– Could add files to the partition directory, but file compaction was manual

What about concurrent readers?
– OK for inserts, but overwrite caused races
– There is a ZooKeeper lock manager, but…

No way to delete or update rows

No INSERT INTO T VALUES…

– Breaks some tools


Why Do You Need ACID?

Hadoop and Hive have always…
– Just said no to ACID
– Perceived as a tradeoff for performance

But your data isn’t static
– It changes daily, hourly, or faster
– Sometimes it needs to be restated (late-arriving data) or facts change (e.g. a user’s physical address)
– Loading data into Hive every hour is so 2010; data should be available in Hive as soon as it arrives

We saw users implementing ad hoc solutions
– This is a lot of work and hard to get right
– Hive should support this as a first-class feature


When Should You Use Hive’s ACID?

NOT OLTP!!!

Updating a Dimension Table

– Changing a customer’s address

Delete Old Records
– Remove records for compliance (sketched below)

Update/Restate Large Fact Tables
– Fix problems after they are in the warehouse

Streaming Data Ingest
– A continual stream of data coming in
– Typically from Flume or Storm

NOT OLTP!!!
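To make the dimension-table and compliance use cases concrete, here is a minimal Hive SQL sketch. The tables and columns (customer_dim, web_logs, etc.) are hypothetical and assumed to have been created as transactional tables (see the requirements on the next slide):

  -- Change a customer's address in a dimension table
  UPDATE customer_dim SET address = '123 New Street, Springfield' WHERE customer_id = 42;

  -- Remove old records for compliance
  DELETE FROM web_logs WHERE log_date < '2014-01-01';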


SQL Changes for ACID

Since Hive 0.14

New DML

– INSERT INTO T VALUES(1, 'fred', ...);
– UPDATE T SET (x = 5[, ...]) [WHERE ...]
– DELETE FROM T [WHERE ...]
– Supports partitioned and non-partitioned tables; the WHERE clause can specify a partition but is not required

Restrictions
– Table must have a format that extends AcidInputFormat

• currently ORC
• work started on Parquet (HIVE-8123)

– Table must be bucketed and not sorted
• can use 1 bucket, but this will restrict write parallelism

– Table must be marked transactional

• create table T(...) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
• Existing ORC tables that are bucketed can be marked transactional via ALTER (see the sketch below)
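Putting the pieces together, a minimal sketch of both paths; T is a hypothetical two-column table and T2 a hypothetical existing bucketed ORC table:

  -- New transactional table: bucketed, ORC, marked transactional
  CREATE TABLE T (a INT, b STRING)
  CLUSTERED BY (a) INTO 2 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

  INSERT INTO T VALUES (1, 'fred');
  UPDATE T SET b = 'wilma' WHERE a = 1;   -- the bucketing column a cannot be updated
  DELETE FROM T WHERE a = 1;

  -- Convert an existing bucketed ORC table
  ALTER TABLE T2 SET TBLPROPERTIES ('transactional'='true');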


Ingesting Data Into Hive From a Stream

Data is flowing in from generators in a stream

Without this, you have to add it to Hive in batches, often every hour

– Thus your users have to wait an hour before they can see their data

New interface in hive.hcatalog.streaming lets applications write small batches of records and commit them
– Users can now see data within a few seconds of it arriving from the data generators

Available for Apache Flume and Apache Storm
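On the SQL side, the streaming clients just need a bucketed, ORC-backed, transactional table to write into. A minimal sketch with a hypothetical table name and columns:

  CREATE TABLE web_events (event_time STRING, user_id BIGINT, url STRING)
  PARTITIONED BY (event_date STRING)
  CLUSTERED BY (user_id) INTO 4 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

The Flume Hive sink and the Storm Hive bolt then write small transaction batches into a table like this through the streaming API.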


Design

HDFS does not allow arbitrary writes
– Store changes as delta files
– Stitched together by client on read

Writes get a transaction ID
– Sequentially assigned by metastore

Reads get highest committed transaction & list of open/aborted transactions
– Provides snapshot consistency
– No exclusive locks required


Why Not HBase

Good
– Handles compactions for us
– Already has a similar data model with LSM

Bad
– When we started this there were no transaction managers for HBase, and this feature requires transactions
– HFile is column-family based rather than columnar
– HBase is focused on point lookups and range scans

• Warehousing requires full scans


Stitching Buckets Together


HDFS Layout

Partition locations remain unchanged
– Still warehouse/$db/$tbl/$part

Bucket Files Structured By Transactions
– Base files: $part/base_$tid/bucket_*
– Delta files: $part/delta_$tid_$tid/bucket_*
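As a purely illustrative example (database, table, partition, and transaction ids are made up), a partition might contain something like:

  warehouse/sales.db/orders/ds=2016-05-01/base_0000050/bucket_00000
  warehouse/sales.db/orders/ds=2016-05-01/base_0000050/bucket_00001
  warehouse/sales.db/orders/ds=2016-05-01/delta_0000051_0000051/bucket_00000
  warehouse/sales.db/orders/ds=2016-05-01/delta_0000051_0000051/bucket_00001
  warehouse/sales.db/orders/ds=2016-05-01/delta_0000052_0000060/bucket_00000
  warehouse/sales.db/orders/ds=2016-05-01/delta_0000052_0000060/bucket_00001

A delta directory covers a range of transaction ids when it was written by a streaming transaction batch or produced by a minor compaction.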


Input and Output Formats

Created new AcidInput/OutputFormat
– Unique key is original transaction id, bucket, row id

Reader returns the correct version of a row based on transaction state

Also added a raw API for the compactor

– Provides previous events as well

ORC implements new API
– Extends records with change metadata

• Add operation (d, u, i), latest transaction id, and key
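Conceptually, each record in an ACID ORC file wraps the user's row in a struct of that change metadata; roughly (this reflects my understanding of the layout, so treat the exact field names as illustrative):

  struct<
    operation: int,               -- the d/u/i operation above
    originalTransaction: bigint,  -- part of the unique key
    bucket: int,                  -- part of the unique key
    rowId: bigint,                -- part of the unique key
    currentTransaction: bigint,   -- latest transaction to touch the row
    row: struct<...>              -- the user's actual columns
  >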


Transaction Manager

Existing lock managers
– In memory - not durable
– ZooKeeper - requires additional components to install, administer, etc.

Locks need to be integrated with transactions
– commit/rollback must atomically release locks

We sort of have this database lying around which has ACID characteristics (the metastore)

Transactions and locks are stored in the metastore

Uses the metastore DB to provide unique, ascending ids for transactions and locks


Transaction & Locking Model

DML statements are auto-commit

Snapshot isolation

– Reader will see consistent data for the duration of a query

Current transactions can be displayed using SHOW TRANSACTIONS

Three types of locks

– shared read
– shared write (can co-exist with shared read, but not other shared write)
– exclusive

Operations require different locks
– SELECT, INSERT – shared read (inserts cannot conflict because there is no primary key)
– UPDATE, DELETE – shared write
– DROP, INSERT OVERWRITE – exclusive
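The current transaction and lock state can be inspected from any Hive session, for example (the table name T and the partition spec are illustrative):

  SHOW TRANSACTIONS;
  SHOW LOCKS;
  SHOW LOCKS T;
  SHOW LOCKS T PARTITION (ds='2016-05-01');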


Compaction

Each transaction (or batch of transactions in streaming) creates a new delta directory

Too many files strains the NameNode and hurts read performance due to fan-in on merge

Need to automatically compact files

– Initiated by metastore server, run as MR jobs in the cluster
– Can be manually initiated by user via ALTER TABLE COMPACT (example below)

Minor compaction merges many deltas into one
– Run when there are more than 10 delta directories (configurable)

Major compaction merges deltas with base and rewrites base
– Run when size of the deltas > 10% of the size of the base (configurable)

Old files kept around until all readers are done with their snapshots, then cleaned up
– Compaction and data read/writes can be done in parallel with no need to pause the world
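Compaction can also be requested by hand and monitored, for example (the table name and partition spec are illustrative; to my knowledge the two thresholds above are controlled by hive.compactor.delta.num.threshold and hive.compactor.delta.pct.threshold):

  -- Request a minor compaction of one partition
  ALTER TABLE T PARTITION (ds='2016-05-01') COMPACT 'minor';

  -- Request a major compaction of an unpartitioned table
  ALTER TABLE T COMPACT 'major';

  -- See queued, running, and completed compactions
  SHOW COMPACTIONS;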


Issues Found and (Some) Fixed

Not GA-ready in Hive 1.2 or 2.0; hope to have it GA-ready by 1.3 and 2.1

Deadlocks in the RDBMS

– The way the Hive metastore used the RDBMS caused a lot of deadlocks – greatly improved

Usability
– SHOW COMPACTIONS and SHOW LOCKS did not give users/admins enough information to determine who was blocking whom or what was getting compacted – improved, some work still to do here

Resilience
– System was easy to knock over when clients did silly things (like opening 1M+ transactions) – improved, though I am sure there are still some ways to kill it
– Initially compactor threads only ran in one metastore instance – resolved, now they can run in multiple instances

Correctness
– Streaming ingest did not enforce proper bucket spraying – resolved
– Initial versions of the compactor had a race condition that resulted in record loss – resolved
– Adding a column to a table or changing a column’s type caused read-time errors – resolved
– Updates can get lost when overlapping transactions update the same partition – HIVE-13395

Performance
– Some work done here (e.g. making predicate push down work, efficient split combinations)
– Much still to be done


Next: MERGE

Standard SQL, added in SQL:2003

Problem: today each UPDATE requires a scan of the partition or table

– There is no way to apply separate updates in a batch

Allows upserts

Use case:

– bring in a batch from transactional/front-end systems
– apply as inserts or updates (as appropriate) in one read/write pass (sketched below)
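A sketch of that upsert pattern in standard SQL:2003 MERGE syntax; the table and column names are hypothetical, and Hive's exact MERGE syntax was still being designed at the time of this talk:

  MERGE INTO customer_dim t
  USING staged_customer_updates s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN
    UPDATE SET address = s.address, last_seen = s.last_seen
  WHEN NOT MATCHED THEN
    INSERT VALUES (s.customer_id, s.address, s.last_seen);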


Future Work

Multi-statement transactions (BEGIN, COMMIT, ROLLBACK)

Integration with LLAP

– Figure out how MVCC works with LLAP’s caching
– Build a write path through LLAP

Lower the user burden
– Make the bucketing automatic so the user does not have to be aware of it
– Allow the user to determine the sort order of the table
– Eventually remove the transactional/non-transactional distinction in tables

Improve monitoring and alerting facilities
– Make it easier for an admin to determine when the system is in trouble, e.g. the compactor is not running or is failing on every run, there are too many open transactions, etc.


Thank You