Integration of Apache Hive and HBase

Enis Soztutar, enis [at] apache [dot] org, @enissoz

© Hortonworks Inc. 2011. Architecting the Future of Big Data

Integration of Hive and HBase


DESCRIPTION

Apache Hive provides SQL-like access to data stored in Apache Hadoop. Apache HBase stores tabular data in Hadoop and supports update operations. The combination of these two capabilities is often desired; however, the current integration shows limitations such as performance issues. In this talk, Enis Soztutar presents an overview of Hive and HBase and discusses new updates and improvements from the community on the integration of these two projects. Various techniques used to reduce data exchange and improve efficiency are also covered.



Page 2: Integration of Hive and HBase

About Me

•  User and committer of Hadoop since 2007

•  Contributor to Apache Hadoop, HBase, Hive and Gora

•  Joined Hortonworks as Member of Technical Staff

•  Twitter: @enissoz

Page 3: Integration of Hive and HBase

Agenda

•  Overview of Hive and HBase

•  Hive + HBase Features and Improvements

•  Future of Hive and HBase

•  Q&A

Page 4: Integration of Hive and HBase

Apache Hive Overview

• Apache Hive is a data warehouse system for Hadoop

• SQL-like query language called HiveQL

• Built for PB-scale data

• Main purpose is analysis and ad hoc querying

• Database / table / partition / bucket – DDL Operations

• SQL Types + Complex Types (ARRAY, MAP, etc.)

• Very extensible

• Not for: small data sets, low latency queries, OLTP

Page 5: Integration of Hive and HBase

Apache Hive Architecture

[Architecture diagram. Components shown: CLI, JDBC/ODBC, Hive Web Interface, Hive Thrift Server, Driver (Parser, Planner, Optimizer, Execution), MSClient, Metastore, RDBMS, MapReduce, HDFS]

Page 6: Integration of Hive and HBase

Overview of Apache HBase

• Apache HBase is the Hadoop database

• Modeled after Google’s BigTable

• A sparse, distributed, persistent multi-dimensional sorted map

• The map is indexed by a row key, column key, and a timestamp

• Each value in the map is an un-interpreted array of bytes

• Low latency random data access
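The "sorted map" description can be made concrete with a toy model. The following is a minimal Python sketch, not HBase code: the class name `SortedMapTable` and its methods are hypothetical, everything lives in memory, and it only illustrates that cells are addressed by (row key, column key, timestamp), hold raw bytes, and are read back in row-key order.

```python
from collections import defaultdict


class SortedMapTable:
    """Toy model of the Bigtable/HBase logical view (illustration only)."""

    def __init__(self):
        # (row key, column key) -> {timestamp: value bytes}
        self._cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row=b""):
        """Yield (row, column, newest value) in sorted key order."""
        for row, column in sorted(self._cells):
            if row >= start_row:
                yield row, column, self.get(row, column)


table = SortedMapTable()
table.put(b"bit.ly/abcd", b"u:url", 1, b"example.com/foo")
table.put(b"bit.ly/aaaa", b"s:hits", 1, b"100")
table.put(b"bit.ly/aaaa", b"s:hits", 2, b"101")
```

Reading `s:hits` for `bit.ly/aaaa` returns the value written with the larger timestamp, and a scan visits `bit.ly/aaaa` before `bit.ly/abcd` regardless of insertion order.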

Page 7: Integration of Hive and HBase

Overview of Apache HBase

• Logical view:

[Figure from: Bigtable: A Distributed Storage System for Structured Data, Chang, et al.]

Page 8: Integration of Hive and HBase

Apache HBase Architecture

[Architecture diagram. Components shown: Client, Zookeeper, HMaster, three Region servers each hosting two Regions, HDFS]

Page 9: Integration of Hive and HBase

Hive + HBase Features and Improvements

Page 10: Integration of Hive and HBase

Hive + HBase Motivation

• Hive and HBase have different characteristics:
  – Hive: high latency, structured data, used by analysts
  – HBase: low latency, unstructured data, used by programmers

• Hive data warehouses on Hadoop are high latency
  – Long ETL times
  – No access to real-time data

• Analyzing HBase data with MapReduce requires custom coding

• Hive and SQL are already known by many analysts

Page 11: Integration of Hive and HBase

Use Case 1: HBase as ETL Data Sink

[Figure from: HUG - Hive/HBase Integration or, MaybeSQL? April 2010, John Sichi, Facebook, http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010]

Page 12: Integration of Hive and HBase

Use Case 2: HBase as Data Source

[Figure from: HUG - Hive/HBase Integration or, MaybeSQL? April 2010, John Sichi, Facebook, http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010]

Page 13: Integration of Hive and HBase

Use Case 3: Low Latency Warehouse

[Figure from: HUG - Hive/HBase Integration or, MaybeSQL? April 2010, John Sichi, Facebook, http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010]

Page 14: Integration of Hive and HBase

Example: Hive + HBase (HBase table)

hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME => 's'}

hbase(main):014:0> scan 'short_urls'

ROW COLUMN+CELL
bit.ly/aaaa column=s:hits, value=100
bit.ly/aaaa column=u:url, value=hbase.apache.org/
bit.ly/abcd column=s:hits, value=123
bit.ly/abcd column=u:url, value=example.com/foo

Page 15: Integration of Hive and HBase

Example: Hive + HBase (Hive table)

CREATE TABLE short_urls(
  short_url string,
  url string,
  hit_count int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,u:url,s:hits")
TBLPROPERTIES
("hbase.table.name" = "short_urls");

Page 16: Integration of Hive and HBase

Storage Handler

• Hive defines the HiveStorageHandler interface for different storage backends: HBase / Cassandra / MongoDB / etc.

• Storage Handler has hooks for
  – Getting input / output formats
  – Metadata operations: CREATE TABLE, DROP TABLE, etc.

• Storage Handler is a table-level concept
  – Does not support Hive partitions and buckets

Page 17: Integration of Hive and HBase

Apache Hive + HBase Architecture

[Architecture diagram. Components shown: CLI, Hive Web Interface, Hive Thrift Server, Driver (Parser, Planner, Optimizer, Execution), MSClient, Metastore, RDBMS, StorageHandler, HBase, MapReduce, HDFS]

Page 18: Integration of Hive and HBase

Hive + HBase Integration

• For InputFormat / OutputFormat, getSplits(), etc., the underlying HBase classes are used

• Column selection and certain filters can be pushed down

• HBase tables can be used with other (Hadoop native) tables and SQL constructs

• Hive DDL operations are converted to HBase DDL operations via the client hook
  – All operations are performed by the client
  – No two-phase commit

Page 19: Integration of Hive and HBase

Schema / Type Mapping

Page 20: Integration of Hive and HBase

Schema Mapping

• Hive table + columns + column types <=> HBase table + column families (+ column qualifiers)

• Every field in the Hive table is mapped, in order, to one of
  – The table key (using :key as selector)
  – A column family (cf:) -> MAP fields in Hive
  – A column (cf:cq)

• Hive table does not need to include all columns in HBase

CREATE TABLE short_urls(
  short_url string,
  url string,
  hit_count int,
  props map<string,string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,u:url,s:hits,p:");
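The positional nature of the mapping is easy to miss, so here is a minimal Python sketch of how a mapping string lines up with the Hive columns. It is not Hive's parser: the function name `parse_columns_mapping` and the result tuples are hypothetical, and stripping whitespace around entries is a convenience of the sketch only.

```python
def parse_columns_mapping(mapping, hive_columns):
    """Pair each Hive column with its HBase target, in declaration order."""
    entries = [entry.strip() for entry in mapping.split(",")]
    if len(entries) != len(hive_columns):
        raise ValueError("mapping needs exactly one entry per Hive column")
    result = {}
    for hive_column, entry in zip(hive_columns, entries):
        if entry == ":key":
            result[hive_column] = ("row key", None, None)
            continue
        family, _, qualifier = entry.partition(":")
        if qualifier:
            # "cf:cq" addresses a single column
            result[hive_column] = ("column", family, qualifier)
        else:
            # "cf:" with no qualifier maps the whole family to a MAP field
            result[hive_column] = ("column family", family, None)
    return result


mapping = parse_columns_mapping(
    ":key,u:url,s:hits,p:",
    ["short_url", "url", "hit_count", "props"],
)
```

For the `short_urls` table this pairs `short_url` with the row key, `url` with `u:url`, `hit_count` with `s:hits`, and the `props` map with the whole `p` family.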

Page 21: Integration of Hive and HBase

Type Mapping

• Recently added to Hive (0.9.0)

• Previously all types were being converted to strings in HBase

• Hive has:
  – Primitive types: INT, STRING, BINARY, DATE, etc.
  – ARRAY<Type>
  – MAP<PrimitiveType, Type>
  – STRUCT<a:INT, b:STRING, c:STRING>

• HBase does not have types
  – Bytes.toBytes()

Page 22: Integration of Hive and HBase

Type Mapping

• Table level property
  "hbase.table.default.storage.type" = "binary"

• Type mapping can be given per column after #
  – Any prefix of "binary", e.g. u:url#b
  – Any prefix of "string", e.g. u:url#s
  – The dash char "-", e.g. u:url#-

CREATE TABLE short_urls(
  short_url string,
  url string,
  hit_count int,
  props map<string,string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s");
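To show what the per-column suffix changes in practice, here is a minimal Python sketch. The helper names `storage_type` and `encode_int` are hypothetical. It assumes that a missing suffix or a dash falls back to the table-level default, and it mimics the binary layout of Bytes.toBytes(int) as a 4-byte big-endian value.

```python
import struct


def storage_type(entry, table_default="string"):
    """Resolve the storage type of one mapping entry such as 'u:url#b'."""
    column, _, suffix = entry.partition("#")
    if suffix in ("", "-"):
        # Assumption: no suffix or a dash means "use the table default"
        kind = table_default
    elif "binary".startswith(suffix):
        kind = "binary"
    elif "string".startswith(suffix):
        kind = "string"
    else:
        raise ValueError("unknown storage type: " + suffix)
    return column, kind


def encode_int(value, kind):
    """Encode a Hive INT either as raw bytes or as its decimal string."""
    if kind == "binary":
        return struct.pack(">i", value)  # 4 bytes, big-endian
    return str(value).encode("utf-8")


hits_binary = encode_int(100, storage_type("s:hits#b")[1])
hits_string = encode_int(100, storage_type("s:hits#s")[1])
```

The same hit count of 100 occupies four fixed-width bytes in binary form and three ASCII characters in string form, which is why the two encodings are not interchangeable once data has been written.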

Page 23: Integration of Hive and HBase

Type Mapping

• If the type is not a primitive or Map, it is converted to a JSON string and serialized

• Still a few rough edges for schema and type mapping:
  – No Hive BINARY support in HBase mapping
  – No mapping of HBase timestamp (can only provide put timestamp)
  – No arbitrary mapping of Structs / Arrays into HBase schema

Page 24: Integration of Hive and HBase

Bulk Load

• Steps to bulk load:
  – Sample source data for range partitioning
  – Save sampling results to a file
  – Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner
  – Import HFiles into HBase table

• Ideal setup should be

SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE web_table SELECT ….
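The first and third bulk-load steps rest on one idea: sample the keys, derive split points, and route every row to the range that contains its key. A minimal Python sketch of that idea follows. The function names are hypothetical, the sampling is plain uniform sampling, and nothing here is the actual Hive or Hadoop implementation.

```python
import bisect
import random


def choose_split_points(keys, num_partitions, sample_size=1000, seed=0):
    """Pick num_partitions - 1 split points from a random sample of keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(i * step)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Return the index of the key range that contains key."""
    return bisect.bisect_right(split_points, key)


keys = ["user%05d" % i for i in range(10000)]
split_points = choose_split_points(keys, 4)
partitions = [partition_for(key, split_points) for key in sorted(keys)]
```

Because the assignment is monotonic in the key, each partition holds one contiguous key range, so every output file can be written in sorted order and covers a range disjoint from the others.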

Page 25: Integration of Hive and HBase

Filter Pushdown

Page 26: Integration of Hive and HBase

Filter Pushdown

• Idea is to pass down filter expressions to the storage layer to minimize scanned data

• Gives access to indexes at HDFS or HBase

• Example:

CREATE EXTERNAL TABLE users (userid BIGINT, email STRING, … )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…");

SELECT ... FROM users WHERE userid > 1000000 AND email LIKE '%@gmail.com';

-> scan.setStartRow(Bytes.toBytes(1000000L))
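The saving from the start row can be seen in a small simulation. This is a minimal Python sketch over an in-memory list sorted by user id, not HBase client code. The function name `run_query` and the generated data are hypothetical.

```python
import bisect


def run_query(user_ids, emails, min_userid, suffix, pushdown):
    """Return (matching ids, rows scanned) over rows sorted by user id."""
    # With pushdown, seek to the first key >= min_userid, like a scan start row.
    start = bisect.bisect_left(user_ids, min_userid) if pushdown else 0
    matches, scanned = [], 0
    for i in range(start, len(user_ids)):
        scanned += 1
        # The full predicate is still re-checked: the start row is inclusive,
        # and the LIKE condition was never pushed down.
        if user_ids[i] > min_userid and emails[i].endswith(suffix):
            matches.append(user_ids[i])
    return matches, scanned


user_ids = list(range(0, 2_000_000, 1000))
emails = [
    "u%d@gmail.com" % i if i % 2 == 0 else "u%d@example.com" % i
    for i in range(len(user_ids))
]
full = run_query(user_ids, emails, 1_000_000, "@gmail.com", pushdown=False)
pushed = run_query(user_ids, emails, 1_000_000, "@gmail.com", pushdown=True)
```

Both runs return the same matches, but the pushed-down run touches only the rows at or after the start key, which here is half of the table.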

Page 27: Integration of Hive and HBase

Filter Decomposition

• Optimizer pushes down the predicates to the query plan

• Storage handlers can negotiate with the Hive optimizer to decompose the filter

x > 3 AND upper(y) = 'XYZ'

• Handle x > 3, send upper(y) = 'XYZ' as residual for Hive

• Works with:
  key = 3, key > 3, etc.
  key > 3 AND key < 100

• Only works against constant expressions
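The negotiation can be pictured as splitting a conjunction into two lists. The following minimal Python sketch uses hypothetical types (`Predicate`, `ColumnRef`) and a hypothetical `decompose` function. The exact set of pushable operators is an assumption, since the slide only names `=` and `>` followed by "etc."

```python
from collections import namedtuple

Predicate = namedtuple("Predicate", "expression operator value")
ColumnRef = namedtuple("ColumnRef", "name")

# Assumption: comparisons a key-range scan could serve.
PUSHABLE_OPERATORS = {"=", ">", ">=", "<", "<="}


def decompose(predicates, key_column):
    """Split AND-ed predicates into (pushed to storage, residual for Hive)."""
    pushed, residual = [], []
    for predicate in predicates:
        is_constant = not isinstance(predicate.value, ColumnRef)
        on_key = predicate.expression == key_column
        if on_key and is_constant and predicate.operator in PUSHABLE_OPERATORS:
            pushed.append(predicate)
        else:
            residual.append(predicate)
    return pushed, residual


where_clause = [
    Predicate("x", ">", 3),
    Predicate("upper(y)", "=", "XYZ"),
    Predicate("x", "<", ColumnRef("z")),
]
pushed, residual = decompose(where_clause, key_column="x")
```

Here only `x > 3` is handed to the storage layer. The function call on `y` and the comparison against another column stay behind as the residual that Hive evaluates on the returned rows.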

Page 28: Integration of Hive and HBase

Security Aspects

Towards fully secure deployments

Page 29: Integration of Hive and HBase

Security – Big Picture

• Security becomes more important to support enterprise-level and multi-tenant applications

• 5 different components to ensure / impose security
  – HDFS
  – MapReduce
  – HBase
  – Zookeeper
  – Hive

• Each component has:
  – Authentication
  – Authorization

Page 30: Integration of Hive and HBase

HBase Security – Closer look

• Released with HBase 0.92

• Fully optional module, disabled by default

• Needs an underlying secure Hadoop release

• SecureRPCEngine: optional engine enforcing SASL authentication
  – Kerberos
  – DIGEST-MD5 based tokens
  – TokenProvider coprocessor

• Access control is implemented as a Coprocessor: AccessController

• Stores and distributes ACL data via Zookeeper
  – Sensitive data is only accessible by HBase daemons
  – Client does not need to authenticate to Zookeeper

Page 31: Integration of Hive and HBase

Hive Security – Closer look

• Hive has different deployment options; security considerations should take the different deployments into account

• Authentication is only supported at Metastore, not on HiveServer, web interface, JDBC

• Authorization is enforced at the query layer (Driver)

• Pluggable authorization providers. Default one stores global/table/partition/column permissions in Metastore

GRANT ALTER ON TABLE web_table TO USER bob;
CREATE ROLE db_reader;
GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader;

Page 32: Integration of Hive and HBase

Hive Deployment Option 1

[Deployment diagram. Components shown: Client, CLI, Driver (Parser, Planner, Optimizer, Execution), MSClient, Metastore, RDBMS, MapReduce, HDFS, HBase; annotated with Authentication (A12n) and Authorization (A11n) checkpoints]

Page 33: Integration of Hive and HBase

Hive Deployment Option 2

[Deployment diagram. Components shown: Client, CLI, Driver (Parser, Planner, Optimizer, Execution), MSClient, Metastore, RDBMS, MapReduce, HDFS, HBase; annotated with Authentication (A12n) and Authorization (A11n) checkpoints]

Page 34: Integration of Hive and HBase

Hive Deployment Option 3

[Deployment diagram. Components shown: Client, CLI, JDBC/ODBC, Hive Web Interface, Hive Thrift Server, Driver (Parser, Planner, Optimizer, Execution), MSClient, Metastore, RDBMS, MapReduce, HDFS, HBase; annotated with Authentication (A12n) and Authorization (A11n) checkpoints]

Page 35: Integration of Hive and HBase

Hive + HBase + Hadoop Security

• Regardless of Hive’s own security, for Hive to work on secure Hadoop and HBase, we should:
  – Obtain delegation tokens for Hadoop and HBase jobs
  – Obey the storage-level (HDFS, HBase) permission checks
  – In HiveServer deployments, authenticate and impersonate the user

• Delegation tokens for Hadoop are already working

• Support for obtaining HBase delegation tokens is released in Hive 0.9.0

Page 36: Integration of Hive and HBase

Future of Hive + HBase

• Improve on schema / type mapping

• Fully secure Hive deployment options

• HBase bulk import improvements

• Sortable signed numeric types in HBase

• Filter pushdown: non key column filters

• Hive random access support for HBase
  – https://cwiki.apache.org/HCATALOG/random-access-framework.html
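The "sortable signed numeric types" item matters because HBase orders keys as raw bytes, while a plain two's-complement encoding puts negative numbers after positive ones. The Python sketch below shows the problem and one common remedy, flipping the sign bit. It is an illustration only, not necessarily the encoding either project adopted, and both function names are hypothetical.

```python
import struct


def plain_bytes(value):
    """4-byte big-endian two's complement, as a plain binary encoding."""
    return struct.pack(">i", value)


def sortable_bytes(value):
    """Shift the 32-bit range so that byte order matches numeric order."""
    # Adding 2**31 is equivalent to flipping the sign bit.
    return struct.pack(">I", (value + 2**31) & 0xFFFFFFFF)


values = [-5, -1, 0, 3, 100]
by_plain = sorted(values, key=plain_bytes)
by_sortable = sorted(values, key=sortable_bytes)
```

Sorting by the plain encoding places `-5` and `-1` after `100`, whereas the sign-flipped encoding preserves numeric order, which is what range scans and key filter pushdown on signed keys rely on.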

Page 37: Integration of Hive and HBase

References

• Security
  – https://issues.apache.org/jira/browse/HIVE-2764
  – https://issues.apache.org/jira/browse/HBASE-5371
  – https://issues.apache.org/jira/browse/HCATALOG-245
  – https://issues.apache.org/jira/browse/HCATALOG-260
  – https://issues.apache.org/jira/browse/HCATALOG-244
  – https://cwiki.apache.org/confluence/display/HCATALOG/Hcat+Security+Design

• Type mapping / Filter Pushdown
  – https://issues.apache.org/jira/browse/HIVE-1634
  – https://issues.apache.org/jira/browse/HIVE-1226
  – https://issues.apache.org/jira/browse/HIVE-1643
  – https://issues.apache.org/jira/browse/HIVE-2815


Page 39: Integration of Hive and HBase

Thanks

Questions?