Hive/HBase Integration
or, MaybeSQL?
April 2010
John Sichi
Agenda
» Use Cases
» Architecture
» Storage Handler
» Load via INSERT
» Query Processing
» Bulk Load
» Q & A
Motivations
» Data, data, and more data
› 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
› About 8x increase per year
» Queries, queries, and more queries
› More than 200 unique users querying per day
› 7500+ queries on production cluster per day; a mixture of ad-hoc queries and ETL/reporting queries
» They want it all and they want it now
› Users expect faster response times on fresher data
› Sampled subsets aren’t always good enough
How Can HBase Help?
» Replicate dimension tables from transactional databases with low latency and without sharding
› (Fact data can stay in Hive since it is append-only)
» Only move changed rows
› “Full scrape” is too slow and doesn’t scale as data keeps growing
› Hive by itself is not good at row-level operations
» Integrate into Hive’s map/reduce query execution plans for full parallel distributed processing
» Multiversioning for snapshot consistency?
Use Case 1: HBase As ETL Data Target
[Diagram: Source Files/Tables -> Hive INSERT … SELECT … -> HBase]
Use Case 2: HBase As Data Source
[Diagram: HBase and Other Files/Tables -> Hive SELECT … JOIN … GROUP BY … -> Query Result]
Use Case 3: Low Latency Warehouse
[Diagram: Continuous Update flows into HBase, Periodic Load flows into Other Files/Tables; Hive Queries read from both]
HBase Architecture
Facebook
From http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
Hive Architecture
All Together Now!
Hive CLI With HBase
» Minimum configuration needed:
hive \
--auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar \
-hiveconf hbase.zookeeper.quorum=zk1,zk2…
hive> create table …
Storage Handler
CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' =
    'small:name,small:email,large:notes')
TBLPROPERTIES (
  'hbase.table.name' = 'user_list');
Column Mapping
» First column in table is always the row key
» Other columns can be mapped to either:
› An HBase column (any Hive type)
› An HBase column family (must be MAP type in Hive)
» Multiple Hive columns can map to the same HBase column or family
» Limitations
› Currently no control over type mapping (always string in HBase)
› Currently no way to map the HBase timestamp attribute
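As an illustration of the family-level mapping above, a whole HBase column family can be exposed as a Hive MAP column. This is a sketch, not from the slides: the table and column names are invented, and the trailing-colon syntax for mapping an entire family is an assumption about the mapping string format.

```sql
-- Hypothetical example (names invented): map one HBase column
-- plus a whole column family.  The "notes" MAP column receives
-- every qualifier under the "large" family as a map key.
CREATE TABLE user_notes(
  userid int,                 -- first column is always the row key
  name string,                -- maps to the small:name column
  notes map<string,string>)   -- maps to the entire large: family
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = 'small:name,large:');
```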
Load Via INSERT
INSERT OVERWRITE TABLE users
SELECT * FROM …;

» Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
» HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
» Multiple rows with the same key -> only one row written
» Limitations
› No write atomicity yet
› No way to delete rows
› Write parallelism is query-dependent (map vs. reduce)
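To illustrate the last limitation, whether the HBase writes happen in mappers or reducers depends on the shape of the query. A sketch, assuming a hypothetical staging table `staged_users` with the same columns as `users`:

```sql
-- Map-only: a simple filter needs no reduce phase, so each mapper
-- writes to HBase directly (parallelism = number of input splits).
INSERT OVERWRITE TABLE users
SELECT userid, name, email, notes
FROM staged_users
WHERE email IS NOT NULL;

-- With aggregation, rows are emitted from the reduce phase instead
-- (parallelism = number of reducers).
INSERT OVERWRITE TABLE users
SELECT userid, max(name), max(email), max(notes)
FROM staged_users
GROUP BY userid;
```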
Map-Reduce Job for INSERT
[Diagram: map/reduce job writing to HBase from the reduce phase]
From http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png
Map-Only Job for INSERT
[Diagram: map-only job writing to HBase directly from the mappers]
Query Processing
SELECT name, notes FROM users
WHERE userid='xyz';

» Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
» HBase determines the splits (one per table region)
» HBaseSerDe produces lazy rows/maps for RowResults
» Column selection is pushed down
» Any SQL can be used (join, aggregation, union…)
» Limitations
› Currently no filter pushdown
› How do we achieve locality?
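Because any SQL works against the mapped table, an HBase-backed dimension table can be joined against ordinary Hive data. A sketch, where `page_views` is an invented Hive fact table used only for illustration:

```sql
-- Sketch: join the HBase-backed users table against a Hive fact
-- table.  Only the referenced HBase columns are fetched (column
-- selection pushdown); the join and aggregation run in map/reduce.
SELECT u.name, count(*) AS views
FROM page_views pv
JOIN users u ON (u.userid = pv.userid)
GROUP BY u.name;
```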
Metastore Integration
» DDL can be used to create metadata in Hive and HBase simultaneously and consistently
» CREATE EXTERNAL TABLE: register an existing HBase table
» DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
» Limitations
› No two-phase commit for DDL operations
› ALTER TABLE is not yet implemented
› Partitioning is not yet defined
› No secondary indexing
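The EXTERNAL case above can be sketched as follows; the mapping must match the layout of the pre-existing HBase table (here the `user_list` table from the earlier slide):

```sql
-- Sketch: register an already-existing HBase table without taking
-- ownership of it.  DROP TABLE afterwards removes only the Hive
-- metadata, leaving the HBase table intact.
CREATE EXTERNAL TABLE existing_users(
  userid int, name string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = 'small:name')
TBLPROPERTIES (
  'hbase.table.name' = 'user_list');
```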
Bulk Load
Ideally…
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT … ;

But for now, you have to do some work and issue multiple Hive commands:
1. Sample source data for range partitioning
2. Save sampling results to a file
3. Run a CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
4. Import HFiles into HBase
5. HBase can merge files if necessary
Range Partitioning During Sort
[Diagram: TotalOrderPartitioner splits the sorted data into ranges A-G, H-Q, and R-Z at boundary keys (H) and (R); loadtable.rb imports the resulting HFiles into HBase]
Sampling Query For Range Partitioning
Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.
select user_id from
(select user_id
from hive_user_table
tablesample(bucket 1 out of 1000 on user_id) s
order by user_id) sorted_user_5k_sample
where (row_sequence() % 501)=0;
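Step 2 of the workflow (saving the sampled keys to a file) can also be driven from Hive. This is one possible sketch following the bulk-load wiki approach: the boundary keys are written as a sequence file to the path later given to TotalOrderPartitioner. The SerDe and output-format class names are believed correct for Hive trunk at the time, but treat them as assumptions.

```sql
-- Sketch: store the 9 boundary keys as a sequence file of
-- binary-sortable keys for TotalOrderPartitioner to read.
create external table hb_range_keys(user_id_range_start string)
row format serde
  'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat
    'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
location '/tmp/hb_range_keys';

insert overwrite table hb_range_keys
select user_id from ...;  -- the sampling query shown above

-- TotalOrderPartitioner expects a single file at the configured path:
dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list;
```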
Sorting Query For Bulk Load
set mapred.reduce.tasks=12;
set hive.mapred.partitioner=
  org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/hb_range_key_list;
set hfile.compression=gz;

create table hbsort(user_id string, user_type string, ...)
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

insert overwrite table hbsort
select user_id, user_type, createtime, …
from hive_user_table
cluster by user_id;
Deployment
» Latest Hive trunk (will be in Hive 0.6.0)
» Requires Hadoop 0.20+
» Tested with HBase 0.20.3 and Zookeeper 3.2.2
» 20-node hbtest cluster at Facebook
» No performance numbers yet
› Currently setting up tests with about 6 TB (gz compressed)
Questions?
» http://wiki.apache.org/hadoop/Hive/HBaseIntegration
» http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
» Special thanks to Samuel Guo for the early versions of the integration code
Hey, What About HBQL?
» HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
» HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs