Hadoop, HBase, And Hive

Hive/HBase Integration
or, MaybeSQL?
April 2010
John Sichi
Facebook

Agenda
- Use Cases
- Architecture
- Storage Handler
- Load via INSERT
- Query Processing
- Bulk Load
- Q & A

Motivations
- Data, data, and more data
  - 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
  - About 8x increase per year
- Queries, queries, and more queries
  - More than 200 unique users querying per day
  - 7500+ queries on production cluster per day; mixture of ad-hoc queries and ETL/reporting queries
- They want it all and they want it now
  - Users expect faster response time on fresher data
  - Sampled subsets aren't always good enough

How Can HBase Help?
- Replicate dimension tables from transactional databases with low latency and without sharding
  - (Fact data can stay in Hive since it is append-only)
- Only move changed rows
  - Full scrape is too slow and doesn't scale as data keeps growing
  - Hive by itself is not good at row-level operations
- Integrate into Hive's map/reduce query execution plans for full parallel distributed processing
- Multiversioning for snapshot consistency?

Use Case 1: HBase As ETL Data Target
[Diagram: source files/tables -> Hive INSERT ... SELECT ... -> HBase]

Use Case 2: HBase As Data Source
[Diagram: HBase and other files/tables -> Hive SELECT ... JOIN ... GROUP BY ... -> query result]

Use Case 3: Low Latency Warehouse
[Diagram labels: HBase, other files/tables, periodic load, continuous update, Hive queries]

HBase Architecture
[Diagram from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]

Hive Architecture
[Diagram]

All Together Now!
[Diagram]

Hive CLI With HBase
Minimum configuration needed:

    hive \
      --auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar \
      -hiveconf hbase.zookeeper.quorum=zk1,zk2

    hive> create table ...

Storage Handler

    CREATE TABLE users(
      userid int, name string, email string, notes string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = "small:name,small:email,large:notes")
    TBLPROPERTIES (
      "hbase.table.name" = "user_list");

Column Mapping
- First column in table is always the row key
- Other columns can be mapped to either:
  - An HBase column (any Hive type)
  - An HBase column family (must be MAP type in Hive)
- Multiple Hive columns can map to the same HBase column or family
- Limitations
  - Currently no control over type mapping (always string in HBase)
  - Currently no way to map HBase timestamp attribute
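The mapping rules can be illustrated with a small Python sketch. This is not the HBaseSerDe implementation: the helper name map_columns is hypothetical, and the sketch assumes that the first Hive column is the row key, that each remaining Hive column pairs positionally with one entry of hbase.columns.mapping, and that an entry with an empty qualifier ("family:") stands for a whole column family.

```python
def map_columns(hive_columns, columns_mapping):
    """Pair Hive columns with HBase targets (hypothetical helper, not Hive code)."""
    # The first Hive column is always the row key and has no mapping entry.
    key_column, *value_columns = hive_columns
    targets = columns_mapping.split(",")
    if len(targets) != len(value_columns):
        raise ValueError("need one mapping entry per non-key Hive column")
    mapping = {}
    for hive_col, target in zip(value_columns, targets):
        family, _, qualifier = target.partition(":")
        # Assumption: an empty qualifier means the entire family,
        # which would require a MAP-typed Hive column.
        mapping[hive_col] = (family, qualifier or None)
    return key_column, mapping


key, mapping = map_columns(
    ["userid", "name", "email", "notes"],
    "small:name,small:email,large:notes",
)
```

With the users table from the Storage Handler slide, userid becomes the row key, name and email land in family small, and notes lands in family large.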

Load Via INSERT

    INSERT OVERWRITE TABLE users
    SELECT * FROM ...;

- Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
- HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
- Multiple rows with same key -> only one row written
- Limitations
  - No write atomicity yet
  - No way to delete rows
  - Write parallelism is query-dependent (map vs reduce)
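The "same key -> only one row" behavior and the string conversion can be sketched in Python with a plain dictionary standing in for the HBase table. The function write_rows and the sample data are made up for illustration. The sketch writes sequentially, so the last row wins; in a real parallel job, which row survives would depend on write order.

```python
def write_rows(rows):
    """Simulate keyed writes into a table (hypothetical stand-in for HBase)."""
    table = {}
    for key, *values in rows:
        # Every value is stored as a string. A later row with the same key
        # overwrites the earlier one because both carry the same columns.
        table[str(key)] = [str(v) for v in values]
    return table


table = write_rows([
    (1, "alice", "alice@example.com"),
    (2, "bob", "bob@example.com"),
    (1, "alice2", "alice2@example.com"),  # duplicate key 1
])
```

Three input rows produce two stored rows, which is why a source query with duplicate keys loads fewer rows than it selects.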

Map-Reduce Job for INSERT
[Diagram: map/reduce job with output written to HBase. From http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png]

Map-Only Job for INSERT
[Diagram: map-only job with output written to HBase]

Query Processing

    SELECT name, notes FROM users WHERE userid=xyz;

- Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
- HBase determines the splits (one per table region)
- HBaseSerDe produces lazy rows/maps for RowResults
- Column selection is pushed down
- Any SQL can be used (join, aggregation, union)
- Limitations
  - Currently no filter pushdown
  - How do we achieve locality?
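A toy Python model may make the difference between column pushdown and the missing filter pushdown concrete. The region contents and function names are invented for illustration. Each region yields one split, only the requested columns are fetched, and the WHERE condition is evaluated after every row has been read.

```python
# Toy table with two regions, keyed by userid; all values are strings.
regions = [
    {"1": {"small:name": "alice", "small:email": "a@example.com", "large:notes": "n1"}},
    {"7": {"small:name": "bob", "small:email": "b@example.com", "large:notes": "n7"}},
]


def make_splits(regions):
    """One input split per table region."""
    return list(range(len(regions)))


def scan_split(regions, split, columns):
    """Read every row of a region, fetching only the requested columns."""
    for key, cells in regions[split].items():
        yield key, {c: cells[c] for c in columns}


def query(regions, userid, columns):
    results = []
    for split in make_splits(regions):  # each split would be its own map task
        for key, cells in scan_split(regions, split, columns):
            if key == userid:  # filter runs after the read, not inside the scan
                results.append(cells)
    return results


rows = query(regions, "7", ["small:name", "large:notes"])
```

Even a single-key lookup like this one scans both regions in full, which is the cost the "no filter pushdown" limitation refers to.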

Metastore Integration
- DDL can be used to create metadata in Hive and HBase simultaneously and consistently
- CREATE EXTERNAL TABLE: register existing HBase table
- DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
- Limitations
  - No two-phase-commit for DDL operations
  - ALTER TABLE is not yet implemented
  - Partitioning is not yet defined
  - No secondary indexing

Bulk Load
- Ideally:

    SET hive.hbase.bulk=true;
    INSERT OVERWRITE TABLE users SELECT ...;

- But for now, you have to do some work and issue multiple Hive commands:
  1. Sample source data for range partitioning
  2. Save sampling results to a file
  3. Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
  4. Import HFiles into HBase
     - HBase can merge files if necessary

Range Partitioning During Sort
[Diagram: TotalOrderPartitioner splits the sorted output at keys (H) and (R) into ranges A-G, H-Q, and R-Z; loadtable.rb loads the resulting files into HBase]
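The range assignment in the diagram can be sketched in Python with a binary search over the sorted split points. This only illustrates the idea and is not Hadoop's TotalOrderPartitioner code. The sample keys are made up, and the sketch assumes that a key equal to a split point falls into the upper range.

```python
from bisect import bisect_right


def partition_for(key, split_points):
    """Return the index of the range holding key; split_points must be sorted."""
    # Counts how many split points are <= key, which is the range index.
    return bisect_right(split_points, key)


split_points = ["H", "R"]  # the two boundaries shown in the diagram
names = ["Alice", "Gus", "Harry", "Quinn", "Rita", "Zed"]  # made-up keys
assignment = {name: partition_for(name, split_points) for name in names}
```

Two split points give three ranges (0 for A-G, 1 for H-Q, 2 for R-Z), so every reducer receives one contiguous, non-overlapping key range and its sorted output can become a region file.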

Sampling Query For Range Partitioning
Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.

    select user_id from
      (select user_id
       from hive_user_table
       tablesample(bucket 1 out of 1000 on user_id) s
       order by user_id) sorted_user_5k_sample
    where (row_sequence() % 501)=0;
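The choice of 501 as the modulus can be checked with a short Python calculation. The integer list is a stand-in for the 5000 sorted sample user_ids, and the sketch assumes row_sequence() numbers rows starting from 1.

```python
sample_size = 5000  # one bucket out of 1000, taken from 5 million users
sorted_sample = list(range(1, sample_size + 1))  # stand-in for sorted user_ids

# Keep the rows whose sequence number is a multiple of 501.
split_points = [
    uid for seq, uid in enumerate(sorted_sample, start=1) if seq % 501 == 0
]

# Number of sample rows in each range delimited by the split points.
range_sizes = [
    upper - lower
    for lower, upper in zip([0] + split_points, split_points + [sample_size])
]
```

This selects 9 split points (rows 501 through 4509), giving 10 ranges of 501 sample rows each except the last, which holds 491. A modulus of 500 would also select row 5000 and yield 10 split points, one more than needed.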

Sorting Query For Bulk Load

    set mapred.reduce.tasks=12;
    set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
    set total.order.partitioner.path=/tmp/hb_range_key_list;
    set hfile.compression=gz;

    create table hbsort(user_id string, user_type string, ...)
    stored as
      inputformat 'org.apache.hadoop.mapred.TextInputFormat'
      outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
    tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

    insert overwrite table hbsort
    select user_id, user_type, createtime, ...
    from hive_user_table
    cluster by user_id;

Deployment
- Latest Hive trunk (will be in Hive 0.6.0)
- Requires Hadoop 0.20+
- Tested with HBase 0.20.3 and Zookeeper 3.2.2
- 20-node hbtest cluster at Facebook
- No performance numbers yet
  - Currently setting up tests with about 6TB (gz compressed)

[email protected]@facebook.com
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

Special thanks to Samuel Guo for the early versions of the integration code

Hey, What About HBQL?
- HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
- HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs
