Hadoop, Hbase, And Hive


DESCRIPTION

Good notes for Hadoop, Hbase and Hive


  • Hive/HBase Integration, or, MaybeSQL?
    April 2010
    John Sichi, Facebook

  • Agenda
    • Use Cases
    • Architecture
    • Storage Handler
    • Load via INSERT
    • Query Processing
    • Bulk Load
    • Q & A


  • Motivations
    • Data, data, and more data
      • 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
      • About 8x increase per year
    • Queries, queries, and more queries
      • More than 200 unique users querying per day
      • 7500+ queries on production cluster per day; mixture of ad-hoc queries and ETL/reporting queries
    • They want it all and they want it now
      • Users expect faster response time on fresher data
      • Sampled subsets aren't always good enough


  • How Can HBase Help?
    • Replicate dimension tables from transactional databases with low latency and without sharding
      • (Fact data can stay in Hive since it is append-only)
    • Only move changed rows
      • Full scrape is too slow and doesn't scale as data keeps growing
      • Hive by itself is not good at row-level operations
    • Integrate into Hive's map/reduce query execution plans for full parallel distributed processing
    • Multiversioning for snapshot consistency?


  • Use Case 1: HBase As ETL Data Target
    (Diagram labels: source files/tables, Hive INSERT SELECT, HBase)

  • Use Case 2: HBase As Data Source
    (Diagram labels: HBase, other files/tables, Hive SELECT JOIN GROUP BY, query result)

  • Use Case 3: Low Latency Warehouse
    (Diagram labels: HBase, other files/tables, periodic load, continuous update, Hive queries)

  • HBase Architecture
    (Diagram from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)

  • Hive Architecture
    (Diagram)

  • All Together Now!
    (Diagram)

  • Hive CLI With HBase
    Minimum configuration needed:

    hive \
      --auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar \
      -hiveconf hbase.zookeeper.quorum=zk1,zk2

    hive> create table ...

  • Storage Handler

    CREATE TABLE users(
      userid int, name string, email string, notes string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = "small:name,small:email,large:notes")
    TBLPROPERTIES (
      "hbase.table.name" = "user_list");

  • Column Mapping
    • First column in table is always the row key
    • Other columns can be mapped to either:
      • An HBase column (any Hive type)
      • An HBase column family (must be MAP type in Hive)
    • Multiple Hive columns can map to the same HBase column or family
    • Limitations
      • Currently no control over type mapping (always string in HBase)
      • Currently no way to map HBase timestamp attribute
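
The mapping rules above can be sketched in a few lines of Python. This is only an illustration, not the handler's actual code: the function name `map_columns` is hypothetical, and treating an entry with an empty qualifier (such as `large:`) as a whole-family mapping is an assumption about the spelling.

```python
def map_columns(hive_columns, mapping_spec):
    """Pair Hive columns with HBase targets under the rules listed above."""
    # The first Hive column is always the row key and has no mapping entry.
    row_key, *rest = hive_columns
    targets = mapping_spec.split(",")
    if len(rest) != len(targets):
        raise ValueError("one mapping entry is needed per non-key column")
    result = {row_key: ("<row key>", None)}
    for column, target in zip(rest, targets):
        family, _, qualifier = target.partition(":")
        # Assumption: an empty qualifier means the whole family is mapped,
        # which would require a MAP-typed column on the Hive side.
        result[column] = (family, qualifier or None)
    return result


mapping = map_columns(
    ["userid", "name", "email", "notes"],
    "small:name,small:email,large:notes",
)
for column, target in mapping.items():
    print(column, "->", target)
```

With the `users` table from the Storage Handler slide, `userid` becomes the row key while `name` and `email` land in family `small` and `notes` in family `large`.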

  • Load Via INSERT

    INSERT OVERWRITE TABLE users
    SELECT * FROM ...;

    • Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
    • HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
    • Multiple rows with same key -> only one row written
    • Limitations
      • No write atomicity yet
      • No way to delete rows
      • Write parallelism is query-dependent (map vs reduce)
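
Two of the behaviors above (string conversion and one surviving row per key) can be pictured with a toy in-memory table. This is a sketch only: `write_rows` is a hypothetical helper, and the choice that the last row processed wins is an assumption, since the slide does not say which duplicate survives.

```python
def write_rows(rows):
    """Simulate writes keyed by row key: a later row replaces an earlier one."""
    table = {}
    for key, *values in rows:
        # Every value, including the key, is stored in its string form.
        table[str(key)] = [str(v) for v in values]
    return table


rows = [(1, "ann", 3.5), (2, "bob", 4.0), (1, "ann2", 9)]
table = write_rows(rows)
print(table)
```

Three input rows yield two stored rows because key `1` appears twice, and the numeric values come back as strings.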


  • Map-Reduce Job for INSERT
    (Diagram with output to HBase; from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png)

  • Map-Only Job for INSERT
    (Diagram with output to HBase)

  • Query Processing

    SELECT name, notes FROM users WHERE userid=xyz;

    • Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
    • HBase determines the splits (one per table region)
    • HBaseSerDe produces lazy rows/maps for RowResults
    • Column selection is pushed down
    • Any SQL can be used (join, aggregation, union)
    • Limitations
      • Currently no filter pushdown
      • How do we achieve locality?
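
The read path above can be pictured with a small simulation: one split per region, only the requested columns returned by the scan, and the WHERE predicate applied after the rows come back. The `REGIONS` data and the function names are hypothetical stand-ins, not HBase or Hive APIs.

```python
# Hypothetical in-memory "regions": each is a list of (row_key, {column: value}).
REGIONS = [
    [("a1", {"small:name": "Ann", "small:email": "ann@example.com", "large:notes": "n1"})],
    [("m7", {"small:name": "Max", "small:email": "max@example.com", "large:notes": "n2"})],
]


def read_split(region, wanted_columns):
    """Scan one region, returning only the requested columns (column pushdown)."""
    for row_key, cells in region:
        yield row_key, {c: cells[c] for c in wanted_columns if c in cells}


def query(wanted_columns, predicate):
    """Process one split per region; the predicate runs after rows are read."""
    results = []
    for region in REGIONS:  # one split per table region
        for row_key, cells in read_split(region, wanted_columns):
            if predicate(row_key):  # the filter is not pushed down to the scan
                results.append((row_key, cells))
    return results


rows = query(["small:name", "large:notes"], lambda key: key == "m7")
print(rows)
```

Every region is scanned even though only one row matches, which is the cost of having no filter pushdown; the `small:email` column is never returned because it was not selected.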

  • Metastore Integration
    • DDL can be used to create metadata in Hive and HBase simultaneously and consistently
    • CREATE EXTERNAL TABLE: register existing HBase table
    • DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
    • Limitations
      • No two-phase-commit for DDL operations
      • ALTER TABLE is not yet implemented
      • Partitioning is not yet defined
      • No secondary indexing

  • Bulk Load
    • Ideally

    SET hive.hbase.bulk=true;
    INSERT OVERWRITE TABLE users SELECT ...;

    • But for now, you have to do some work and issue multiple Hive commands
      • Sample source data for range partitioning
      • Save sampling results to a file
      • Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
      • Import HFiles into HBase
      • HBase can merge files if necessary

  • Range Partitioning During Sort
    (Diagram labels: key ranges A-G, H-Q, R-Z; split points (H) and (R); TotalOrderPartitioner; loadtable.rb; HBase)
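
A minimal sketch of range partitioning with the two split points from the diagram, assuming keys are compared as plain strings and that a key equal to a split point belongs to the range that starts there. The `partition` function is illustrative and is not the TotalOrderPartitioner implementation.

```python
import bisect

SPLIT_POINTS = ["H", "R"]  # the (H) and (R) boundaries from the diagram


def partition(key, split_points=SPLIT_POINTS):
    """Return the index of the key range (0 = A-G, 1 = H-Q, 2 = R-Z) holding key."""
    return bisect.bisect_right(split_points, key)


keys = ["Quinn", "Alice", "Zed", "Harry", "Rob", "Greg"]
buckets = {0: [], 1: [], 2: []}
for k in keys:
    buckets[partition(k)].append(k)
for index in buckets:
    buckets[index].sort()  # each range is sorted independently
print(buckets)
```

Because every key in range 0 sorts before every key in range 1, and so on, concatenating the sorted ranges gives a totally ordered result, which is what allows each range to be written out as its own set of files.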

  • Sampling Query For Range Partitioning
    Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.

    select user_id from
      (select user_id
       from hive_user_table
       tablesample(bucket 1 out of 1000 on user_id) s
       order by user_id) sorted_user_5k_sample
    where (row_sequence() % 501)=0;
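
The arithmetic behind the `% 501` filter can be checked with a short sketch. It assumes `row_sequence()` numbers rows starting at 1 and uses a list of consecutive integers as a stand-in for the sorted 5000-row sample.

```python
TOTAL_USERS = 5_000_000
BUCKETS = 1000
STEP = 501

sample_size = TOTAL_USERS // BUCKETS  # rows in one bucket: 5000
# Stand-in for the sorted sample; any sorted list of 5000 ids behaves the same.
sorted_sample = list(range(sample_size))
# Assumption: row_sequence() counts from 1, so keep rows 501, 1002, ...
split_points = [
    uid for seq, uid in enumerate(sorted_sample, start=1) if seq % STEP == 0
]
print(len(split_points))
```

Only 9 multiples of 501 fit into 5000 rows (501 x 9 = 4509, while 501 x 10 = 5010), so the query returns 9 keys and therefore 10 ranges. A step of 500 would instead select a tenth key at the very last row of the sample.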

  • Sorting Query For Bulk Load

    set mapred.reduce.tasks=12;
    set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
    set total.order.partitioner.path=/tmp/hb_range_key_list;
    set hfile.compression=gz;

    create table hbsort(user_id string, user_type string, ...)
    stored as
      inputformat 'org.apache.hadoop.mapred.TextInputFormat'
      outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
    tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

    insert overwrite table hbsort
    select user_id, user_type, createtime, ...
    from hive_user_table
    cluster by user_id;


  • Deployment
    • Latest Hive trunk (will be in Hive 0.6.0)
    • Requires Hadoop 0.20+
    • Tested with HBase 0.20.3 and Zookeeper 3.2.2
    • 20-node hbtest cluster at Facebook
    • No performance numbers yet
      • Currently setting up tests with about 6TB (gz compressed)

  • Questions?
    • hive-user@hadoop.apache.org
    • jsichi@facebook.com
    • http://wiki.apache.org/hadoop/Hive/HBaseIntegration
    • http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

    Special thanks to Samuel Guo for the early versions of the integration code

  • Hey, What About HBQL?
    • HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
    • HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs