ApacheCon Europe 2012: Operating HBase - Things you need to know

Operating HBase – Things You Need to Know

Christian Gügi

Outline

● HBase internals

● Overview of HBase utilities

● HBase split visualisation with Hannibal

● Challenges & lessons learned

● Resources to get started

About me

● Software Architect @ Sentric

● Founder and organizer of the Swiss Big Data User Group (http://www.bigdata-usergroup.ch)

● Contact: christian.guegi@sentric.ch, http://www.sentric.ch, @chrisgugi

HBase Internals

Data Model

● A sparse, multi-dimensional, sorted map

● A table consists of rows, each with a row key

● Each row may have any number of columns

● Rows are sorted lexicographically based on row key

● Column = Column Family : Column Qualifier

– Cell: {row key, column, timestamp} → value

● Region: contiguous set of sorted rows

● Region: unit of distribution and availability

[Bigtable: A Distributed Storage System for Structured Data]
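The data model above can be sketched as a plain Python dictionary. This is a toy stand-in, not the HBase client API: the sample rows (borrowed from the style of the Bigtable paper), the function names and the two region start keys are all assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical in-memory stand-in for one table (not HBase code).
# Each cell is addressed by (row key, "family:qualifier", timestamp).
cells = {
    ("com.cnn.www", "anchor:cnnsi.com", 9): "CNN",
    ("com.cnn.www", "contents:html", 5): "<html>v5",
    ("com.cnn.www", "contents:html", 6): "<html>v6",
    ("org.apache.hbase", "contents:html", 3): "<html>hbase",
}

def get_latest(row, column):
    """Return the value with the highest timestamp for (row, column), or None."""
    versions = [(ts, v) for (r, c, ts), v in cells.items() if r == row and c == column]
    return max(versions)[1] if versions else None

def rows_sorted():
    """Row keys in lexicographic order, the order a scan would return them in."""
    return sorted({r for (r, _, _) in cells})

# Regions are contiguous ranges of the sorted row space, described here by
# their start keys (the first region starts at the empty key). Assumed values.
region_start_keys = ["", "n"]

def region_for(row):
    """Index of the region whose key range contains the given row key."""
    return bisect_right(region_start_keys, row) - 1
```

The map is sparse because a row only stores the columns it actually has; nothing is reserved for absent cells.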

Physical Data Organization

[Figure: Region with the "content" column family: Memstore in memory, HFile on HDFS]

● Column families are stored separately on disk

– Unit of access control with different patterns

● Writes are held (sorted) in memory until flush

● Sorted on disk in predictable order

– By row key, column key, descending timestamp

[Figure: "anchor" column family: Memstore in memory, HFile on HDFS]
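A minimal sketch of the write path described on this slide, assuming a toy `Memstore` class that is not part of HBase. The flush threshold is counted in cells here purely for brevity (HBase sizes the memstore in bytes), and the toy sorts at flush time rather than keeping the buffer sorted on every write.

```python
class Memstore:
    """Toy write buffer for one column family (illustrative, not HBase code)."""

    def __init__(self, flush_threshold):
        # Threshold in number of cells: an assumption made to keep the sketch short.
        self.flush_threshold = flush_threshold
        self.cells = []
        self.hfiles = []  # each flushed "HFile" is an immutable, sorted tuple

    @staticmethod
    def sort_key(cell):
        row, column, timestamp, _value = cell
        # Row key ascending, column ascending, timestamp descending (newest first).
        return (row, column, -timestamp)

    def put(self, row, column, timestamp, value):
        self.cells.append((row, column, timestamp, value))
        if len(self.cells) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffered cells out as one sorted file and clear the buffer."""
        if self.cells:
            self.hfiles.append(tuple(sorted(self.cells, key=self.sort_key)))
            self.cells = []
```

Because every flushed file has the same predictable order, files can later be merged without re-sorting.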

Flushes and Compaction

● Flushing/compaction per Region

– One thread (CompactSplitThread) per region server

● Minor compaction

– Merges two or more HFiles into one

● Major compaction

– Picks up all HFiles in the region, merges them, and removes deleted key/value pairs

● Regions are split when they grow too large
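The difference between the two compaction types can be sketched as follows. This is a simplified model, not HBase's implementation: the cell layout `(row, column, timestamp, kind, value)` and the single column-level delete marker are assumptions, and real HBase also considers versions, TTLs and several delete types.

```python
import heapq

def sort_key(cell):
    row, column, timestamp, kind, _value = cell
    # Newest first; at equal timestamps a delete marker sorts before a put.
    return (row, column, -timestamp, 0 if kind == "delete" else 1)

def minor_compact(hfiles):
    """Merge several sorted files into one sorted file; nothing is dropped."""
    return list(heapq.merge(*hfiles, key=sort_key))

def major_compact(hfiles):
    """Merge all files and drop delete markers plus the cells they mask."""
    result = []
    deleted_at = {}  # (row, column) -> timestamp of the newest delete marker
    for row, column, timestamp, kind, value in heapq.merge(*hfiles, key=sort_key):
        if kind == "delete":
            deleted_at.setdefault((row, column), timestamp)
            continue
        if timestamp <= deleted_at.get((row, column), -1):
            continue  # masked by a delete marker that is at least as new
        result.append((row, column, timestamp, kind, value))
    return result
```

In this model a minor compaction only reduces the number of files, while a major compaction also reclaims the space held by deleted data.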

System Architecture

[Figure: system architecture with Master, RegionServer (Write-Ahead Log, Memstore, HFile), HDFS and ZooKeeper. Source: HBase: The Definitive Guide]

Key Design & Distribution

● Bad idea: continuous number or timestamp (sequential row keys)

– RegionServer hot-spotting

● Better: use hash function and/or composite key

– Distribute keys over random regions

– Uniform reads/writes across key space

● Proper key design is essential

– E.g. reversed URL (Bigtable paper)
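Both key-design ideas can be sketched in a few lines. The helper names, the bucket count of 16 and the choice of SHA-256 are assumptions made for illustration; the slides do not prescribe a particular hash function or prefix layout.

```python
import hashlib

def salted_key(key, buckets=16):
    """Prefix the key with a stable hash bucket so sequential keys spread out."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{key}"

def reversed_domain_key(host, path=""):
    """Reverse the host name so pages of the same domain sort next to each other."""
    return ".".join(reversed(host.split("."))) + path

# Sequential timestamps no longer sort next to each other once salted.
example_keys = [salted_key(str(ts)) for ts in range(1351000000, 1351000005)]
```

The trade-off is that a salted key space can no longer be scanned in its original order with a single range scan; a reader has to issue one scan per bucket and merge the results.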

Overview HBase Utilities

Useful Tools

● hbck – checks and fixes table integrity and region consistency

● HFile – examine contents of HFile

● HLog – examine contents of HLog file

● OfflineMetaRepair – rebuild meta table from file system

● HBase web interfaces

– Master

– RegionServer

Monitoring Tools

● Ganglia

● Nagios

● OpenTSDB

● …

All tools use metrics provided through JMX

Manual Splitting

● Via master web interface

– Split

● HBase shell split command

● RegionSplitter

– Create table with pre-split regions

– Rolling split of all regions on existing table

– ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter

Disable Automatic Splitting

● Determined by hbase.hregion.max.filesize

● Set to max. 100GB

● OK, but:

– How do I monitor my region growth?

– Where do I split when I have irregular data growth?
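One way to reason about the second question is to pick the split key that balances data volume rather than key count. The sketch below is a hypothetical helper, not something HBase or Hannibal provides; it assumes per-row sizes for the region are already known, for example from sampling.

```python
def choose_split_key(key_sizes):
    """Pick the row key at which to split a region by data volume.

    key_sizes: list of (row_key, size_in_bytes), sorted by row key.
    Returns the first row key of the right-hand daughter region, chosen so
    that the two halves are as close in size as possible, or None if the
    region holds fewer than two rows.
    """
    total = sum(size for _, size in key_sizes)
    best_key, best_gap = None, None
    left = 0
    for (_, prev_size), (key, _) in zip(key_sizes, key_sizes[1:]):
        left += prev_size
        gap = abs(total - 2 * left)  # |right half - left half|
        if best_gap is None or gap < best_gap:
            best_key, best_gap = key, gap
    return best_key
```

With evenly sized rows this lands on the middle key; with one very large row at the end of the range, the split moves towards that row instead.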

HBase Split Visualisation with Hannibal

Hannibal

● Open source project on GitHub

– https://github.com/sentric/hannibal

● Web based

● Implemented in Scala

● Compatible with HBase 0.90

● Support for versions > 0.92 will be added soon

● Check it out!

How well are regions balanced over the cluster?

How well are the regions split for the table?

How did the region evolve over time?

Future Plans

● HBase 0.92 client API changes allow querying the compaction state of regions through HBaseAdmin → differentiate major from minor compactions

● Add tool to find best region-key for irregular data growth

● Expose metrics through JMX

Challenges & Lessons Learned

Challenges

● Everyone is still learning

● Some issues only appear at scale

– At scale, nothing works as advertised

● Production cluster configuration

– Hardware issues

– Tuning cluster configuration to our workloads

● HBase stability

● Monitoring health of HBase

Lessons Learned

● Schema & key design

– What’s queried together should be stored together

● Monitoring/Operational tooling is most important

● Forget “emergency actions”, it takes some time

● You need DevOps in production

● Steep learning curve: you need to know the whole ecosystem

– Hadoop, HDFS, MapReduce, ZooKeeper

Resources to get started

● https://github.com/sentric/hannibal

● http://hbase.apache.org/book.html

● https://github.com/jmhsieh/hbase-repair-scripts

● http://www.sentric.ch/blog/best-practice-why-monitoring-hbase-is-important

● HBase: The Definitive Guide

Questions?

@chrisgugi

Thank you!
