DESCRIPTION
Jon Hsieh, Software Engineer @ Cloudera and HBase Committer. Apache HBase is a distributed non-relational database that provides low-latency random read/write access to massive quantities of data. This talk is broken into two parts. First I'll talk about how, in the past few years, HBase has been deployed in production at companies like Facebook, Pinterest, Groupon, and eBay, and about the vibrant community of contributors from around the world, including folks at Cloudera, Salesforce.com, Intel, Hortonworks, Yahoo!, and Xiaomi. Second I'll talk about the features in the newest release 0.96.x and in the upcoming 0.98.x release.
Apache HBase – Where we’ve been and what’s upcoming Jonathan Hsieh | @jmhsieh Tech lead / Software Engineer at Cloudera | HBase PMC Member Hadoop Users Group UK April 10, 2014
4/10/14 Hadoop Users Group UK
Who Am I?
• Cloudera: • Tech Lead HBase Team • Software Engineer • Apache HBase committer / PMC • Apache Flume founder / PMC
• U of Washington: • Research in Distributed Systems
What is Apache HBase?
Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access.
[Architecture diagram: application and MapReduce clients on top of HBase, backed by ZooKeeper and HDFS]
HBase provides Low-latency Random Access
• Writes: • 1-3ms, 1k-20k writes/sec per node
• Reads: • 0-3ms cached, 10-30ms disk • 10k-40k reads / second / node from cache
• Cell size: • 0B-3MB
• Read, write, and insert data anywhere in the table
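The data model behind these numbers can be pictured as a sorted map. The following toy Python model is illustrative only — `ToyTable`, `put`, `get`, and `scan` are invented names, not the HBase client API — but it shows why sorted row keys make random point reads and short range scans cheap:

```python
from bisect import insort

# Toy model of HBase's sorted key-value layout (illustrative only):
# a table maps a row key to column-family:qualifier cells, and row keys
# stay sorted so point reads and short scans are both cheap.
class ToyTable:
    def __init__(self):
        self.rows = {}          # rowkey -> {"cf:qualifier": value}
        self.sorted_keys = []   # row keys kept in sorted order

    def put(self, rowkey, column, value):
        if rowkey not in self.rows:
            self.rows[rowkey] = {}
            insort(self.sorted_keys, rowkey)  # keep keys sorted on insert
        self.rows[rowkey][column] = value

    def get(self, rowkey, column):
        return self.rows.get(rowkey, {}).get(column)

    def scan(self, start, stop):
        # Short scan over the sorted primary-key range [start, stop)
        return [(k, self.rows[k]) for k in self.sorted_keys if start <= k < stop]

table = ToyTable()
table.put("row2", "d:col1", "bbbb")
table.put("row1", "d:col1", "aaaa")
print(table.get("row1", "d:col1"))                 # aaaa
print([k for k, _ in table.scan("row1", "row3")])  # ['row1', 'row2']
```

Real HBase additionally versions each cell by timestamp and persists to HDFS, but the sorted-map shape is the same.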
Core Properties
• ACID guarantees on a row • Writes are durable • Strong consistency first, then availability • After failure, recover and return the current value instead of returning a stale value • CAS and atomic increments can be efficient.
• Sorted by Primary Key • Short scans are efficient • Partitioned by Primary Key
• Log-Structured Merge Tree • Writes are extremely efficient • Reads are efficient
• Periodic layout optimizations (“compactions”) required to keep reads efficient.
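The log-structured merge design in these bullets can be sketched in a few lines. This is an illustrative toy (`LsmStore` and its methods are invented names, and real HBase flushes HFiles to HDFS rather than keeping dicts in memory), but it shows why writes are cheap, why reads merge the memstore with flushed files, and what compaction does:

```python
# Minimal sketch of a log-structured merge (LSM) store, illustrative only:
# writes go to an in-memory store, which is flushed to immutable sorted
# "files"; reads check newest data first; compaction merges files into one.
class LsmStore:
    def __init__(self, flush_threshold=2):
        self.memstore = {}
        self.files = []  # list of sorted dicts, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value            # writes touch memory only
        if len(self.memstore) >= self.flush_threshold:
            self.files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:              # newest data wins
            return self.memstore[key]
        for f in reversed(self.files):        # newer files before older ones
            if key in f:
                return f[key]
        return None

    def compact(self):
        merged = {}
        for f in self.files:                  # older values overwritten by newer
            merged.update(f)
        self.files = [dict(sorted(merged.items()))]

store = LsmStore()
store.put("a", 1); store.put("b", 2)   # second put triggers a flush
store.put("a", 3); store.put("c", 4)   # second flush, "a" now shadowed
store.compact()
print(store.get("a"))                  # 3 (newest value wins)
```

Without compaction, each read may have to consult every file — which is exactly why the periodic compactions above are required.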
An HBase History
Where We’ve Been
Jan ‘12: 0.92.0
Apache HBase Timeline
Nov ’06: Google BigTable OSDI ‘06
Apr ‘07: First Apache HBase commit as Hadoop contrib project
Apr ‘10: Apache HBase becomes top-level project
Oct ‘13: 0.96.0
Apr’11: CDH3 GA with HBase 0.90.1
May ‘12: HBaseCon 2012
Jun ‘13: HBaseCon 2013
Jan‘08: Promoted to Hadoop subproject
Feb ‘14: 0.98.0
May ‘12: 0.94.0
Summer ‘11: Messages on HBase
Summer ‘09: StumbleUpon goes into production on HBase ~0.20
Nov ‘11: Cassini on HBase
Jan ‘13: Phoenix on HBase
Summer ‘11: Web Crawl Cache
Developer Community
• Active community! • Diverse committers from many organizations
Apache HBase “Nascar” Slide
Apache HBase Core Development
• Vendors • Self Service
Apache HBase Sample Users
• Inbox • Storage • Web • Search • Analytics • Monitoring
Apache HBase Ecosystem Projects
What’s here and new today?
Today: Apache 0.96.2 / 0.98.1
Critical Features
Disaster Recovery
• Cluster Replication • Table Snapshots • Copy Table • Import / Export Tables • Metadata corruption repair tool (hbck)
Administrative and Continuity
• Kerberos-based Authentication • ACL-based Authorization • Config change via rolling restart • Within-version rolling upgrade • Protobuf-based wire protocol for RPC future-proofing
Hardened for 0.96
Table Administration
• Online Schema change • Online Region Merging • Continuous fault injection testing with “Chaos Monkey”
Performance Tuning
• Alternate key encodings for efficient memory usage
• Exploring compactor policy minimizes compaction storms
• Smart and adaptive stochastic region load balancer
• Fast split policy for new tables
MR over Table Snapshots (0.98, CDH5.0)
• Previously, MapReduce jobs over HBase required an online full-table scan
• Idea: take a snapshot and run the MR job over the snapshot files
• Doesn’t use the HBase client • Avoids affecting HBase caches • 3-5x perf boost.
[Diagram: map and reduce tasks reading directly from snapshot files rather than through the live table]
Mean Time to Recovery (MTTR)
• Machine failures happen in distributed systems • Average unavailability when automatically recovering from a failure
• Recovery time for an unclean data center power cycle
[Timeline: detect → repair → notify → recovered; the region is unavailable, then available with the client unaware, then available with the client aware]
Fast notification and detection (0.96)
• Proactive notification of HMaster failure (0.96) • Proactive notification of RS failure (0.96) • Notify client on recovery (0.96)
• Fast server failover (Hardware)
[Timeline: detect → split → assign → replay → recovered, with HDFS involved at each step; the region is unavailable until it becomes available for RW]
Distributed log replay (experimental 0.96)
• Previously had two IO-intensive passes: • Log splitting to intermediate files • Assign and log replay
• Now just one IO-heavy pass: assign first, then split+replay. • Improves read and write recovery times. • Off by default currently*.
[Timeline: detect → assign → split + replay → recovered; the region is unavailable, then available for replay writes, then available for RW]
*Caveat: if you override timestamps you could have READ REPEATED isolation violations (use tags to fix this)
Cell Tags (0.98 experimental)
• Mechanism for attaching arbitrary metadata to Cells.
• Motivation: finer-grained isolation • Use for Accumulo-style cell-level visibility
• Main feature for 0.98 • Other uses:
• Add sequence numbers to enable correct fast read/write recovery
• Potential for schema tags
HTrace (0.96 experimental)
• Problem: • Where is time being spent inside HBase?
• Solution: HTrace framework • Inspired by Google Dapper • Threaded through HBase and HDFS
• Tracks time spent in calls in a distributed system by tracking spans* on different machines.
*Some assembly still required.
HTrace: Distributed Tracing in HBase and HDFS
• Framework inspired by Google Dapper
• Tracks time spent in RPC calls across different machines.
• Threaded through HBase (0.96) and future HDFS.
[Diagram: a trace of RPC calls from the HBase client through the meta lookup, a RegionServer, ZooKeeper, and HDFS NameNode/DataNode; each hop is a span]
Zipkin – Visualizing Spans
• UI + visualization system • Written by Twitter
• Zipkin HBase storage • Zipkin HTrace integration
• View where time from a specific call is spent in HBase, HDFS, and ZK.
A Future HBase
What’s Upcoming
Outline
• Improved mean time to recovery (MTTR)
• Improved Predictability
• Improved Usability
• Improved Multitenancy
Faster read recovery
Improving MTTR Further
Distributed log replay with fast write recovery
[Timeline: detect → assign → split + replay → recovered; the region becomes available for new writes right after assign, and for all reads and writes after replay]
• Writes in HBase do not incur reads. • With distributed log replay, we already have regions open for write.
• Allow fresh writes while replaying old logs*. *Caveat: if you override timestamps you could have READ REPEATED isolation violations (use tags to fix this)
Fast Read Recovery (proposed)
• Idea: pristine-region fast read recovery • If a region has no edits, it is consistent and can recover RW immediately
• Idea: shadow regions for fast read recovery • A shadow region tails the WAL of the primary region • The shadow memstore is one HDFS block behind; it catches up and recovers RW
• Currently some progress in trunk
[Timeline: detect → assign → recovered; if we can guarantee there are no new edits, or that we already have all edits, the region is available for all RW immediately]
Improving the 99th percentile
Improving Predictability
Common causes of performance variability
• Locality loss • Favored nodes, HDFS block affinity
• Compaction • Exploring compactor
• GC* • Off-heap cache
• Hardware hiccups • Multi-WAL, HDFS speculative read
Performance degraded after recovery
• After recovery, reads suffer a performance hit. • Regions have lost locality • To maintain performance after failover, we need to regain locality. • Compact the region to regain locality
• We can do better by using HDFS features
[Graph: service recovers but read performance stays degraded until compaction restores locality]
• Control and track where block replicas are • All files for a region are created such that blocks go to the same set of favored nodes • When failing over, assign the region to one of those favored nodes.
• Currently a preview feature in 0.96 • Disabled by default because it doesn’t work well with the latest balancer or splits. • Will likely use upcoming HDFS block affinity for better operability
• Originally on Facebook’s 0.89, ported to 0.96
Read Throughput: Favored Nodes (experimental 0.96)
[Graph: service recovered with performance sustained, because the region was assigned to a favored node]
Read latency: HDFS hedged read (CDH5.0)
• HBase’s RegionServers use the HDFS client to read one of three HDFS block replicas
• If you choose the slow node, your reads are slow.
• If a read is taking too long, speculatively go to another replica that may be faster.
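The hedged-read idea above can be sketched as two reads racing. This is an illustrative simulation, not the HDFS client API — `read_replica`, `hedged_read`, and the `hedge_after` deadline are invented names standing in for the client's hedging threshold:

```python
import concurrent.futures
import time

def read_replica(replica_id, delay, value):
    time.sleep(delay)          # simulate a slow or fast replica
    return (replica_id, value)

def hedged_read(replicas, hedge_after=0.05):
    """Start a read on the first replica; if it misses the deadline,
    launch a hedge on the second replica and take whichever wins."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(read_replica, *replicas[0])
        done, _ = concurrent.futures.wait([primary], timeout=hedge_after)
        if done:
            return primary.result()          # primary was fast enough
        backup = pool.submit(read_replica, *replicas[1])  # the hedge
        done, _ = concurrent.futures.wait(
            [primary, backup],
            return_when=concurrent.futures.FIRST_COMPLETED)
        return done.pop().result()

# Replica 1 is slow (0.5 s), replica 2 is fast (0.01 s): the hedge wins.
winner, value = hedged_read([(1, 0.5, "block"), (2, 0.01, "block")])
print(winner, value)   # 2 block
```

The trade-off, as in HDFS, is extra read load in exchange for a tighter latency tail.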
[Diagram: a RegionServer reading from one of three HDFS replicas; when the chosen replica is slow, a hedged read goes to another replica]
Read latency: Read Replicas (in progress)
• The HBase client reads from primary region servers.
• If you choose the slow node, your reads are slow.
• Idea: read replicas assigned to other region servers. Replicas periodically catch up (via snapshots or shadow region memstores)
• The client specifies whether a stale read is OK. If a read is taking too long, speculatively go to another replica that may be faster.
[Diagram: the HBase client reads from the primary region replica; if it is too slow, the client reads from a stale secondary replica]
Write latency: Multiple WALs (in progress)
• HBase’s HDFS client writes 3 replicas
• Minimum write latency is bounded by the slowest of the 3 replicas
• Idea: if a write is taking too long, duplicate it on another replica set that may be faster.
[Diagram: a RegionServer writing its WAL to three HDFS replicas; when the write is too slow, it is duplicated to another replica set]
Improving Usability: Autotuning, Tracing, and SQL
Making HBase easier to use and tune.
• Difficult to see what is happening in HBase • Easy to make poor design decisions early without realizing • New developments:
• Memory auto-tuning • HTrace + Zipkin • Frameworks for schema design
Memory Use Auto-tuning (trunk)
• Memory is divided between • the memstore (used for serving recent writes) • the block cache (used for read hot spots)
• Need to choose a balance for the workload
[Diagram: heap split between memstore and block cache for read-heavy, balanced, and write-heavy workloads]
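One possible shape for such a tuner, purely as an illustration — this is not HBase's actual algorithm, and `retune`, its thresholds, and its bounds are all invented — is to nudge the memstore's share of the heap toward whichever side shows pressure:

```python
# Illustrative auto-tuning heuristic (not HBase's actual tuner): grow the
# memstore fraction when flushes are blocking writes, shrink it (growing
# the block cache) when cache misses dominate, within fixed bounds.
def retune(memstore_frac, flushes_blocked, cache_misses,
           step=0.05, lo=0.1, hi=0.7):
    if flushes_blocked > cache_misses:      # write pressure: grow memstore
        memstore_frac = min(hi, memstore_frac + step)
    elif cache_misses > flushes_blocked:    # read pressure: grow block cache
        memstore_frac = max(lo, memstore_frac - step)
    return memstore_frac                    # block cache gets the rest

frac = 0.4
frac = retune(frac, flushes_blocked=10, cache_misses=2)   # write-heavy period
print(round(frac, 2))   # 0.45
frac = retune(frac, flushes_blocked=0, cache_misses=50)   # read-heavy period
print(round(frac, 2))   # 0.4
```

The point of the feature is exactly this: the operator no longer has to guess the split up front for a workload that shifts over time.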
HBase Schemas
• HBase application developers must iterate to find a suitable HBase schema
• Schema is critical for performance at scale • How can we make this easier? • How can we reduce the expertise required to do this?
• Today: • Lots of tuning knobs • Developers need to understand column families, rowkey design, data encoding, …
• Some are expensive to change after the fact
How should I arrange my data?
• Isomorphic data representations!
Tall skinny with compound rowkey:

rowkey    d:
bob-col1  aaaa
bob-col2  bbbb
bob-col3  cccc
bob-col4  dddd
jon-col1  eeee
jon-col2  ffff
jon-col3  gggg
jon-col4  hhhh

Short fat table using column qualifiers:

Rowkey  d:col1  d:col2  d:col3  d:col4
bob     aaaa    bbbb    cccc    dddd
jon     eeee    ffff    gggg    hhhh

Short fat table using column families:

Rowkey  col1:  col2:  col3:  col4:
bob     aaaa   bbbb   cccc   dddd
jon     eeee   ffff   gggg   hhhh
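The isomorphism between these layouts is mechanical, as the following sketch shows — `to_tall_skinny` is an invented helper, not an HBase API — re-keying a short-fat layout into a tall-skinny one with compound row keys:

```python
# Illustrative sketch of the layout isomorphism: the same cells can live
# short-and-fat (one row per entity, one column per attribute) or
# tall-and-skinny (one row per attribute, compound "entity-attribute" key).
def to_tall_skinny(fat_rows, sep="-"):
    tall = {}
    for rowkey, cols in fat_rows.items():
        for qualifier, value in cols.items():
            tall[f"{rowkey}{sep}{qualifier}"] = {"d:": value}
    return dict(sorted(tall.items()))   # HBase keeps rows sorted by key

fat = {
    "bob": {"col1": "aaaa", "col2": "bbbb"},
    "jon": {"col1": "eeee", "col2": "ffff"},
}
tall = to_tall_skinny(fat)
print(list(tall))                # ['bob-col1', 'bob-col2', 'jon-col1', 'jon-col2']
print(tall["jon-col2"]["d:"])    # ffff
```

Because the compound keys sort with the entity as the prefix, a short scan over `bob-` still retrieves all of bob's attributes together — which is why the choice between layouts is about access patterns, not expressiveness.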
With great power comes great responsibility! How can we make this easier for users?
Impala
• Scalable low-latency SQL querying for HDFS (and HBase!) • ODBC/JDBC driver interface • Highlights
• Uses the Hive metastore and its Hive-HBase connector configuration conventions.
• Native code implementation; uses JIT for query execution optimization.
• Authorization via Kerberos support
• Open sourced by Cloudera • https://github.com/cloudera/impala
Phoenix
• A SQL skin over HBase targeting low-latency queries. • JDBC SQL interface • Highlights
• Adds types • Handles compound row key encoding • Secondary indices in development • Provides some pushdown aggregations (coprocessors).
• Open sourced by Salesforce.com • Work from James Taylor, Jesse Yates, et al. • https://github.com/forcedotcom/phoenix
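Compound row key encoding, one of the things Phoenix handles for you, hinges on making the byte order of keys match the logical sort order. A minimal sketch of the idea — illustrative only, not Phoenix's actual encoding, with `encode_key` as an invented helper:

```python
import struct

# Illustrative compound row-key encoding: a variable-length string part
# terminated by \x00, then a fixed-width big-endian integer (">Q"), so
# that sorting keys byte-wise matches sorting (string, int) logically.
def encode_key(user: str, event_id: int) -> bytes:
    return user.encode("utf-8") + b"\x00" + struct.pack(">Q", event_id)

keys = [encode_key("jon", 10), encode_key("bob", 99), encode_key("jon", 2)]
print([k.split(b"\x00")[0] for k in sorted(keys)])  # [b'bob', b'jon', b'jon']
assert sorted(keys) == [encode_key("bob", 99),
                        encode_key("jon", 2),
                        encode_key("jon", 10)]
```

Get the encoding wrong (say, little-endian or ASCII-decimal integers) and range scans silently return rows in the wrong order, which is why having a layer handle it is valuable.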
Kite (nee Cloudera Development Kit/CDK)
• APIs that provide a Dataset abstraction • Provides a get/put/delete API on Avro objects • HBase support in progress
• Highlights • Supports multiple components of the Hadoop distros (Flume, Morphlines, Hive, Crunch, HCat)
• Provides types using Avro and Parquet formats for encoding entities
• Manages schema evolution
• Open sourced by Cloudera • https://github.com/kite-sdk/kite
Many apps and users in a single cluster
Multi-tenancy
Growing HBase
• Pre-0.96.0: scaling up HBase for single HBase applications
• Essentially a single user for a single app.
• Ex: Facebook Messages; one application, many HBase clusters
• Shard users to different pods • Focused on continuity and disaster recovery features
• Cross-cluster Replication • Table Snapshots • Rolling Upgrades
[Chart: number of clusters vs. number of isolated applications; scaling so far meant one giant application across multiple clusters]
Growing HBase
• In 0.96 we introduce primitives for supporting multitenancy
• Many users, many applications, one HBase cluster
• Need some control over the interactions different users cause.
• Ex: manage MR analytics and low-latency serving in one cluster.
[Chart: multitenancy moves HBase from one giant application on multiple clusters toward many applications in one shared cluster]
Namespaces (0.96)
• Namespaces provide an abstraction for multiple tenants to create and manage their own tables within a large HBase instance.
[Diagram: tables grouped into namespaces blue, green, and orange]
Multitenancy goals
• Security (0.96) • Separate admin ACLs for different sets of tables
• Quotas (in progress) • Max tables, max regions
• Performance isolation (in progress) • Limit the performance impact load on one table has on others
• Priority (future) • Prioritize some workloads/tables/users before others
Isolation with Region Server Groups (in progress)
[Diagram: region assignment distribution without region server groups — regions from namespaces blue, green, and orange are spread across all region servers]
[Diagram: region assignment distribution with Region Server Groups (RSG) — each namespace’s regions are assigned only to its group’s region servers]
Conclusions
Summary by Version

             | 0.90 (CDH3)               | 0.92 / 0.94 (CDH4)           | 0.96 (CDH5)                                             | Next (0.98 / 1.0.0)
New Features | Stability                 | Reliability                  | Continuity                                              | Multitenancy
MTTR         | Recovery in hours         | Recovery in minutes          | Recovery of writes in seconds, reads in 10’s of seconds | Recovery in seconds (reads+writes)
Perf         | Baseline                  | Better throughput            | Optimizing performance                                  | Predictable performance
Usability    | HBase developer expertise | HBase operational experience | Distributed systems admin experience                    | Application developer experience
Questions? @jmhsieh