
Apache HBase Low Latency



A deeper look at the HBase read and write paths with a focus on request latency. We look at sources of latency and how to minimize them.


Page 1: Apache HBase Low Latency

HBase Low Latency
Nick Dimiduk, Hortonworks (@xefyr)

Nicolas Liochon, Scaled Risk (@nkeywal)

HBaseCon, May 5, 2014

Page 2: Apache HBase Low Latency

Agenda

• Latency: what it is and how to measure it

• Write path

• Read path

• Next steps

Page 3: Apache HBase Low Latency

What's low latency?

Latency is about percentiles
• Average != 50th percentile
• There are often orders of magnitude between "average" and "95th percentile"
• Post-99% = the "magical 1%". Work in progress here.

• Meaning ranges from microseconds (high-frequency trading) to seconds (interactive queries)
• In this talk: milliseconds
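To make the percentile point concrete, here is a small self-contained sketch (not from the talk): a single outlier request moves the average while leaving the median untouched.

```java
import java.util.Arrays;

// Illustrative sketch: one outlier moves the mean but not the median,
// so "average" and "50th percentile" are very different animals.
public class LatencyPercentiles {

    // Nearest-rank percentile over a sorted sample.
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // 99 one-millisecond requests and a single 200 ms spike (a GC pause, say).
        double[] ms = new double[100];
        Arrays.fill(ms, 1.0);
        ms[99] = 200.0;
        Arrays.sort(ms);

        System.out.printf("mean=%.2f p50=%.2f p95=%.2f p99=%.2f%n",
            mean(ms), percentile(ms, 50), percentile(ms, 95), percentile(ms, 99));
        // mean is ~2.99 ms, yet p50 through p99 are all 1.00 ms: the whole
        // story of this sample lives in the post-99% "magical 1%".
    }
}
```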

Page 4: Apache HBase Low Latency

Measure latency

bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
• More options related to HBase: autoflush, replicas, …
• Latency measured in microseconds
• Easier for internal analysis

YCSB - Yahoo! Cloud Serving Benchmark
• Useful for comparison between databases
• Set of workloads already defined

Page 5: Apache HBase Low Latency

Write path

• Two parts
• Single put (WAL)

• The client just sends the put
• Multiple puts from the client (new behavior since 0.96)

• The client is much smarter

• Four stages to look at for latency
• Start (establish TCP connections, etc.)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system

Page 6: Apache HBase Low Latency

Single put: communication & scheduling

• Client: TCP connection to the server
• Shared: multiple threads on the same client use the same TCP connection

• Pooling is possible and does improve performance in some circumstances
• hbase.client.ipc.pool.size
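As a sketch, the pooling knob above lives in the client-side hbase-site.xml; the value 10 is only an example, not a recommendation:

```xml
<!-- Client-side hbase-site.xml: use several TCP connections per server. -->
<property>
  <name>hbase.client.ipc.pool.size</name>
  <value>10</value> <!-- example value; the default is a single connection -->
</property>
```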

• Server: multiple calls from multiple threads on multiple machines

• Can become thousands of simultaneous queries
• Scheduling is required

Page 7: Apache HBase Low Latency

Single put: real work

• The server must
• Write into the WAL queue
• Sync the WAL queue (HDFS flush)
• Write into the memstore

• The WAL queue is shared between all the regions/handlers
• Sync is avoided if another handler already did the work
• You may flush more than expected
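The sync-avoidance point can be sketched as follows. This is an illustrative model of group commit, not HBase's actual FSHLog code; names like syncUpTo are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of WAL group commit: a handler skips the expensive sync when
// another handler's sync has already covered its sequence number.
public class GroupCommitSketch {
    private final AtomicLong nextSeq = new AtomicLong();   // last appended edit
    private final AtomicLong syncedSeq = new AtomicLong(); // highest durably synced edit
    long syncCalls = 0;

    long append() { return nextSeq.incrementAndGet(); }

    synchronized void syncUpTo(long seq) {
        if (syncedSeq.get() >= seq) {
            return; // another handler already flushed past our edit: no extra sync
        }
        syncCalls++;                  // simulate one HDFS hflush()
        syncedSeq.set(nextSeq.get()); // the flush covers everything appended so far
    }

    public static void main(String[] args) {
        GroupCommitSketch wal = new GroupCommitSketch();
        long a = wal.append();
        long b = wal.append();
        long c = wal.append();
        wal.syncUpTo(c); // one sync covers edits a, b and c
        wal.syncUpTo(a); // already durable: skipped
        wal.syncUpTo(b); // already durable: skipped
        System.out.println("syncs issued: " + wal.syncCalls); // prints 1
    }
}
```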

Page 8: Apache HBase Low Latency

Single put: a small run

Percentile   Time in ms
Mean         1.21
50%          0.95
95%          1.50
99%          2.12

Page 9: Apache HBase Low Latency

Latency sources

• Candidate one: network
• 0.5 ms within a datacenter
• Much less between nodes in the same rack

Percentile   Time in ms
Mean         0.13
50%          0.12
95%          0.15
99%          0.47

Page 10: Apache HBase Low Latency

Latency sources

• Candidate two: HDFS flush

• We can still do better: HADOOP-7714 & sons.

Percentile   Time in ms
Mean         0.33
50%          0.26
95%          0.59
99%          1.24

Page 11: Apache HBase Low Latency

Latency sources

• Millisecond world: everything can go wrong
• JVM
• Network
• OS scheduler
• File system
• All this goes into the post-99% percentile

• Requires monitoring
• Usually, using the latest version helps.

Page 12: Apache HBase Low Latency

Latency sources

• Splits (and presplits)
• Autosharding is great!
• Puts have to wait
• Impact: seconds

• Balancing
• Regions move
• Triggers a retry for the client

• hbase.client.pause = 100 ms since HBase 0.96

• Garbage collection
• Impact: tens of ms, even with a good config
• Covered in the read-path part of this talk

Page 13: Apache HBase Low Latency

From steady to loaded and overloaded

• The number of concurrent tasks is a function of
• Number of cores
• Number of disks
• Number of remote machines used

• Difficult to estimate
• Queues are doomed to happen
• hbase.regionserver.handler.count

• So, for low latency
• Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
• RPC priorities: work in progress (HBASE-11048)

Page 14: Apache HBase Low Latency

From loaded to overloaded

• MemStore takes too much room: flush, then blocks quite quickly
• hbase.regionserver.global.memstore.size.lower.limit
• hbase.regionserver.global.memstore.size
• hbase.hregion.memstore.block.multiplier

• Too many HFiles: block until compactions keep up
• hbase.hstore.blockingStoreFiles

• Too many WAL files: flush and block
• hbase.regionserver.maxlogs
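A hedged sketch of where these thresholds live; the property names come from the slides above, while the values are illustrative placeholders to verify against your version, not recommendations:

```xml
<!-- hbase-site.xml sketch: the blocking thresholds named on this slide.
     All values below are placeholders, not tuning advice. -->
<property>
  <name>hbase.regionserver.global.memstore.size</name>
  <value>0.4</value> <!-- fraction of heap all memstores may use -->
</property>
<property>
  <name>hbase.regionserver.global.memstore.size.lower.limit</name>
  <value>0.95</value> <!-- flush down to this fraction of the limit -->
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value> <!-- block updates when a memstore grows this many flush sizes -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>10</value> <!-- block flushes until compaction gets below this many HFiles -->
</property>
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>32</value> <!-- force flushes once this many WAL files accumulate -->
</property>
```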

Page 15: Apache HBase Low Latency

Machine failure

• Failure
• Detect
• Reallocate
• Replay WAL

• Replaying the WAL is NOT required for puts
• hbase.master.distributed.log.replay

• (default true in 1.0)

• Failure = Detect + Reallocate + Retry
• That's in the range of ~1 s for simple failures
• Silent failures put you in the 10 s range if the hardware does not help

• zookeeper.session.timeout

Page 16: Apache HBase Low Latency

Single puts

• Millisecond range

• Spikes do happen in steady mode
• 100 ms
• Causes: GC, load, splits

Page 17: Apache HBase Low Latency

Streaming puts

HTable#setAutoFlushTo(false)
HTable#put
HTable#flushCommits

• Same as simple puts, but
• Puts are grouped and sent in the background
• Load is taken into account
• Does not block

Page 18: Apache HBase Low Latency

Multiple puts

hbase.client.max.total.tasks (default 100)
hbase.client.max.perserver.tasks (default 5)
hbase.client.max.perregion.tasks (default 1)

• Decouples the client from a latency spike of a region server

• Increases throughput by 50% compared to the old multiput
• Makes splits and GC more transparent

Page 19: Apache HBase Low Latency

Conclusion on the write path

• Single puts can be very fast
• It's not a "hard real-time" system: there are spikes

• Most latency spikes can be hidden when streaming puts

• Failures are NOT that difficult for the write path
• No WAL to replay

Page 20: Apache HBase Low Latency

And now for the read path

Page 21: Apache HBase Low Latency

Read path

• Gets/short scans are assumed for low-latency operations
• Again, two APIs

• Single get: HTable#get(Get)
• Multi-get: HTable#get(List<Get>)

• Four stages, same as the write path
• Start (TCP connection, …)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system: you may need to add machines or tune your workload

Page 22: Apache HBase Low Latency

Multi-get / Client

Group Gets by RegionServer

Execute them one by one
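The client-side grouping above can be sketched like this; serverFor is a hypothetical stand-in for the client's region-location cache, not an HBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the multi-get client: group Gets by the server hosting each
// row's region, so each RegionServer receives one batched call.
public class MultiGetGrouping {

    // Hypothetical routing function standing in for the region-location cache.
    static String serverFor(String row) {
        return row.compareTo("m") < 0 ? "rs1" : "rs2";
    }

    static Map<String, List<String>> groupByServer(List<String> rows) {
        Map<String, List<String>> batches = new TreeMap<>();
        for (String row : rows) {
            batches.computeIfAbsent(serverFor(row), s -> new ArrayList<>()).add(row);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("apple", "zebra", "kiwi", "nut");
        // One batch per RegionServer instead of one RPC per row.
        System.out.println(groupByServer(rows));
        // {rs1=[apple, kiwi], rs2=[zebra, nut]}
    }
}
```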

Page 23: Apache HBase Low Latency

Multi-get / Server

Page 24: Apache HBase Low Latency

Multi-get / Server

Page 25: Apache HBase Low Latency

Access latency magnitudes

Storage hierarchy: a different view

A bumpy ride that has been getting bumpier over time

Dean/2009  

Memory is 100,000x faster than disk!

Disk seek = 10 ms

Page 26: Apache HBase Low Latency

Known unknowns

• For each candidate HFile
• Exclude by file metadata

• Timestamp
• Rowkey range

• Exclude by bloom filter

StoreFileScanner#shouldUseScanner()
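A hedged sketch of the exclusion logic above, in the spirit of shouldUseScanner() but not HBase's code. A HashSet stands in for the bloom filter (a real bloom filter may return false positives, never false negatives, so exclusion stays safe); FileMeta and shouldScan are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: skip an HFile when its metadata proves it cannot hold the row.
public class FileExclusionSketch {

    // Hypothetical per-HFile metadata: rowkey range plus a "bloom filter".
    static class FileMeta {
        final String firstRow, lastRow;
        final Set<String> bloom;
        FileMeta(String firstRow, String lastRow, Set<String> bloom) {
            this.firstRow = firstRow;
            this.lastRow = lastRow;
            this.bloom = bloom;
        }
    }

    static boolean shouldScan(FileMeta f, String row) {
        if (row.compareTo(f.firstRow) < 0 || row.compareTo(f.lastRow) > 0) {
            return false; // outside the file's rowkey range: skip the seek
        }
        return f.bloom.contains(row); // bloom miss: the row cannot be in this file
    }

    public static void main(String[] args) {
        FileMeta f = new FileMeta("bbb", "mmm",
            new HashSet<>(Arrays.asList("ccc", "ddd")));
        System.out.println(shouldScan(f, "zzz")); // out of range: false
        System.out.println(shouldScan(f, "eee")); // in range, bloom miss: false
        System.out.println(shouldScan(f, "ccc")); // in range, bloom hit: true
    }
}
```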

Page 27: Apache HBase Low Latency

Unknown knowns

• Merge-sort results polled from Stores
• Seek each scanner to a reference KeyValue
• Retrieve candidate data from disk

• Multiple HFiles => multiple seeks
• hbase.storescanner.parallel.seek.enable=true

• Short-circuit reads
• dfs.client.read.shortcircuit=true

• Block locality
• Happy clusters compact!

HFileBlock#readBlockData()
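A sketch of enabling the two switches above in hbase-site.xml; dfs.domain.socket.path is the usual companion setting for short-circuit reads, and the path shown is only an example for your install to confirm:

```xml
<!-- hbase-site.xml sketch: the read-path switches named on this slide. -->
<property>
  <name>hbase.storescanner.parallel.seek.enable</name>
  <value>true</value> <!-- seek the scanners of multiple HFiles in parallel -->
</property>
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value> <!-- read local blocks directly, bypassing the DataNode -->
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value> <!-- example path only -->
</property>
```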

Page 28: Apache HBase Low Latency

BlockCache

• Reuse previously read data
• Maximize cache hit rate

• Larger cache
• Temporal access locality
• Physical access locality

BlockCache#getBlock()

Page 29: Apache HBase Low Latency

BlockCache Showdown

• LruBlockCache
• Default, on-heap
• Quite good most of the time
• Evictions impact GC

• BucketCache
• Off-heap alternative
• Serialization overhead
• Large memory configurations

http://www.n10k.com/blog/blockcache-showdown/

L2 off-heap BucketCache makes a strong showing

Page 30: Apache HBase Low Latency

Latency enemies: Garbage Collection

• Use heap. Not too much. With CMS.
• Max heap
• 30 GB (compressed pointers)
• 8-16 GB if you care about 9's

• Healthy cluster load
• Regular, reliable collections
• 25-100 ms pauses at regular intervals

• An overloaded RegionServer suffers GC overmuch
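As a sketch of how these recommendations translate into hbase-env.sh; the heap size and occupancy fraction are illustrative assumptions to tune against your own GC logs, not values from the talk:

```sh
# hbase-env.sh sketch: a CMS heap in the range the slide suggests.
export HBASE_HEAPSIZE=16384   # MB, i.e. 16 GB: well under the compressed-oops limit
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
```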

Page 31: Apache HBase Low Latency

Off-heap to the rescue?

• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al. (HBASE-10191)

Page 32: Apache HBase Low Latency

Latency enemies: Compactions

• Fewer HFiles => fewer seeks

• Evict data blocks!
• Evict index blocks!!
• hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
• hfile.block.bloom.cacheonwrite

• OS buffer cache to the rescue
• Compacted data is still fresh
• Better than going all the way back to disk

Page 33: Apache HBase Low Latency

Failure

• Detect + Reassign + Replay
• Strong consistency requires replay

• Locality drops to 0
• Cache starts from scratch

Page 34: Apache HBase Low Latency

Hedging our bets

• HDFS hedged reads (2.4, HDFS-5776)
• Reads on secondary DataNodes
• Strongly consistent
• Works at the HDFS level

• Timeline consistency (HBASE-10070)
• Reads on "replica regions"
• Not strongly consistent

Page 35: Apache HBase Low Latency

Read latency in summary

• Steady mode
• Cache hit: < 1 ms
• Cache miss: + 10 ms per seek
• Writing while reading => cache churn
• GC: 25-100 ms pauses at regular intervals

Network request + (1 - P(cache hit)) * (10 ms * seeks)

• Same long-tail issues as writes
• Overloaded: same scheduling issues as writes
• Partial failures hurt a lot
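Plugging example numbers into the formula above makes the cache-hit sensitivity visible. The 0.5 ms network and 10 ms seek figures come from the earlier slides; the hit rates and HFile count are assumed values.

```java
// Back-of-envelope model of the slide's read-latency formula:
// latency = network + (1 - hitRate) * (seekMs * seeks)
public class ReadLatencyModel {

    static double expectedMs(double networkMs, double hitRate, int seeks, double seekMs) {
        return networkMs + (1.0 - hitRate) * (seekMs * seeks);
    }

    public static void main(String[] args) {
        // 0.5 ms network, 10 ms per disk seek, 3 HFiles per read.
        System.out.println(expectedMs(0.5, 1.00, 3, 10.0)); // all cache hits: ~0.5 ms
        System.out.println(expectedMs(0.5, 0.95, 3, 10.0)); // 5% misses: ~2 ms
        System.out.println(expectedMs(0.5, 0.50, 3, 10.0)); // cold cache: ~15.5 ms
    }
}
```

The same seek count costs 30x more once the cache hit rate drops, which is why compactions and cache sizing dominate read latency.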

Page 36: Apache HBase Low Latency

HBase ranges for 99% latency

           Put            Streamed Multiput   Get            Timeline get
Steady     milliseconds   milliseconds        milliseconds   milliseconds
Failure    seconds        seconds             seconds        milliseconds
GC         10's of ms     milliseconds        10's of ms     milliseconds

Page 37: Apache HBase Low Latency

What's next

• Less GC
• Use fewer objects
• Off-heap

• Compressed BlockCache (HBASE-8894)
• Preferred location (HBASE-4755)

• The "magical 1%"
• Most tools stop at the 99% latency
• What happens after is much more complex

Page 38: Apache HBase Low Latency

Thanks!
Nick Dimiduk, Hortonworks (@xefyr)

Nicolas Liochon, Scaled Risk (@nkeywal)

HBaseCon, May 5, 2014