
Page 1: HBase Low Latency

HBase Low Latency

Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)

Hadoop Summit June 4, 2014

Page 2: HBase Low Latency

Agenda

• Latency: what it is and how to measure it

• Write path

• Read path

• Next steps

Page 3: HBase Low Latency

What’s low latency

Latency is about percentiles
  • Average != 50th percentile
  • There are often orders of magnitude between « average » and « 95th percentile »
  • Post 99% = the « magical 1% ». Work in progress here.

• Meaning ranges from microseconds (High Frequency Trading) to seconds (interactive queries)
  • In this talk: milliseconds

Page 4: HBase Low Latency

Measure latency

bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
  • More options related to HBase: autoflush, replicas, …
  • Latency measured in microseconds
  • Easier for internal analysis

YCSB - Yahoo! Cloud Serving Benchmark
  • Useful for comparison between databases
  • Set of workloads already defined

Page 5: HBase Low Latency

Write path

• Two parts
  • Single put (WAL)
    • The client just sends the put
  • Multiple puts from the client (new behavior since 0.96)
    • The client is much smarter

• Four stages to look at for latency
  • Start (establish TCP connections, etc.)
  • Steady: when expected conditions are met
  • Machine failure: expected as well
  • Overloaded system

Page 6: HBase Low Latency

Single put: communication & scheduling

• Client: one TCP connection to the server
  • Shared: multiple threads on the same client use the same TCP connection
  • Pooling is possible and does improve performance in some circumstances (sketch below)
    • hbase.client.ipc.pool.size

• Server: multiple calls from multiple threads on multiple machines
  • Can become thousands of simultaneous queries
  • Scheduling is required
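As a rough illustration of the pooling option above, a minimal client sketch; the table name "usertable" and the pool size of 10 are made-up values, and the 0.96-era HConnection API is assumed:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HConnection;
  import org.apache.hadoop.hbase.client.HConnectionManager;
  import org.apache.hadoop.hbase.client.HTableInterface;

  public class PooledClient {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // Keep a pool of RPC connections per server instead of one shared socket
      conf.setInt("hbase.client.ipc.pool.size", 10);
      HConnection connection = HConnectionManager.createConnection(conf);
      HTableInterface table = connection.getTable("usertable");
      // ... issue puts/gets from multiple threads ...
      table.close();
      connection.close();
    }
  }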

Page 7: HBase Low Latency

Single put: real work

• The server must
  • Write into the WAL queue
  • Sync the WAL queue (HDFS flush)
  • Write into the memstore

• The WAL queue is shared between all the regions/handlers
  • Sync is avoided if another handler already did the work
  • Your handler may flush more data than expected

Page 8: HBase Low Latency

Single put: a small run

Percentile   Time in ms
Mean         1.21
50%          0.95
95%          1.50
99%          2.12

Page 9: HBase Low Latency

Latency sources

• Candidate one: the network
  • 0.5 ms within a datacenter
  • Much less between nodes in the same rack

Percentile   Time in ms
Mean         0.13
50%          0.12
95%          0.15
99%          0.47

Page 10: HBase Low Latency

Latency sources

• Candidate two: HDFS Flush

• We can still do better: HADOOP-7714 & sons.

Percentile   Time in ms
Mean         0.33
50%          0.26
95%          0.59
99%          1.24

Page 11: HBase Low Latency

Latency sources

• Millisecond world: everything can go wrong
  • JVM
  • Network
  • OS scheduler
  • File system
  • All this goes into the post-99% percentile

• Requires monitoring
  • Usually, using the latest version helps

Page 12: HBase Low Latency

Latency sources

• Split (and presplits)
  • Autosharding is great!
  • Puts have to wait
  • Impact: seconds

• Balance
  • Regions move
  • Triggers a retry for the client
  • hbase.client.pause = 100 ms since HBase 0.96

• Garbage collection
  • Impact: 10's of ms, even with a good config
  • Covered in the read-path part of this talk

Page 13: HBase Low Latency

From steady to loaded and overloaded

• The number of concurrent tasks is a function of
  • Number of cores
  • Number of disks
  • Number of remote machines used

• Difficult to estimate
  • Queues are doomed to happen
  • hbase.regionserver.handler.count

• So, for low latency
  • Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
  • RPC priorities: work in progress (HBASE-11048)

Page 14: HBase Low Latency

From loaded to overloaded

• MemStore takes too much room: flush, then blocks quite quickly
  • hbase.regionserver.global.memstore.size.lower.limit
  • hbase.regionserver.global.memstore.size
  • hbase.hregion.memstore.block.multiplier

• Too many HFiles: block until compactions keep up
  • hbase.hstore.blockingStoreFiles

• Too many WAL files: flush and block
  • hbase.regionserver.maxlogs

Page 15: HBase Low Latency

Machine failure

• Failure
  • Detect
  • Reallocate
  • Replay WAL

• Replaying the WAL is NOT required for puts
  • hbase.master.distributed.log.replay
    • (default true in 1.0)

• Failure = Detect + Reallocate + Retry
  • That's in the range of ~1 s for simple failures
  • Silent failures put you in the 10 s range if the hardware does not help
    • zookeeper.session.timeout

Page 16: HBase Low Latency

Single puts

• Millisecond range

• Spikes do happen in steady mode
  • 100 ms
  • Causes: GC, load, splits

Page 17: HBase Low Latency

Streaming puts

HTable#setAutoFlushTo(false)

HTable#put()

HTable#flushCommits()

• Same as single puts, but
  • Puts are grouped and sent in the background
  • Load is taken into account
  • Does not block
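A minimal streaming-put sketch using the client calls shown above; the table "usertable", family "f", qualifier "q" and the row count are made-up values for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class StreamingPuts {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "usertable");
      table.setAutoFlushTo(false);             // buffer puts client-side
      for (int i = 0; i < 100000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
        table.put(put);                        // grouped and sent in the background
      }
      table.flushCommits();                    // push whatever is still buffered
      table.close();
    }
  }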

Page 18: HBase Low Latency

Multiple puts

hbase.client.max.total.tasks (default 100)

hbase.client.max.perserver.tasks (default 5)

hbase.client.max.perregion.tasks (default 1)

• Decouple the client from a latency spike of a region server

• Increases throughput by 50% compared to the old multiput
• Makes splits and GC more transparent
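These limits are plain client-side settings; a small sketch that simply restates the defaults quoted above (tune them for your own workload):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class ClientTaskLimits {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      // Bound the number of in-flight background put tasks so one slow
      // region server cannot stall the whole client.
      conf.setInt("hbase.client.max.total.tasks", 100);    // across all servers
      conf.setInt("hbase.client.max.perserver.tasks", 5);  // per region server
      conf.setInt("hbase.client.max.perregion.tasks", 1);  // per region
    }
  }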

Page 19: HBase Low Latency

Conclusion on write path

• Single puts can be very fast
  • It's not a « hard real time » system: there are spikes

• Most latency spikes can be hidden by streaming puts

• Failures are NOT that difficult for the write path
  • No WAL to replay

Page 20: HBase Low Latency

And now for the read path

Page 21: HBase Low Latency

Read path

• Gets/short scans are assumed for low-latency operations

• Again, two APIs (sketch below)
  • Single get: HTable#get(Get)
  • Multi-get: HTable#get(List<Get>)

• Four stages, same as the write path
  • Start (TCP connection, …)
  • Steady: when expected conditions are met
  • Machine failure: expected as well
  • Overloaded system: you may need to add machines or tune your workload
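A short sketch of the two read APIs above, assuming a 0.96-era client and a made-up table "usertable":

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ReadApis {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "usertable");

      // Single get: one RPC to the region server holding the row
      Result one = table.get(new Get(Bytes.toBytes("row-1")));

      // Multi-get: the client groups the Gets by region server
      List<Get> gets = new ArrayList<Get>();
      gets.add(new Get(Bytes.toBytes("row-2")));
      gets.add(new Get(Bytes.toBytes("row-3")));
      Result[] results = table.get(gets);

      System.out.println(one.isEmpty() + " / " + results.length);
      table.close();
    }
  }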

Page 22: HBase Low Latency

Multi get / Client

Group Gets by RegionServer

Execute them one by one

Page 23: HBase Low Latency

Multi get / Server

Page 24: HBase Low Latency

Multi get / Server

http://hadoop-hbase.blogspot.com/2012/05/hbasecon.html

Page 25: HBase Low Latency

Access latency magnitudes

Dean/2009

Memory is 100,000x faster than disk!

Disk seek = 10ms

Page 26: HBase Low Latency

Known unknowns

• For each candidate HFile
  • Exclude by file metadata
    • Timestamp
    • Rowkey range
  • Exclude by bloom filter

StoreFileScanner#shouldUseScanner()
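As an illustration of giving HBase something to exclude on, a hedged sketch; the family "f", row key and one-hour time range are made-up, while BloomType.ROW and Get#setTimeRange are standard client APIs:

  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.regionserver.BloomType;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ExclusionHints {
    public static void main(String[] args) throws Exception {
      // Bloom filters are declared on the column family; ROW blooms let a get
      // skip HFiles that cannot contain the row key.
      HColumnDescriptor family = new HColumnDescriptor("f");
      family.setBloomFilterType(BloomType.ROW);

      // A narrow time range lets the scanner drop HFiles whose metadata says
      // they only contain cells outside that range.
      long now = System.currentTimeMillis();
      Get get = new Get(Bytes.toBytes("row-1"));
      get.setTimeRange(now - 3600 * 1000L, now);
    }
  }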

Page 27: HBase Low Latency

Unknown knowns

• Merge-sort results polled from Stores
  • Seek each scanner to a reference KeyValue
  • Retrieve candidate data from disk

• Multiple HFiles => multiple seeks
  • hbase.storescanner.parallel.seek.enable=true

• Short-circuit reads
  • dfs.client.read.shortcircuit=true

• Block locality
  • Happy clusters compact!

HFileBlock#readBlockData()

Page 28: HBase Low Latency

BlockCache

• Reuse previously read data

• Maximize the cache hit rate (sketch below)
  • Larger cache
  • Temporal access locality
  • Physical access locality

BlockCache#getBlock()
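One common way to protect the hit rate, as hinted above, is to keep bulk scans out of the cache. A small sketch using the standard Scan options; the caching value of 1000 is an arbitrary example:

  import org.apache.hadoop.hbase.client.Scan;

  public class CacheFriendlyScan {
    public static void main(String[] args) {
      // A large one-off scan should not evict the hot blocks that serve
      // low-latency gets, so it opts out of the BlockCache.
      Scan scan = new Scan();
      scan.setCacheBlocks(false);   // do not pollute the BlockCache
      scan.setCaching(1000);        // rows fetched per RPC
    }
  }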

Page 29: HBase Low Latency

BlockCache Showdown

• LruBlockCache
  • Default, on-heap
  • Quite good most of the time
  • Evictions impact GC

• BucketCache
  • Off-heap alternative
  • Serialization overhead
  • Large memory configurations

http://www.n10k.com/blog/blockcache-showdown/

L2 off-heap BucketCache makes a strong showing

Page 30: HBase Low Latency

Latency enemies: Garbage Collection

• Use heap. Not too much. With CMS.

• Max heap
  • 30 GB (compressed pointers)
  • 8-16 GB if you care about 9's

• Healthy cluster load
  • Regular, reliable collections
  • 25-100 ms pauses at regular intervals

• An overloaded RegionServer suffers disproportionately from GC

Page 31: HBase Low Latency

Off-heap to the rescue?

• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al. (HBASE-10191)

Page 32: HBase Low Latency

Latency enemies: Compactions

• Fewer HFiles => fewer seeks

• Evict data blocks!

• Evict index blocks!!
  • hfile.block.index.cacheonwrite

• Evict bloom blocks!!!
  • hfile.block.bloom.cacheonwrite

• OS buffer cache to the rescue
  • Compacted data is still fresh
  • Better than going all the way back to disk

Page 33: HBase Low Latency

Failure

• Detect + Reassign + Replay
  • Strong consistency requires replay

• Locality drops to 0
• The cache starts from scratch

Page 34: HBase Low Latency

Hedging our bets

• HDFS hedged reads (2.4, HDFS-5776)
  • Reads on secondary DataNodes
  • Strongly consistent
  • Works at the HDFS level

• Timeline consistency (HBASE-10070)
  • Reads on « replica regions »
  • Not strongly consistent
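HBASE-10070 was still in progress at the time of this talk; the sketch below assumes the client API as it eventually shipped with region replicas (table "usertable" and the row key are made-up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Consistency;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TimelineGet {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection connection = ConnectionFactory.createConnection(conf);
           Table table = connection.getTable(TableName.valueOf("usertable"))) {
        Get get = new Get(Bytes.toBytes("row-1"));
        get.setConsistency(Consistency.TIMELINE);  // may be served by a replica region
        Result result = table.get(get);
        if (result.isStale()) {
          // answered by a replica: possibly behind the primary, but low latency
        }
      }
    }
  }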

Page 35: HBase Low Latency

Read latency in summary

• Steady mode
  • Cache hit: < 1 ms
  • Cache miss: + 10 ms per seek
  • Writing while reading => cache churn
  • GC: 25-100 ms pauses at regular intervals

Network request + (1 - P(cache hit)) * (10 ms * seeks)
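For example, taking the ~0.13 ms network round trip measured earlier, a 95% cache hit rate and one seek per miss gives roughly 0.13 + (1 - 0.95) × 10 ≈ 0.6 ms expected latency (illustrative numbers only).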

• Same long-tail issues as the write path
• Overloaded: same scheduling issues as the write path
• Partial failures hurt a lot

Page 36: HBase Low Latency

HBase ranges for 99% latency

           Put                    Streamed Multiput      Get                    Timeline get
Steady     milliseconds           milliseconds           milliseconds           milliseconds
Failure    seconds                seconds                seconds                milliseconds
GC         10's of milliseconds   milliseconds           10's of milliseconds   milliseconds

Page 37: HBase Low Latency

What’s next

• Less GC
  • Use fewer objects
  • Off-heap

• Compressed BlockCache (HBASE-8894)
• Preferred location (HBASE-4755)

• The « magical 1% »
  • Most tools stop at the 99th percentile
  • What happens after that is much more complex

Page 38: HBase Low Latency

Thanks!

Nick Dimiduk, Hortonworks (@xefyr)

Nicolas Liochon, Scaled Risk (@nkeywal)

Hadoop Summit June 4, 2014