
Apache HBase Low Latency



A deeper look at the HBase read and write paths with a focus on request latency. We look at sources of latency and how to minimize them.


Page 1: Apache HBase Low Latency

HBase Low Latency
Nick Dimiduk, Hortonworks (@xefyr)

Nicolas Liochon, Scaled Risk (@nkeywal)

HBaseCon, May 5, 2014

Page 2: Apache HBase Low Latency

Agenda

• Latency: what it is and how to measure it

• Write path

• Read path

• Next steps

Page 3: Apache HBase Low Latency

What's low latency?

Latency is about percentiles
• Average != 50th percentile
• There are often orders of magnitude between "average" and "95th percentile"
• Post-99% = the "magical 1%". Work in progress here.

• Meaning ranges from microseconds (high-frequency trading) to seconds (interactive queries)
• In this talk: milliseconds
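To make the percentile point concrete, here is a small self-contained sketch (not from the talk): a single outlier request moves the average while leaving the median untouched.

```java
import java.util.Arrays;

// Illustrative sketch: one outlier moves the mean but not the median,
// so "average" and "50th percentile" are very different animals.
public class LatencyPercentiles {

    // Nearest-rank percentile over a sorted sample.
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // 99 one-millisecond requests and a single 200 ms spike (a GC pause, say).
        double[] ms = new double[100];
        Arrays.fill(ms, 1.0);
        ms[99] = 200.0;
        Arrays.sort(ms);

        System.out.printf("mean=%.2f p50=%.2f p95=%.2f p99=%.2f%n",
            mean(ms), percentile(ms, 50), percentile(ms, 95), percentile(ms, 99));
        // mean is ~2.99 ms, yet p50 through p99 are all 1.00 ms: the whole
        // story of this sample lives in the post-99% "magical 1%".
    }
}
```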

Page 4: Apache HBase Low Latency

Measure latency

bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
• More options related to HBase: autoflush, replicas, …
• Latency measured in microseconds
• Easier for internal analysis

YCSB - Yahoo! Cloud Serving Benchmark
• Useful for comparison between databases
• Set of workloads already defined

Page 5: Apache HBase Low Latency

Write path

• Two parts
• Single put (WAL)

• The client just sends the put
• Multiple puts from the client (new behavior since 0.96)

• The client is much smarter

• Four stages to look at for latency
• Start (establish TCP connections, etc.)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system

Page 6: Apache HBase Low Latency

Single put: communication & scheduling

• Client: TCP connection to the server
• Shared: multiple threads on the same client use the same TCP connection

• Pooling is possible and does improve performance in some circumstances
• hbase.client.ipc.pool.size
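As a sketch, the pooling knob above lives in the client-side hbase-site.xml; the value 10 is only an example, not a recommendation:

```xml
<!-- Client-side hbase-site.xml: use several TCP connections per server. -->
<property>
  <name>hbase.client.ipc.pool.size</name>
  <value>10</value> <!-- example value; the default is a single connection -->
</property>
```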

• Server: multiple calls from multiple threads on multiple machines

• Can become thousands of simultaneous queries
• Scheduling is required

Page 7: Apache HBase Low Latency

Single put: real work

• The server must
• Write into the WAL queue
• Sync the WAL queue (HDFS flush)
• Write into the memstore

• The WAL queue is shared between all the regions/handlers
• Sync is avoided if another handler already did the work
• You may flush more than expected
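The sync-avoidance point can be sketched as follows. This is an illustrative model of group commit, not HBase's actual FSHLog code; names like syncUpTo are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of WAL group commit: a handler skips the expensive sync when
// another handler's sync has already covered its sequence number.
public class GroupCommitSketch {
    private final AtomicLong nextSeq = new AtomicLong();   // last appended edit
    private final AtomicLong syncedSeq = new AtomicLong(); // highest durably synced edit
    long syncCalls = 0;

    long append() { return nextSeq.incrementAndGet(); }

    synchronized void syncUpTo(long seq) {
        if (syncedSeq.get() >= seq) {
            return; // another handler already flushed past our edit: no extra sync
        }
        syncCalls++;                  // simulate one HDFS hflush()
        syncedSeq.set(nextSeq.get()); // the flush covers everything appended so far
    }

    public static void main(String[] args) {
        GroupCommitSketch wal = new GroupCommitSketch();
        long a = wal.append();
        long b = wal.append();
        long c = wal.append();
        wal.syncUpTo(c); // one sync covers edits a, b and c
        wal.syncUpTo(a); // already durable: skipped
        wal.syncUpTo(b); // already durable: skipped
        System.out.println("syncs issued: " + wal.syncCalls); // prints 1
    }
}
```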

Page 8: Apache HBase Low Latency

Single put: a small run

Percentile   Time in ms
Mean         1.21
50%          0.95
95%          1.50
99%          2.12

Page 9: Apache HBase Low Latency

Latency sources

• Candidate one: network
• 0.5 ms within a datacenter
• Much less between nodes in the same rack

Percentile   Time in ms
Mean         0.13
50%          0.12
95%          0.15
99%          0.47

Page 10: Apache HBase Low Latency

Latency sources

• Candidate two: HDFS flush

• We can still do better: HADOOP-7714 & sons.

Percentile   Time in ms
Mean         0.33
50%          0.26
95%          0.59
99%          1.24

Page 11: Apache HBase Low Latency

Latency sources

• Millisecond world: everything can go wrong
• JVM
• Network
• OS scheduler
• File system
• All this goes into the post-99% percentile

• Requires monitoring
• Usually, using the latest version helps.

Page 12: Apache HBase Low Latency

Latency sources

• Splits (and presplits)
• Autosharding is great!
• Puts have to wait
• Impact: seconds

• Balancing
• Regions move
• Triggers a retry for the client

• hbase.client.pause = 100 ms since HBase 0.96

• Garbage collection
• Impact: tens of ms, even with a good config
• Covered in the read-path part of this talk

Page 13: Apache HBase Low Latency

From steady to loaded and overloaded

• The number of concurrent tasks is a function of
• Number of cores
• Number of disks
• Number of remote machines used

• Difficult to estimate
• Queues are doomed to happen
• hbase.regionserver.handler.count

• So, for low latency
• Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
• RPC priorities: work in progress (HBASE-11048)

Page 14: Apache HBase Low Latency

From loaded to overloaded

• MemStore takes too much room: flush, then blocks quite quickly
• hbase.regionserver.global.memstore.size.lower.limit
• hbase.regionserver.global.memstore.size
• hbase.hregion.memstore.block.multiplier

• Too many HFiles: block until compactions keep up
• hbase.hstore.blockingStoreFiles

• Too many WAL files: flush and block
• hbase.regionserver.maxlogs
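A hedged sketch of where these thresholds live; the property names come from the slides above, while the values are illustrative placeholders to verify against your version, not recommendations:

```xml
<!-- hbase-site.xml sketch: the blocking thresholds named on this slide.
     All values below are placeholders, not tuning advice. -->
<property>
  <name>hbase.regionserver.global.memstore.size</name>
  <value>0.4</value> <!-- fraction of heap all memstores may use -->
</property>
<property>
  <name>hbase.regionserver.global.memstore.size.lower.limit</name>
  <value>0.95</value> <!-- flush down to this fraction of the limit -->
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value> <!-- block updates when a memstore grows this many flush sizes -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>10</value> <!-- block flushes until compaction gets below this many HFiles -->
</property>
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>32</value> <!-- force flushes once this many WAL files accumulate -->
</property>
```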

Page 15: Apache HBase Low Latency

Machine failure

• Failure
• Detect
• Reallocate
• Replay WAL

• Replaying the WAL is NOT required for puts
• hbase.master.distributed.log.replay

• (default true in 1.0)

• Failure = Detect + Reallocate + Retry
• That's in the range of ~1 s for simple failures
• Silent failures put you in the 10 s range if the hardware does not help

• zookeeper.session.timeout

Page 16: Apache HBase Low Latency

Single puts

• Millisecond range

• Spikes do happen in steady mode
• 100 ms
• Causes: GC, load, splits

Page 17: Apache HBase Low Latency

Streaming puts

HTable#setAutoFlushTo(false)
HTable#put
HTable#flushCommits

• Same as simple puts, but
• Puts are grouped and sent in the background
• Load is taken into account
• Does not block

Page 18: Apache HBase Low Latency

Multiple puts

hbase.client.max.total.tasks (default 100)
hbase.client.max.perserver.tasks (default 5)
hbase.client.max.perregion.tasks (default 1)

• Decouples the client from a latency spike of a region server

• Increases throughput by 50% compared to the old multiput
• Makes splits and GC more transparent

Page 19: Apache HBase Low Latency

Conclusion on the write path

• Single puts can be very fast
• It's not a "hard real-time" system: there are spikes

• Most latency spikes can be hidden when streaming puts

• Failures are NOT that difficult for the write path
• No WAL to replay

Page 20: Apache HBase Low Latency

And now for the read path

Page 21: Apache HBase Low Latency

Read path

• Gets/short scans are assumed for low-latency operations
• Again, two APIs

• Single get: HTable#get(Get)
• Multi-get: HTable#get(List<Get>)

• Four stages, same as the write path
• Start (TCP connection, …)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system: you may need to add machines or tune your workload

Page 22: Apache HBase Low Latency

Multi-get / Client

Group Gets by RegionServer

Execute them one by one
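The client-side grouping above can be sketched like this; serverFor is a hypothetical stand-in for the client's region-location cache, not an HBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the multi-get client: group Gets by the server hosting each
// row's region, so each RegionServer receives one batched call.
public class MultiGetGrouping {

    // Hypothetical routing function standing in for the region-location cache.
    static String serverFor(String row) {
        return row.compareTo("m") < 0 ? "rs1" : "rs2";
    }

    static Map<String, List<String>> groupByServer(List<String> rows) {
        Map<String, List<String>> batches = new TreeMap<>();
        for (String row : rows) {
            batches.computeIfAbsent(serverFor(row), s -> new ArrayList<>()).add(row);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("apple", "zebra", "kiwi", "nut");
        // One batch per RegionServer instead of one RPC per row.
        System.out.println(groupByServer(rows));
        // {rs1=[apple, kiwi], rs2=[zebra, nut]}
    }
}
```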

Page 23: Apache HBase Low Latency

Multi-get / Server

Page 24: Apache HBase Low Latency

Multi-get / Server

Page 25: Apache HBase Low Latency

Access latency magnitudes

Storage hierarchy: a different view

A bumpy ride that has been getting bumpier over time

Dean/2009  

Memory is 100,000x faster than disk!

Disk seek = 10 ms

Page 26: Apache HBase Low Latency

Known unknowns

• For each candidate HFile
• Exclude by file metadata

• Timestamp
• Rowkey range

• Exclude by bloom filter

StoreFileScanner#shouldUseScanner()
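A hedged sketch of the exclusion logic above, in the spirit of shouldUseScanner() but not HBase's code. A HashSet stands in for the bloom filter (a real bloom filter may return false positives, never false negatives, so exclusion stays safe); FileMeta and shouldScan are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: skip an HFile when its metadata proves it cannot hold the row.
public class FileExclusionSketch {

    // Hypothetical per-HFile metadata: rowkey range plus a "bloom filter".
    static class FileMeta {
        final String firstRow, lastRow;
        final Set<String> bloom;
        FileMeta(String firstRow, String lastRow, Set<String> bloom) {
            this.firstRow = firstRow;
            this.lastRow = lastRow;
            this.bloom = bloom;
        }
    }

    static boolean shouldScan(FileMeta f, String row) {
        if (row.compareTo(f.firstRow) < 0 || row.compareTo(f.lastRow) > 0) {
            return false; // outside the file's rowkey range: skip the seek
        }
        return f.bloom.contains(row); // bloom miss: the row cannot be in this file
    }

    public static void main(String[] args) {
        FileMeta f = new FileMeta("bbb", "mmm",
            new HashSet<>(Arrays.asList("ccc", "ddd")));
        System.out.println(shouldScan(f, "zzz")); // out of range: false
        System.out.println(shouldScan(f, "eee")); // in range, bloom miss: false
        System.out.println(shouldScan(f, "ccc")); // in range, bloom hit: true
    }
}
```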

Page 27: Apache HBase Low Latency

Unknown knowns

• Merge-sort results polled from Stores
• Seek each scanner to a reference KeyValue
• Retrieve candidate data from disk

• Multiple HFiles => multiple seeks
• hbase.storescanner.parallel.seek.enable=true

• Short-circuit reads
• dfs.client.read.shortcircuit=true

• Block locality
• Happy clusters compact!

HFileBlock#readBlockData()
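A sketch of enabling the two switches above in hbase-site.xml; dfs.domain.socket.path is the usual companion setting for short-circuit reads, and the path shown is only an example for your install to confirm:

```xml
<!-- hbase-site.xml sketch: the read-path switches named on this slide. -->
<property>
  <name>hbase.storescanner.parallel.seek.enable</name>
  <value>true</value> <!-- seek the scanners of multiple HFiles in parallel -->
</property>
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value> <!-- read local blocks directly, bypassing the DataNode -->
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value> <!-- example path only -->
</property>
```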

Page 28: Apache HBase Low Latency

BlockCache

• Reuse previously read data
• Maximize cache hit rate

• Larger cache
• Temporal access locality
• Physical access locality

BlockCache#getBlock()

Page 29: Apache HBase Low Latency

BlockCache Showdown

• LruBlockCache
• Default, on-heap
• Quite good most of the time
• Evictions impact GC

• BucketCache
• Off-heap alternative
• Serialization overhead
• Large memory configurations

http://www.n10k.com/blog/blockcache-showdown/

L2 off-heap BucketCache makes a strong showing

Page 30: Apache HBase Low Latency

Latency enemies: Garbage Collection

• Use heap. Not too much. With CMS.
• Max heap
• 30 GB (compressed pointers)
• 8-16 GB if you care about 9's

• Healthy cluster load
• Regular, reliable collections
• 25-100 ms pauses at regular intervals

• An overloaded RegionServer suffers GC overmuch
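As a sketch of how these recommendations translate into hbase-env.sh; the heap size and occupancy fraction are illustrative assumptions to tune against your own GC logs, not values from the talk:

```sh
# hbase-env.sh sketch: a CMS heap in the range the slide suggests.
export HBASE_HEAPSIZE=16384   # MB, i.e. 16 GB: well under the compressed-oops limit
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
```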

Page 31: Apache HBase Low Latency

Off-heap to the rescue?

• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al. (HBASE-10191)

Page 32: Apache HBase Low Latency

Latency enemies: Compactions

• Fewer HFiles => fewer seeks

• Evict data blocks!
• Evict index blocks!!
• hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
• hfile.block.bloom.cacheonwrite

• OS buffer cache to the rescue
• Compacted data is still fresh
• Better than going all the way back to disk

Page 33: Apache HBase Low Latency

Failure

• Detect + Reassign + Replay
• Strong consistency requires replay

• Locality drops to 0
• Cache starts from scratch

Page 34: Apache HBase Low Latency

Hedging our bets

• HDFS hedged reads (2.4, HDFS-5776)
• Reads on secondary DataNodes
• Strongly consistent
• Works at the HDFS level

• Timeline consistency (HBASE-10070)
• Reads on "replica regions"
• Not strongly consistent

Page 35: Apache HBase Low Latency

Read latency in summary

• Steady mode
• Cache hit: < 1 ms
• Cache miss: + 10 ms per seek
• Writing while reading => cache churn
• GC: 25-100 ms pauses at regular intervals

Network request + (1 - P(cache hit)) * (10 ms * seeks)

• Same long-tail issues as writes
• Overloaded: same scheduling issues as writes
• Partial failures hurt a lot
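Plugging example numbers into the formula above makes the cache-hit sensitivity visible. The 0.5 ms network and 10 ms seek figures come from the earlier slides; the hit rates and HFile count are assumed values.

```java
// Back-of-envelope model of the slide's read-latency formula:
// latency = network + (1 - hitRate) * (seekMs * seeks)
public class ReadLatencyModel {

    static double expectedMs(double networkMs, double hitRate, int seeks, double seekMs) {
        return networkMs + (1.0 - hitRate) * (seekMs * seeks);
    }

    public static void main(String[] args) {
        // 0.5 ms network, 10 ms per disk seek, 3 HFiles per read.
        System.out.println(expectedMs(0.5, 1.00, 3, 10.0)); // all cache hits: ~0.5 ms
        System.out.println(expectedMs(0.5, 0.95, 3, 10.0)); // 5% misses: ~2 ms
        System.out.println(expectedMs(0.5, 0.50, 3, 10.0)); // cold cache: ~15.5 ms
    }
}
```

The same seek count costs 30x more once the cache hit rate drops, which is why compactions and cache sizing dominate read latency.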

Page 36: Apache HBase Low Latency

HBase ranges for 99% latency

           Put            Streamed Multiput   Get            Timeline get
Steady     milliseconds   milliseconds        milliseconds   milliseconds
Failure    seconds        seconds             seconds        milliseconds
GC         10's of ms     milliseconds        10's of ms     milliseconds

Page 37: Apache HBase Low Latency

What's next

• Less GC
• Use fewer objects
• Off-heap

• Compressed BlockCache (HBASE-8894)
• Preferred location (HBASE-4755)

• The "magical 1%"
• Most tools stop at the 99% latency
• What happens after is much more complex

Page 38: Apache HBase Low Latency

Thanks!
Nick Dimiduk, Hortonworks (@xefyr)

Nicolas Liochon, Scaled Risk (@nkeywal)

HBaseCon, May 5, 2014