DESCRIPTION
Jon Hsieh, Software Engineer @ Cloudera and HBase Committer. Apache HBase is a distributed non-relational database that provides low-latency random read/write access to massive quantities of data. This talk is broken into two parts. First I'll talk about how, in the past few years, HBase has been deployed in production at companies like Facebook, Pinterest, Groupon, and eBay, and about the vibrant community of contributors from around the world, including folks at Cloudera, Salesforce.com, Intel, Hortonworks, Yahoo!, and Xiaomi. Second I'll talk about the features in the newest release 0.96.x and in the upcoming 0.98.x release.
Apache HBase – Where we’ve been and what’s upcoming Jonathan Hsieh | @jmhsieh Tech lead / Software Engineer at Cloudera | HBase PMC Member Hadoop Users Group UK April 10, 2014
4/10/14 Hadoop Users Group UK
Who Am I?
• Cloudera: • Tech Lead HBase Team • Software Engineer • Apache HBase committer / PMC • Apache Flume founder / PMC
• U of Washington: • Research in Distributed Systems
What is Apache HBase?
Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access.
[Architecture diagram: application and MapReduce clients on top of HBase, backed by ZooKeeper and HDFS]
HBase provides Low-latency Random Access
• Writes: • 1-3ms, 1k-20k writes/sec per node
• Reads: • 0-3ms cached, 10-30ms disk • 10k-40k reads / second / node from cache
• Cell size: • 0B-3MB
• Read, write, and insert data anywhere in the table
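The data model behind these numbers can be pictured as a sorted map. The following toy Python model is illustrative only — `ToyTable`, `put`, `get`, and `scan` are invented names, not the HBase client API — but it shows why sorted row keys make random point reads and short range scans cheap:

```python
from bisect import insort

# Toy model of HBase's sorted key-value layout (illustrative only):
# a table maps a row key to column-family:qualifier cells, and row keys
# stay sorted so point reads and short scans are both cheap.
class ToyTable:
    def __init__(self):
        self.rows = {}          # rowkey -> {"cf:qualifier": value}
        self.sorted_keys = []   # row keys kept in sorted order

    def put(self, rowkey, column, value):
        if rowkey not in self.rows:
            self.rows[rowkey] = {}
            insort(self.sorted_keys, rowkey)  # keep keys sorted on insert
        self.rows[rowkey][column] = value

    def get(self, rowkey, column):
        return self.rows.get(rowkey, {}).get(column)

    def scan(self, start, stop):
        # Short scan over the sorted primary-key range [start, stop)
        return [(k, self.rows[k]) for k in self.sorted_keys if start <= k < stop]

table = ToyTable()
table.put("row2", "d:col1", "bbbb")
table.put("row1", "d:col1", "aaaa")
print(table.get("row1", "d:col1"))                 # aaaa
print([k for k, _ in table.scan("row1", "row3")])  # ['row1', 'row2']
```

Real HBase additionally versions each cell by timestamp and persists to HDFS, but the sorted-map shape is the same.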
Core Properties
• ACID guarantees on a row • Writes are durable • Strong consistency first, then availability • After failure, recover and return the current value instead of returning a stale value • CAS and atomic increments can be efficient.
• Sorted by Primary Key • Short scans are efficient • Partitioned by Primary Key
• Log-Structured Merge Tree • Writes are extremely efficient • Reads are efficient
• Periodic layout optimizations (“compactions”) required to keep reads efficient.
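The log-structured merge design in these bullets can be sketched in a few lines. This is an illustrative toy (`LsmStore` and its methods are invented names, and real HBase flushes HFiles to HDFS rather than keeping dicts in memory), but it shows why writes are cheap, why reads merge the memstore with flushed files, and what compaction does:

```python
# Minimal sketch of a log-structured merge (LSM) store, illustrative only:
# writes go to an in-memory store, which is flushed to immutable sorted
# "files"; reads check newest data first; compaction merges files into one.
class LsmStore:
    def __init__(self, flush_threshold=2):
        self.memstore = {}
        self.files = []  # list of sorted dicts, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value            # writes touch memory only
        if len(self.memstore) >= self.flush_threshold:
            self.files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:              # newest data wins
            return self.memstore[key]
        for f in reversed(self.files):        # newer files before older ones
            if key in f:
                return f[key]
        return None

    def compact(self):
        merged = {}
        for f in self.files:                  # older values overwritten by newer
            merged.update(f)
        self.files = [dict(sorted(merged.items()))]

store = LsmStore()
store.put("a", 1); store.put("b", 2)   # second put triggers a flush
store.put("a", 3); store.put("c", 4)   # second flush, "a" now shadowed
store.compact()
print(store.get("a"))                  # 3 (newest value wins)
```

Without compaction, each read may have to consult every file — which is exactly why the periodic compactions above are required.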
An HBase History
Where We’ve Been
Jan ‘12: 0.92.0
Apache HBase Timeline
Nov ’06: Google BigTable OSDI ‘06
Apr ‘07: First Apache HBase commit as Hadoop contrib project
Apr ‘10: Apache HBase becomes top-level project
Oct ‘13: 0.96.0
Apr’11: CDH3 GA with HBase 0.90.1
May ‘12: HBaseCon 2012
Jun ‘13: HBaseCon 2013
Jan‘08: Promoted to Hadoop subproject
Feb ‘14: 0.98.0
May ‘12: 0.94.0
Summer ‘11: Messages on HBase
Summer ‘09: StumbleUpon goes into production on HBase ~0.20
Nov ‘11: Cassini on HBase
Jan ‘13: Phoenix on HBase
Summer ‘11: Web Crawl Cache
Developer Community
• Active community! • Diverse committers from many organizations
Apache HBase “Nascar” Slide
Apache HBase Core Development
• Vendors • Self Service
Apache HBase Sample Users
• Inbox • Storage • Web • Search • Analytics • Monitoring
Apache HBase Ecosystem Projects
What’s here and new today?
Today: Apache 0.96.2 / 0.98.1
Critical Features
Disaster Recovery
• Cluster Replication • Table Snapshots • Copy Table • Import / Export Tables • Metadata corruption repair tool (hbck)
Administrative and Continuity
• Kerberos-based Authentication • ACL-based Authorization • Config change via rolling restart • Within-version rolling upgrade • Protobuf-based wire protocol for RPC future-proofing
Hardened for 0.96
Table Administration
• Online Schema change • Online Region Merging • Continuous fault injection testing with “Chaos Monkey”
Performance Tuning
• Alternate key encodings for efficient memory usage
• Exploring compactor policy minimizes compaction storms
• Smart and adaptive stochastic region load balancer
• Fast split policy for new tables
MR over Table Snapshots (0.98, CDH5.0)
• Previously, MapReduce jobs over HBase required an online full-table scan
• Idea: take a snapshot and run the MR job over the snapshot files
• Doesn’t use the HBase client • Avoids affecting HBase caches • 3-5x perf boost.
[Diagram: map and reduce tasks reading directly from snapshot files rather than through the live table]
Mean Time to Recovery (MTTR)
• Machine failures happen in distributed systems • Average unavailability when automatically recovering from a failure
• Recovery time for an unclean data center power cycle
[Timeline: detect → repair → notify → recovered; the region is unavailable, then available with the client unaware, then available with the client aware]
Fast notification and detection (0.96)
• Proactive notification of HMaster failure (0.96) • Proactive notification of RS failure (0.96) • Notify client on recovery (0.96)
• Fast server failover (Hardware)
[Timeline: detect → split → assign → replay → recovered, with HDFS involved at each step; the region is unavailable until it becomes available for RW]
Distributed log replay (experimental 0.96)
• Previously had two IO-intensive passes: • Log splitting to intermediate files • Assign and log replay
• Now just one IO-heavy pass: assign first, then split+replay. • Improves read and write recovery times. • Off by default currently*.
[Timeline: detect → assign → split + replay → recovered; the region is unavailable, then available for replay writes, then available for RW]
*Caveat: if you override timestamps you could have READ REPEATED isolation violations (use tags to fix this)
Cell Tags (0.98 experimental)
• Mechanism for attaching arbitrary metadata to Cells.
• Motivation: finer-grained isolation • Use for Accumulo-style cell-level visibility
• Main feature for 0.98 • Other uses:
• Add sequence numbers to enable correct fast read/write recovery
• Potential for schema tags
HTrace (0.96 experimental)
• Problem: • Where is time being spent inside HBase?
• Solution: HTrace framework • Inspired by Google Dapper • Threaded through HBase and HDFS
• Tracks time spent in calls in a distributed system by tracking spans* on different machines.
*Some assembly still required.
HTrace: Distributed Tracing in HBase and HDFS
• Framework inspired by Google Dapper
• Tracks time spent in RPC calls across different machines.
• Threaded through HBase (0.96) and future HDFS.
[Diagram: a trace of RPC calls from the HBase client through the meta lookup, a RegionServer, ZooKeeper, and HDFS NameNode/DataNode; each hop is a span]
Zipkin – Visualizing Spans
• UI + visualization system • Written by Twitter
• Zipkin HBase storage • Zipkin HTrace integration
• View where time from a specific call is spent in HBase, HDFS, and ZK.
A Future HBase
What’s Upcoming
Outline
• Improved mean time to recovery (MTTR)
• Improved Predictability
• Improved Usability
• Improved Multitenancy
Faster read recovery
Improving MTTR Further
Distributed log replay with fast write recovery
[Timeline: detect → assign → split + replay → recovered; the region becomes available for new writes right after assign, and for all reads and writes after replay]
• Writes in HBase do not incur reads. • With distributed log replay, we already have regions open for write.
• Allow fresh writes while replaying old logs*. *Caveat: if you override timestamps you could have READ REPEATED isolation violations (use tags to fix this)
Fast Read Recovery (proposed)
• Idea: pristine-region fast read recovery • If a region has no edits, it is consistent and can recover RW immediately
• Idea: shadow regions for fast read recovery • A shadow region tails the WAL of the primary region • The shadow memstore is one HDFS block behind; it catches up and recovers RW
• Currently some progress in trunk
[Timeline: detect → assign → recovered; if we can guarantee there are no new edits, or that we already have all edits, the region is available for all RW immediately]
Improving the 99th percentile
Improving Predictability
Common causes of performance variability
• Locality loss • Favored nodes, HDFS block affinity
• Compaction • Exploring compactor
• GC* • Off-heap cache
• Hardware hiccups • Multi-WAL, HDFS speculative read
Performance degraded after recovery
• After recovery, reads suffer a performance hit. • Regions have lost locality • To maintain performance after failover, we need to regain locality. • Compact the region to regain locality
• We can do better by using HDFS features
[Graph: service recovers but read performance stays degraded until compaction restores locality]
• Control and track where block replicas are • All files for a region are created such that blocks go to the same set of favored nodes • When failing over, assign the region to one of those favored nodes.
• Currently a preview feature in 0.96 • Disabled by default because it doesn’t work well with the latest balancer or splits. • Will likely use upcoming HDFS block affinity for better operability
• Originally on Facebook’s 0.89, ported to 0.96
Read Throughput: Favored Nodes (experimental 0.96)
[Graph: service recovered with performance sustained, because the region was assigned to a favored node]
Read latency: HDFS hedged read (CDH5.0)
• HBase’s RegionServers use the HDFS client to read one of three HDFS block replicas
• If you choose the slow node, your reads are slow.
• If a read is taking too long, speculatively go to another replica that may be faster.
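The hedged-read idea above can be sketched as two reads racing. This is an illustrative simulation, not the HDFS client API — `read_replica`, `hedged_read`, and the `hedge_after` deadline are invented names standing in for the client's hedging threshold:

```python
import concurrent.futures
import time

def read_replica(replica_id, delay, value):
    time.sleep(delay)          # simulate a slow or fast replica
    return (replica_id, value)

def hedged_read(replicas, hedge_after=0.05):
    """Start a read on the first replica; if it misses the deadline,
    launch a hedge on the second replica and take whichever wins."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(read_replica, *replicas[0])
        done, _ = concurrent.futures.wait([primary], timeout=hedge_after)
        if done:
            return primary.result()          # primary was fast enough
        backup = pool.submit(read_replica, *replicas[1])  # the hedge
        done, _ = concurrent.futures.wait(
            [primary, backup],
            return_when=concurrent.futures.FIRST_COMPLETED)
        return done.pop().result()

# Replica 1 is slow (0.5 s), replica 2 is fast (0.01 s): the hedge wins.
winner, value = hedged_read([(1, 0.5, "block"), (2, 0.01, "block")])
print(winner, value)   # 2 block
```

The trade-off, as in HDFS, is extra read load in exchange for a tighter latency tail.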
[Diagram: a RegionServer reading from one of three HDFS replicas; when the chosen replica is slow, a hedged read goes to another replica]
Read latency: Read Replicas (in progress)
• The HBase client reads from primary region servers.
• If you choose the slow node, your reads are slow.
• Idea: read replicas assigned to other region servers. Replicas periodically catch up (via snapshots or shadow region memstores)
• The client specifies whether a stale read is OK. If a read is taking too long, speculatively go to another replica that may be faster.
[Diagram: the HBase client reads from the primary region replica; if it is too slow, the client reads from a stale secondary replica]
Write latency: Multiple WALs (in progress)
• HBase’s HDFS client writes 3 replicas
• Minimum write latency is bounded by the slowest of the 3 replicas
• Idea: if a write is taking too long, duplicate it on another replica set that may be faster.
[Diagram: a RegionServer writing its WAL to three HDFS replicas; when the write is too slow, it is duplicated to another replica set]
Improving Usability: Autotuning, Tracing, and SQL
Making HBase easier to use and tune.
• Difficult to see what is happening in HBase • Easy to make poor design decisions early without realizing • New developments:
• Memory auto-tuning • HTrace + Zipkin • Frameworks for schema design
Memory Use Auto-tuning (trunk)
• Memory is divided between • the memstore (used for serving recent writes) • the block cache (used for read hot spots)
• Need to choose a balance for the workload
[Diagram: heap split between memstore and block cache for read-heavy, balanced, and write-heavy workloads]
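One possible shape for such a tuner, purely as an illustration — this is not HBase's actual algorithm, and `retune`, its thresholds, and its bounds are all invented — is to nudge the memstore's share of the heap toward whichever side shows pressure:

```python
# Illustrative auto-tuning heuristic (not HBase's actual tuner): grow the
# memstore fraction when flushes are blocking writes, shrink it (growing
# the block cache) when cache misses dominate, within fixed bounds.
def retune(memstore_frac, flushes_blocked, cache_misses,
           step=0.05, lo=0.1, hi=0.7):
    if flushes_blocked > cache_misses:      # write pressure: grow memstore
        memstore_frac = min(hi, memstore_frac + step)
    elif cache_misses > flushes_blocked:    # read pressure: grow block cache
        memstore_frac = max(lo, memstore_frac - step)
    return memstore_frac                    # block cache gets the rest

frac = 0.4
frac = retune(frac, flushes_blocked=10, cache_misses=2)   # write-heavy period
print(round(frac, 2))   # 0.45
frac = retune(frac, flushes_blocked=0, cache_misses=50)   # read-heavy period
print(round(frac, 2))   # 0.4
```

The point of the feature is exactly this: the operator no longer has to guess the split up front for a workload that shifts over time.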
HBase Schemas
• HBase application developers must iterate to find a suitable HBase schema
• Schema is critical for performance at scale • How can we make this easier? • How can we reduce the expertise required to do this?
• Today: • Lots of tuning knobs • Developers need to understand column families, rowkey design, data encoding, …
• Some are expensive to change after the fact
How should I arrange my data?
• Isomorphic data representations!
Tall skinny with compound rowkey:

rowkey    d:
bob-col1  aaaa
bob-col2  bbbb
bob-col3  cccc
bob-col4  dddd
jon-col1  eeee
jon-col2  ffff
jon-col3  gggg
jon-col4  hhhh

Short fat table using column qualifiers:

Rowkey  d:col1  d:col2  d:col3  d:col4
bob     aaaa    bbbb    cccc    dddd
jon     eeee    ffff    gggg    hhhh

Short fat table using column families:

Rowkey  col1:  col2:  col3:  col4:
bob     aaaa   bbbb   cccc   dddd
jon     eeee   ffff   gggg   hhhh
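The isomorphism between these layouts is mechanical, as the following sketch shows — `to_tall_skinny` is an invented helper, not an HBase API — re-keying a short-fat layout into a tall-skinny one with compound row keys:

```python
# Illustrative sketch of the layout isomorphism: the same cells can live
# short-and-fat (one row per entity, one column per attribute) or
# tall-and-skinny (one row per attribute, compound "entity-attribute" key).
def to_tall_skinny(fat_rows, sep="-"):
    tall = {}
    for rowkey, cols in fat_rows.items():
        for qualifier, value in cols.items():
            tall[f"{rowkey}{sep}{qualifier}"] = {"d:": value}
    return dict(sorted(tall.items()))   # HBase keeps rows sorted by key

fat = {
    "bob": {"col1": "aaaa", "col2": "bbbb"},
    "jon": {"col1": "eeee", "col2": "ffff"},
}
tall = to_tall_skinny(fat)
print(list(tall))                # ['bob-col1', 'bob-col2', 'jon-col1', 'jon-col2']
print(tall["jon-col2"]["d:"])    # ffff
```

Because the compound keys sort with the entity as the prefix, a short scan over `bob-` still retrieves all of bob's attributes together — which is why the choice between layouts is about access patterns, not expressiveness.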
With great power comes great responsibility! How can we make this easier for users?
Impala
• Scalable low-latency SQL querying for HDFS (and HBase!) • ODBC/JDBC driver interface • Highlights
• Uses the Hive metastore and its Hive-HBase connector configuration conventions.
• Native code implementation; uses JIT for query execution optimization.
• Authorization via Kerberos support
• Open sourced by Cloudera • https://github.com/cloudera/impala
Phoenix
• A SQL skin over HBase targeting low-latency queries. • JDBC SQL interface • Highlights
• Adds types • Handles compound row key encoding • Secondary indices in development • Provides some pushdown aggregations (coprocessors).
• Open sourced by Salesforce.com • Work from James Taylor, Jesse Yates, et al. • https://github.com/forcedotcom/phoenix
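Compound row key encoding, one of the things Phoenix handles for you, hinges on making the byte order of keys match the logical sort order. A minimal sketch of the idea — illustrative only, not Phoenix's actual encoding, with `encode_key` as an invented helper:

```python
import struct

# Illustrative compound row-key encoding: a variable-length string part
# terminated by \x00, then a fixed-width big-endian integer (">Q"), so
# that sorting keys byte-wise matches sorting (string, int) logically.
def encode_key(user: str, event_id: int) -> bytes:
    return user.encode("utf-8") + b"\x00" + struct.pack(">Q", event_id)

keys = [encode_key("jon", 10), encode_key("bob", 99), encode_key("jon", 2)]
print([k.split(b"\x00")[0] for k in sorted(keys)])  # [b'bob', b'jon', b'jon']
assert sorted(keys) == [encode_key("bob", 99),
                        encode_key("jon", 2),
                        encode_key("jon", 10)]
```

Get the encoding wrong (say, little-endian or ASCII-decimal integers) and range scans silently return rows in the wrong order, which is why having a layer handle it is valuable.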
Kite (nee Cloudera Development Kit/CDK)
• APIs that provide a Dataset abstraction • Provides a get/put/delete API on Avro objects • HBase support in progress
• Highlights • Supports multiple components of the Hadoop distros (Flume, Morphlines, Hive, Crunch, HCat)
• Provides types using Avro and Parquet formats for encoding entities
• Manages schema evolution
• Open sourced by Cloudera • https://github.com/kite-sdk/kite
Many apps and users in a single cluster
Multi-tenancy
Growing HBase
• Pre-0.96.0: scaling up HBase for single HBase applications
• Essentially a single user for a single app.
• Ex: Facebook Messages; one application, many HBase clusters
• Shard users to different pods • Focused on continuity and disaster recovery features
• Cross-cluster Replication • Table Snapshots • Rolling Upgrades
[Chart: number of clusters vs. number of isolated applications; scaling so far meant one giant application across multiple clusters]
Growing HBase
• In 0.96 we introduce primitives for supporting multitenancy
• Many users, many applications, one HBase cluster
• Need some control over the interactions different users cause.
• Ex: manage MR analytics and low-latency serving in one cluster.
[Chart: multitenancy moves HBase from one giant application on multiple clusters toward many applications in one shared cluster]
Namespaces (0.96)
• Namespaces provide an abstraction for multiple tenants to create and manage their own tables within a large HBase instance.
[Diagram: tables grouped into namespaces blue, green, and orange]
Multitenancy goals
• Security (0.96) • Separate admin ACLs for different sets of tables
• Quotas (in progress) • Max tables, max regions
• Performance isolation (in progress) • Limit the performance impact load on one table has on others
• Priority (future) • Prioritize some workloads/tables/users before others
Isolation with Region Server Groups (in progress)
[Diagram: region assignment distribution without region server groups — regions from namespaces blue, green, and orange are spread across all region servers]
[Diagram: region assignment distribution with Region Server Groups (RSG) — each namespace’s regions are assigned only to its group’s region servers]
Conclusions
Summary by Version

             | 0.90 (CDH3)               | 0.92 / 0.94 (CDH4)           | 0.96 (CDH5)                                             | Next (0.98 / 1.0.0)
New Features | Stability                 | Reliability                  | Continuity                                              | Multitenancy
MTTR         | Recovery in hours         | Recovery in minutes          | Recovery of writes in seconds, reads in 10’s of seconds | Recovery in seconds (reads+writes)
Perf         | Baseline                  | Better throughput            | Optimizing performance                                  | Predictable performance
Usability    | HBase developer expertise | HBase operational experience | Distributed systems admin experience                    | Application developer experience
Questions? @jmhsieh