Outline
Introduction
Latency practice
Some patches we contributed
Some ongoing patches
Q&A
www.mi.com
About Xiaomi
Mobile internet company founded in 2010
Sold 18.7 million phones in 2013
Over $5 billion revenue in 2013
Sold 11 million phones in Q1, 2014
About Our HBase Team
Founded in October 2012
5 members: Liang Xie, Shaohui Liu, Jianwei Cui, Liangliang He, Honghua Feng
Resolved 130+ JIRAs so far
Our Clusters and Scenarios
15 clusters: 9 online / 2 processing / 4 test
Scenarios: MiCloud, MiPush, MiTalk, Perf Counter
Our Latency Pain Points
Java GC
Stable page write in OS layer
Slow buffered IO (FS journal IO)
Read/Write IO contention
HBase GC Practice
Bucket cache with off-heap mode
Xmn / SurvivorRatio / MaxTenuringThreshold tuning
PretenureSizeThreshold & replication source size
GC concurrent thread number
GC time per day: [2500, 3000]s -> [300, 600]s !!!
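For illustration only, the knobs above correspond to HotSpot flags along these lines (the values are hypothetical, not Xiaomi's production settings):

```
-Xmx16g -Xmn2g                      # total heap / young generation size
-XX:SurvivorRatio=4                 # survivor space sizing in the young gen
-XX:MaxTenuringThreshold=3          # promote surviving objects sooner
-XX:PretenureSizeThreshold=2097152  # allocate big objects (e.g. replication
                                    # source buffers) straight into the old gen
-XX:ParallelGCThreads=8             # stop-the-world GC thread count
-XX:ConcGCThreads=4                 # concurrent GC thread count
-XX:+UseConcMarkSweepGC
```

The point of the talk is that careful tuning of exactly these knobs, plus moving the block cache off-heap, cut GC time per day by roughly an order of magnitude.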
Write Latency Spikes
HBase client: put -> HRegion.batchMutate -> HLog.sync -> SequenceFileLogWriter.sync -> DFSOutputStream.flushOrSync -> DFSOutputStream.waitForAckedSeqno  <- stuck here often!
DataNode pipeline write, in BlockReceiver.receivePacket():
-> receiveNextPacket
-> mirrorPacketTo(mirrorOut)  // write packet to the mirror
-> out.write/flush  // write data to local disk  <- buffered IO
Added instrumentation (HDFS-6110) showed the stalled write was the culprit; strace results confirmed it as well.
Root Cause of Write Latency Spikes
write() is expected to be fast
But it is sometimes blocked by write-back!
Stable Page Write Issue Workaround
Workaround: kernel 2.6.32-279 (RHEL 6.3) -> 2.6.32-220 (6.2), or 2.6.32-279 (6.3) -> 2.6.32-358 (6.4)
Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
Root Cause of Write Latency Spikes
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]
XFS in recent kernels can relieve the journal IO blocking issue; it is more friendly to metadata-heavy workloads like HBase + HDFS.
Write Latency Spikes Testing
8 YCSB threads; write 20 million rows, each 3 * 200 bytes; 3 DataNodes; kernel 3.12.17
Collected statistics on stalled write() calls costing > 100 ms
The largest write() latency on ext4: ~600 ms!
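The stall accounting above amounts to a simple filter over measured latencies. A minimal sketch (the function name and sample data are made up, standing in for HDFS-6110-style instrumentation output):

```python
def stall_stats(latencies_ms, threshold_ms=100):
    """Count the write() calls slower than threshold_ms and report the worst one."""
    stalls = [l for l in latencies_ms if l > threshold_ms]
    return len(stalls), max(stalls, default=0)

# e.g. per-call write() latencies in ms (fake data for illustration)
samples = [2, 5, 3, 140, 7, 600, 4, 260]
print(stall_stats(samples))  # -> (3, 600)
```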
Other Meaningful Latency Work
Long first "put" issue (HBASE-10010)
Token invalid (HDFS-5637)
Retry/timeout settings in DFSClient
Reduce write traffic? (HLog compression)
HDFS IO priority (HADOOP-10410)
Wish List
Real-time HDFS, especially priority related
GC-friendly core data structures
More off-heap; Shenandoah GC
TCP / disk IO characteristic analysis
Need more eyes on the OS layer
Stay tuned…
Some Patches Xiaomi Contributed
New write thread model (HBASE-8755)
Reverse scan (HBASE-4811)
Per table/CF replication (HBASE-8751)
Block index key optimization (HBASE-7845)
1. New Write Thread Model
Old model: many WriteHandlers (e.g. 256), each one appending to the local buffer, writing to HDFS, and syncing to HDFS by itself.
Problem: WriteHandler does everything, severe lock contention!
New Write Thread Model
New model: the WriteHandlers (e.g. 256) only append edits to the local buffer; a single AsyncWriter writes to HDFS; a small group of AsyncSyncer threads syncs to HDFS; an AsyncNotifier notifies the waiting WriteHandlers.
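The new model is essentially a producer/consumer pipeline. A drastically simplified single-process sketch (the queues, thread counts, and list-based "HDFS" are illustrative, not the actual HBASE-8755 code):

```python
import queue
import threading

buf = queue.Queue()        # local buffer, filled by the WriteHandlers
to_sync = queue.Queue()    # written-but-not-yet-durable edits
wal, durable = [], []      # stand-ins for the HDFS-side file state

def async_writer():        # single AsyncWriter: drain the buffer, write to "HDFS"
    while True:
        edit = buf.get()
        if edit is None:
            to_sync.put(None)
            return
        wal.append(edit)   # written, not yet durable
        to_sync.put(edit)

def async_syncer():        # AsyncSyncer: make written edits durable
    while True:
        edit = to_sync.get()
        if edit is None:
            return
        durable.append(edit)   # an AsyncNotifier would now wake the blocked handlers

writer = threading.Thread(target=async_writer)
syncer = threading.Thread(target=async_syncer)
writer.start(); syncer.start()
for i in range(100):       # the WriteHandlers now merely enqueue edits
    buf.put(f"edit-{i}")
buf.put(None)              # shutdown signal, for this sketch only
writer.join(); syncer.join()
print(len(durable))  # -> 100
```

The win is that the expensive write and sync paths are each owned by dedicated threads, so hundreds of handlers no longer race for the same HLog lock.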
2. Reverse Scan
[Figure: three sorted KV files (rows Row1..Row6, several KVs per row), one store scanner per file]
1. All scanners seek to ‘previous’ rows (SeekBefore)
2. Figure out next row : max ‘previous’ row
3. All scanners seek to first KV of next row (SeekTo)
Performance : 70% of forward scan
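The three-step loop above can be sketched against in-memory sorted files (the `reverse_scan` helper and list-based scanners are illustrative, not the HBASE-4811 code; the data mirrors the slide's figure):

```python
def reverse_scan(files, start_row):
    """Walk rows in reverse order across several sorted KV files."""
    cur = start_row
    while True:
        # 1. every scanner seeks to its last row strictly before `cur` (SeekBefore)
        prevs = [next((r for r, _ in reversed(f) if r < cur), None) for f in files]
        prevs = [p for p in prevs if p is not None]
        if not prevs:
            return
        # 2. the next row to emit is the max of those 'previous' rows
        nxt = max(prevs)
        # 3. every scanner seeks to the first KV of that row (SeekTo); emit the row
        yield nxt, [kv for f in files for r, kv in f if r == nxt]
        cur = nxt

f1 = [("Row2", "kv2"), ("Row3", "kv1"), ("Row3", "kv3"),
      ("Row4", "kv2"), ("Row4", "kv5"), ("Row5", "kv2")]
f2 = [("Row1", "kv2"), ("Row3", "kv2"), ("Row3", "kv4"),
      ("Row4", "kv4"), ("Row4", "kv6"), ("Row5", "kv3")]
f3 = [("Row1", "kv1"), ("Row2", "kv1"), ("Row2", "kv3"),
      ("Row4", "kv1"), ("Row4", "kv3"), ("Row6", "kv1")]

rows = [r for r, _ in reverse_scan([f1, f2, f3], "Row9")]
print(rows)  # -> ['Row6', 'Row5', 'Row4', 'Row3', 'Row2', 'Row1']
```

The extra SeekBefore per step is why a reverse scan runs at roughly 70% of forward-scan speed.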
3. Per Table/CF Replication
Source cluster has T1: cfA, cfB and T2: cfX, cfY; PeerA is a full backup, PeerB only wants T2:cfX.
If PeerB creates T2 only: replication can't work!
If PeerB creates both T1 & T2: all data gets replicated!
Need a way to specify which data to replicate!
Per Table/CF Replication
add_peer 'PeerA', 'PeerA_ZK'            (PeerA receives everything: T1:cfA,cfB; T2:cfX,cfY)
add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'  (PeerB receives only T2:cfX)
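Conceptually, the replication source just filters each WAL edit against the peer's table/CF map before shipping it. A sketch (data structures are illustrative, not HBASE-8751's actual representation):

```python
# Peer config: None means "replicate everything"; otherwise table -> set of CFs.
peers = {
    "PeerA": None,                 # add_peer 'PeerA', 'PeerA_ZK'
    "PeerB": {"T2": {"cfX"}},      # add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
}

def should_replicate(peer_cfg, table, cf):
    if peer_cfg is None:           # backup peer: ship every edit
        return True
    if table not in peer_cfg:
        return False
    cfs = peer_cfg[table]
    return cfs is None or cf in cfs

edits = [("T1", "cfA"), ("T1", "cfB"), ("T2", "cfX"), ("T2", "cfY")]
print([e for e in edits if should_replicate(peers["PeerB"], *e)])  # -> [('T2', 'cfX')]
```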
4. Block Index Key Optimization
Block 1 ends with k1 = "ab"; Block 2 starts with k2 = "ah, hello world".
Before: 'Block 2' block index key = "ah, hello world/…"
Now: 'Block 2' block index key = "ac/…" (any fake key with k1 < key <= k2 works)
Reduces block index size
Saves seeking the previous block when the search key is in ['ac', 'ah, hello world']
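The fake key is the shortest byte string strictly greater than the last key of the previous block and no greater than the first key of this block. A byte-string sketch of that computation (HBase's real code works on Cells, so this is only the idea):

```python
def shortest_separator(left: bytes, right: bytes) -> bytes:
    """Return a short key k with left < k <= right, usable as a block index key."""
    assert left < right
    i = 0
    while i < len(left) and i < len(right) and left[i] == right[i]:
        i += 1                      # length of the common prefix
    if i == len(left):
        return right                # left is a prefix of right: cannot shorten
    # at the first differing byte, left[i] < right[i], so bumping it keeps k <= right
    return left[:i] + bytes([left[i] + 1])

print(shortest_separator(b"ab", b"ah, hello world"))  # -> b'ac'
```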
Some Ongoing Patches
Cross-table, cross-row transaction (HBASE-10999)
HLog compactor (HBASE-9873)
Adjusted delete semantic (HBASE-8721)
Coordinated compaction (HBASE-9528)
Quorum master (HBASE-10296)
1. Cross-Row Transaction: Themis
http://github.com/xiaomi/themis
Based on Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
Two-phase commit: strong cross-table/cross-row consistency
Global timestamp server: globally, strictly incremental timestamps
No touch to HBase internals: built on the HBase client and coprocessors
Performance: read 90%, write 23% (same downgrade as Google Percolator)
More details: HBASE-10999
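A toy, single-process rendition of the Percolator-style two-phase commit (the data/lock/write column layout follows the paper; all names and structures here are illustrative, not Themis code):

```python
import itertools

tso = itertools.count(1)   # "global timestamp server": strictly increasing
rows = {}                  # row -> {"data": {start_ts: value}, "lock": ts or None,
                           #         "write": {commit_ts: start_ts}}

def _row(r):
    return rows.setdefault(r, {"data": {}, "lock": None, "write": {}})

def prewrite(writes, start_ts):
    # phase 1: check conflicts on every row, then lock and buffer the data
    for r in writes:
        row = _row(r)
        if row["lock"] is not None or any(c > start_ts for c in row["write"]):
            return False   # conflicting lock or newer committed write: abort
    for r, v in writes.items():
        row = _row(r)
        row["lock"] = start_ts
        row["data"][start_ts] = v
    return True

def commit(writes, start_ts, commit_ts):
    # phase 2: replace each row's lock with a write record
    for r in writes:
        row = _row(r)
        row["lock"] = None
        row["write"][commit_ts] = start_ts

def get(r, read_ts):
    # read the newest committed version visible at read_ts
    row = _row(r)
    c = max((c for c in row["write"] if c <= read_ts), default=None)
    return None if c is None else row["data"][row["write"][c]]

start = next(tso)
w = {"rowX": "a", "rowY": "b"}     # one cross-row transaction
if prewrite(w, start):
    commit(w, start, next(tso))
print(get("rowX", next(tso)), get("rowY", next(tso)))  # -> a b
```

The real system does the per-row steps atomically via coprocessors and handles crashed clients by lock cleanup; this sketch only shows the commit protocol's shape.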
2. HLog Compactor
Region x: few writes, but they scatter across many HLogs (HLog 1, 2, 3 …)
PeriodicMemstoreFlusher: forcefully flushes old memstores
'flushCheckInterval' / 'flushPerChanges': hard to configure
Forced flushes result in 'tiny' HFiles
HBASE-10499: a problematic region can't be flushed at all!
HLog Compactor
Compact: HLog 1,2,3,4 -> HLog x
Archive: HLog 1,2,3,4
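The idea in miniature: copy into a new HLog only the edits of regions whose data is not yet flushed, then archive the old HLogs. A sketch with hypothetical structures (an HLog as a region -> edits map):

```python
def compact_hlogs(hlogs, flushed_regions):
    """Merge old HLogs into one, keeping only edits of still-unflushed regions."""
    compacted = {}                                 # "HLog x"
    for hlog in hlogs:                             # HLog 1, 2, 3, 4 ...
        for region, edits in hlog.items():
            if region not in flushed_regions:      # e.g. the rarely-written Region x
                compacted.setdefault(region, []).extend(edits)
    return compacted                               # old HLogs can now be archived

hlogs = [{"region1": ["e1"], "regionX": ["e2"]},
         {"region2": ["e3"]},
         {"regionX": ["e4"]}]
print(compact_hlogs(hlogs, {"region1", "region2"}))  # -> {'regionX': ['e2', 'e4']}
```

This removes the need to force-flush lagging regions into tiny HFiles just to let old HLogs go away.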
3. Adjusted Delete Semantic
Scenario 1:
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can't be read out
Scenario 2:
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out
Fix: "a delete can't mask KVs with a larger mvcc (i.e. puts that happen later)"
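The adjusted rule can be stated in a few lines: a delete masks a put at the same row/timestamp only when the put's mvcc is not larger than the delete's. A sketch (KV tuples and the `visible_puts` helper are illustrative):

```python
def visible_puts(kvs):
    """kvs: list of (type, ts, mvcc). Return the (ts, mvcc) of puts that survive."""
    deletes = [(ts, mvcc) for t, ts, mvcc in kvs if t == "delete"]
    return [(ts, mvcc) for t, ts, mvcc in kvs
            if t == "put" and not any(dts == ts and dmvcc >= mvcc
                                      for dts, dmvcc in deletes)]

# Scenario 1: put@t0 (mvcc 1), delete@t0 (mvcc 2, flushed), put@t0 again (mvcc 3)
print(visible_puts([("put", "t0", 1), ("delete", "t0", 2), ("put", "t0", 3)]))
# -> [('t0', 3)]
```

Under the old semantic the second put was masked while the delete survived in an hfile, and became visible again only after a major compaction dropped the delete; with the mvcc check, both scenarios agree.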
4. Coordinated Compaction
Compaction consumes a global resource (HDFS), but each RegionServer decides locally whether to compact -> compact storms!
Coordinated Compaction
Each RS asks the master "Can I?"; the master answers OK or NO.
Compaction is scheduled by the master: no compact storms any longer.
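At its core this is an admission-control handshake. A minimal sketch (the `CompactionMaster` class and its permit limit are hypothetical, not the HBASE-9528 protocol):

```python
class CompactionMaster:
    """Grant at most max_concurrent compactions across the whole cluster."""
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.running = set()

    def can_i(self, rs):                 # an RS asks: "Can I?"
        if len(self.running) < self.max_concurrent:
            self.running.add(rs)
            return True                  # "OK"
        return False                     # "NO": HDFS is already busy enough

    def done(self, rs):                  # the RS reports its compaction finished
        self.running.discard(rs)

master = CompactionMaster(max_concurrent=2)
answers = [master.can_i(rs) for rs in ("rs1", "rs2", "rs3")]
print(answers)  # -> [True, True, False]
```

Because the permit count is global, simultaneous local decisions can no longer saturate HDFS all at once.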
5. Quorum Master
Today: one active master plus a standby, with info/states kept in ZooKeeper (zk1/zk2/zk3) and read from there.
While the active master serves, the standby master stays 'really' idle.
When the standby master becomes active, it needs to rebuild the in-memory state.
Quorum Master
Instead: a quorum of masters (Master 1/2/3) replicating state among themselves, with the RegionServers talking to the active one.
Better master failover performance: no phase to rebuild in-memory state
No external (ZooKeeper) dependency, no potential consistency issue, simpler deployment
Better restart performance for a BIG cluster (10K+ regions)
Acknowledgement
Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li,
Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen
Thank You!