Outline
Introduction
Latency practice
Some patches we contributed
Some ongoing patches
Q&A
www.mi.com
About Xiaomi
Mobile internet company founded in 2010
Sold 18.7 million phones in 2013
Over $5 billion revenue in 2013
Sold 11 million phones in Q1, 2014
About Our HBase Team
Founded in October 2012
5 members: Liang Xie, Shaohui Liu, Jianwei Cui, Liangliang He, Honghua Feng
Resolved 130+ JIRAs so far
Our Clusters and Scenarios
15 clusters: 9 online / 2 processing / 4 test
Scenarios: MiCloud, MiPush, MiTalk, Perf Counter
Our Latency Pain Points
Java GC
Stable page write in OS layer
Slow buffered IO (FS journal IO)
Read/Write IO contention
HBase GC Practice
Bucket cache with off-heap mode
Xmn / SurvivorRatio / MaxTenuringThreshold tuning
PretenureSizeThreshold & replication source size
GC concurrent thread number
GC time per day: [2500, 3000]s -> [300, 600]s !!!
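For illustration only, the knobs above correspond to HotSpot flags along these lines (the values are hypothetical, not Xiaomi's production settings):

```
-Xmx16g -Xmn2g                      # total heap / young generation size
-XX:SurvivorRatio=4                 # survivor space sizing in the young gen
-XX:MaxTenuringThreshold=3          # promote surviving objects sooner
-XX:PretenureSizeThreshold=2097152  # allocate big objects (e.g. replication
                                    # source buffers) straight into the old gen
-XX:ParallelGCThreads=8             # stop-the-world GC thread count
-XX:ConcGCThreads=4                 # concurrent GC thread count
-XX:+UseConcMarkSweepGC
```

The point of the talk is that careful tuning of exactly these knobs, plus moving the block cache off-heap, cut GC time per day by roughly an order of magnitude.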
Write Latency Spikes
HBase client: put -> HRegion.batchMutate -> HLog.sync -> SequenceFileLogWriter.sync -> DFSOutputStream.flushOrSync -> DFSOutputStream.waitForAckedSeqno  <- stuck here often!
DataNode pipeline write, in BlockReceiver.receivePacket():
-> receiveNextPacket
-> mirrorPacketTo(mirrorOut)  // write packet to the mirror
-> out.write/flush  // write data to local disk  <- buffered IO
Added instrumentation (HDFS-6110) showed the stalled write was the culprit; strace results confirmed it as well.
Root Cause of Write Latency Spikes
write() is expected to be fast
But it is sometimes blocked by write-back!
Stable Page Write Issue Workaround
Workaround: kernel 2.6.32-279 (RHEL 6.3) -> 2.6.32-220 (6.2), or 2.6.32-279 (6.3) -> 2.6.32-358 (6.4)
Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
Root Cause of Write Latency Spikes
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]
XFS in recent kernels can relieve the journal IO blocking issue; it is more friendly to metadata-heavy workloads like HBase + HDFS.
Write Latency Spikes Testing
8 YCSB threads; write 20 million rows, each 3 * 200 bytes; 3 DataNodes; kernel 3.12.17
Collected statistics on stalled write() calls costing > 100 ms
The largest write() latency on ext4: ~600 ms!
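The stall accounting above amounts to a simple filter over measured latencies. A minimal sketch (the function name and sample data are made up, standing in for HDFS-6110-style instrumentation output):

```python
def stall_stats(latencies_ms, threshold_ms=100):
    """Count the write() calls slower than threshold_ms and report the worst one."""
    stalls = [l for l in latencies_ms if l > threshold_ms]
    return len(stalls), max(stalls, default=0)

# e.g. per-call write() latencies in ms (fake data for illustration)
samples = [2, 5, 3, 140, 7, 600, 4, 260]
print(stall_stats(samples))  # -> (3, 600)
```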
Other Meaningful Latency Work
Long first "put" issue (HBASE-10010)
Token invalid (HDFS-5637)
Retry/timeout settings in DFSClient
Reduce write traffic? (HLog compression)
HDFS IO priority (HADOOP-10410)
Wish List
Real-time HDFS, especially priority related
GC-friendly core data structures
More off-heap; Shenandoah GC
TCP / disk IO characteristic analysis
Need more eyes on the OS layer
Stay tuned…
Some Patches Xiaomi Contributed
New write thread model (HBASE-8755)
Reverse scan (HBASE-4811)
Per table/CF replication (HBASE-8751)
Block index key optimization (HBASE-7845)
1. New Write Thread Model
Old model: many WriteHandlers (e.g. 256), each one appending to the local buffer, writing to HDFS, and syncing to HDFS by itself.
Problem: WriteHandler does everything, severe lock contention!
New Write Thread Model
New model: the WriteHandlers (e.g. 256) only append edits to the local buffer; a single AsyncWriter writes to HDFS; a small group of AsyncSyncer threads syncs to HDFS; an AsyncNotifier notifies the waiting WriteHandlers.
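The new model is essentially a producer/consumer pipeline. A drastically simplified single-process sketch (the queues, thread counts, and list-based "HDFS" are illustrative, not the actual HBASE-8755 code):

```python
import queue
import threading

buf = queue.Queue()        # local buffer, filled by the WriteHandlers
to_sync = queue.Queue()    # written-but-not-yet-durable edits
wal, durable = [], []      # stand-ins for the HDFS-side file state

def async_writer():        # single AsyncWriter: drain the buffer, write to "HDFS"
    while True:
        edit = buf.get()
        if edit is None:
            to_sync.put(None)
            return
        wal.append(edit)   # written, not yet durable
        to_sync.put(edit)

def async_syncer():        # AsyncSyncer: make written edits durable
    while True:
        edit = to_sync.get()
        if edit is None:
            return
        durable.append(edit)   # an AsyncNotifier would now wake the blocked handlers

writer = threading.Thread(target=async_writer)
syncer = threading.Thread(target=async_syncer)
writer.start(); syncer.start()
for i in range(100):       # the WriteHandlers now merely enqueue edits
    buf.put(f"edit-{i}")
buf.put(None)              # shutdown signal, for this sketch only
writer.join(); syncer.join()
print(len(durable))  # -> 100
```

The win is that the expensive write and sync paths are each owned by dedicated threads, so hundreds of handlers no longer race for the same HLog lock.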
2. Reverse Scan
[Figure: three sorted KV files (rows Row1..Row6, several KVs per row), one store scanner per file]
1. All scanners seek to ‘previous’ rows (SeekBefore)
2. Figure out next row : max ‘previous’ row
3. All scanners seek to first KV of next row (SeekTo)
Performance : 70% of forward scan
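The three-step loop above can be sketched against in-memory sorted files (the `reverse_scan` helper and list-based scanners are illustrative, not the HBASE-4811 code; the data mirrors the slide's figure):

```python
def reverse_scan(files, start_row):
    """Walk rows in reverse order across several sorted KV files."""
    cur = start_row
    while True:
        # 1. every scanner seeks to its last row strictly before `cur` (SeekBefore)
        prevs = [next((r for r, _ in reversed(f) if r < cur), None) for f in files]
        prevs = [p for p in prevs if p is not None]
        if not prevs:
            return
        # 2. the next row to emit is the max of those 'previous' rows
        nxt = max(prevs)
        # 3. every scanner seeks to the first KV of that row (SeekTo); emit the row
        yield nxt, [kv for f in files for r, kv in f if r == nxt]
        cur = nxt

f1 = [("Row2", "kv2"), ("Row3", "kv1"), ("Row3", "kv3"),
      ("Row4", "kv2"), ("Row4", "kv5"), ("Row5", "kv2")]
f2 = [("Row1", "kv2"), ("Row3", "kv2"), ("Row3", "kv4"),
      ("Row4", "kv4"), ("Row4", "kv6"), ("Row5", "kv3")]
f3 = [("Row1", "kv1"), ("Row2", "kv1"), ("Row2", "kv3"),
      ("Row4", "kv1"), ("Row4", "kv3"), ("Row6", "kv1")]

rows = [r for r, _ in reverse_scan([f1, f2, f3], "Row9")]
print(rows)  # -> ['Row6', 'Row5', 'Row4', 'Row3', 'Row2', 'Row1']
```

The extra SeekBefore per step is why a reverse scan runs at roughly 70% of forward-scan speed.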
3. Per Table/CF Replication
Source cluster has T1: cfA, cfB and T2: cfX, cfY; PeerA is a full backup, PeerB only wants T2:cfX.
If PeerB creates T2 only: replication can't work!
If PeerB creates both T1 & T2: all data gets replicated!
Need a way to specify which data to replicate!
Per Table/CF Replication
add_peer 'PeerA', 'PeerA_ZK'            (PeerA receives everything: T1:cfA,cfB; T2:cfX,cfY)
add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'  (PeerB receives only T2:cfX)
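Conceptually, the replication source just filters each WAL edit against the peer's table/CF map before shipping it. A sketch (data structures are illustrative, not HBASE-8751's actual representation):

```python
# Peer config: None means "replicate everything"; otherwise table -> set of CFs.
peers = {
    "PeerA": None,                 # add_peer 'PeerA', 'PeerA_ZK'
    "PeerB": {"T2": {"cfX"}},      # add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
}

def should_replicate(peer_cfg, table, cf):
    if peer_cfg is None:           # backup peer: ship every edit
        return True
    if table not in peer_cfg:
        return False
    cfs = peer_cfg[table]
    return cfs is None or cf in cfs

edits = [("T1", "cfA"), ("T1", "cfB"), ("T2", "cfX"), ("T2", "cfY")]
print([e for e in edits if should_replicate(peers["PeerB"], *e)])  # -> [('T2', 'cfX')]
```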
4. Block Index Key Optimization
Block 1 ends with k1 = "ab"; Block 2 starts with k2 = "ah, hello world".
Before: 'Block 2' block index key = "ah, hello world/…"
Now: 'Block 2' block index key = "ac/…" (any fake key with k1 < key <= k2 works)
Reduces block index size
Saves seeking the previous block when the search key is in ['ac', 'ah, hello world']
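The fake key is the shortest byte string strictly greater than the last key of the previous block and no greater than the first key of this block. A byte-string sketch of that computation (HBase's real code works on Cells, so this is only the idea):

```python
def shortest_separator(left: bytes, right: bytes) -> bytes:
    """Return a short key k with left < k <= right, usable as a block index key."""
    assert left < right
    i = 0
    while i < len(left) and i < len(right) and left[i] == right[i]:
        i += 1                      # length of the common prefix
    if i == len(left):
        return right                # left is a prefix of right: cannot shorten
    # at the first differing byte, left[i] < right[i], so bumping it keeps k <= right
    return left[:i] + bytes([left[i] + 1])

print(shortest_separator(b"ab", b"ah, hello world"))  # -> b'ac'
```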
Some Ongoing Patches
Cross-table, cross-row transaction (HBASE-10999)
HLog compactor (HBASE-9873)
Adjusted delete semantic (HBASE-8721)
Coordinated compaction (HBASE-9528)
Quorum master (HBASE-10296)
1. Cross-Row Transaction: Themis
http://github.com/xiaomi/themis
Based on Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
Two-phase commit: strong cross-table/cross-row consistency
Global timestamp server: globally, strictly incremental timestamps
No touch to HBase internals: built on the HBase client and coprocessors
Performance: read 90%, write 23% (same downgrade as Google Percolator)
More details: HBASE-10999
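A toy, single-process rendition of the Percolator-style two-phase commit (the data/lock/write column layout follows the paper; all names and structures here are illustrative, not Themis code):

```python
import itertools

tso = itertools.count(1)   # "global timestamp server": strictly increasing
rows = {}                  # row -> {"data": {start_ts: value}, "lock": ts or None,
                           #         "write": {commit_ts: start_ts}}

def _row(r):
    return rows.setdefault(r, {"data": {}, "lock": None, "write": {}})

def prewrite(writes, start_ts):
    # phase 1: check conflicts on every row, then lock and buffer the data
    for r in writes:
        row = _row(r)
        if row["lock"] is not None or any(c > start_ts for c in row["write"]):
            return False   # conflicting lock or newer committed write: abort
    for r, v in writes.items():
        row = _row(r)
        row["lock"] = start_ts
        row["data"][start_ts] = v
    return True

def commit(writes, start_ts, commit_ts):
    # phase 2: replace each row's lock with a write record
    for r in writes:
        row = _row(r)
        row["lock"] = None
        row["write"][commit_ts] = start_ts

def get(r, read_ts):
    # read the newest committed version visible at read_ts
    row = _row(r)
    c = max((c for c in row["write"] if c <= read_ts), default=None)
    return None if c is None else row["data"][row["write"][c]]

start = next(tso)
w = {"rowX": "a", "rowY": "b"}     # one cross-row transaction
if prewrite(w, start):
    commit(w, start, next(tso))
print(get("rowX", next(tso)), get("rowY", next(tso)))  # -> a b
```

The real system does the per-row steps atomically via coprocessors and handles crashed clients by lock cleanup; this sketch only shows the commit protocol's shape.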
2. HLog Compactor
Region x: few writes, but they scatter across many HLogs (HLog 1, 2, 3 …)
PeriodicMemstoreFlusher: forcefully flushes old memstores
'flushCheckInterval' / 'flushPerChanges': hard to configure
Forced flushes result in 'tiny' HFiles
HBASE-10499: a problematic region can't be flushed at all!
HLog Compactor
Compact: HLog 1,2,3,4 -> HLog x
Archive: HLog 1,2,3,4
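The idea in miniature: copy into a new HLog only the edits of regions whose data is not yet flushed, then archive the old HLogs. A sketch with hypothetical structures (an HLog as a region -> edits map):

```python
def compact_hlogs(hlogs, flushed_regions):
    """Merge old HLogs into one, keeping only edits of still-unflushed regions."""
    compacted = {}                                 # "HLog x"
    for hlog in hlogs:                             # HLog 1, 2, 3, 4 ...
        for region, edits in hlog.items():
            if region not in flushed_regions:      # e.g. the rarely-written Region x
                compacted.setdefault(region, []).extend(edits)
    return compacted                               # old HLogs can now be archived

hlogs = [{"region1": ["e1"], "regionX": ["e2"]},
         {"region2": ["e3"]},
         {"regionX": ["e4"]}]
print(compact_hlogs(hlogs, {"region1", "region2"}))  # -> {'regionX': ['e2', 'e4']}
```

This removes the need to force-flush lagging regions into tiny HFiles just to let old HLogs go away.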
3. Adjusted Delete Semantic
Scenario 1:
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can't be read out
Scenario 2:
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out
Fix: "a delete can't mask KVs with a larger mvcc (i.e. puts that happen later)"
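The adjusted rule can be stated in a few lines: a delete masks a put at the same row/timestamp only when the put's mvcc is not larger than the delete's. A sketch (KV tuples and the `visible_puts` helper are illustrative):

```python
def visible_puts(kvs):
    """kvs: list of (type, ts, mvcc). Return the (ts, mvcc) of puts that survive."""
    deletes = [(ts, mvcc) for t, ts, mvcc in kvs if t == "delete"]
    return [(ts, mvcc) for t, ts, mvcc in kvs
            if t == "put" and not any(dts == ts and dmvcc >= mvcc
                                      for dts, dmvcc in deletes)]

# Scenario 1: put@t0 (mvcc 1), delete@t0 (mvcc 2, flushed), put@t0 again (mvcc 3)
print(visible_puts([("put", "t0", 1), ("delete", "t0", 2), ("put", "t0", 3)]))
# -> [('t0', 3)]
```

Under the old semantic the second put was masked while the delete survived in an hfile, and became visible again only after a major compaction dropped the delete; with the mvcc check, both scenarios agree.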
4. Coordinated Compaction
Compaction consumes a global resource (HDFS), but each RegionServer decides locally whether to compact -> compact storms!
Coordinated Compaction
Each RS asks the master "Can I?"; the master answers OK or NO.
Compaction is scheduled by the master: no compact storms any longer.
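At its core this is an admission-control handshake. A minimal sketch (the `CompactionMaster` class and its permit limit are hypothetical, not the HBASE-9528 protocol):

```python
class CompactionMaster:
    """Grant at most max_concurrent compactions across the whole cluster."""
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.running = set()

    def can_i(self, rs):                 # an RS asks: "Can I?"
        if len(self.running) < self.max_concurrent:
            self.running.add(rs)
            return True                  # "OK"
        return False                     # "NO": HDFS is already busy enough

    def done(self, rs):                  # the RS reports its compaction finished
        self.running.discard(rs)

master = CompactionMaster(max_concurrent=2)
answers = [master.can_i(rs) for rs in ("rs1", "rs2", "rs3")]
print(answers)  # -> [True, True, False]
```

Because the permit count is global, simultaneous local decisions can no longer saturate HDFS all at once.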
5. Quorum Master
Today: one active master plus a standby, with info/states kept in ZooKeeper (zk1/zk2/zk3) and read from there.
While the active master serves, the standby master stays 'really' idle.
When the standby master becomes active, it needs to rebuild the in-memory state.
Quorum Master
Instead: a quorum of masters (Master 1/2/3) replicating state among themselves, with the RegionServers talking to the active one.
Better master failover performance: no phase to rebuild in-memory state
No external (ZooKeeper) dependency, no potential consistency issue, simpler deployment
Better restart performance for a BIG cluster (10K+ regions)
Acknowledgement
Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li,
Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen
Thank You!