Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
1
Dhananjoy Das NVM Compression: Hybrid Flash-‐Aware Applica;on Level Compression
Sr. Architect, Advanced Development Group. 5th October 2014
2
Presenta;on Outline
§ Background § Architecture § Building blocks § EvaluaEon/Benchmarks § Summary § Q & A
Flash TranslaEon Layer
Name Space/FileSystem
ApplicaEon (DB)
NVM-‐Compression solu;on stack
3
Background Flash TranslaEon Layer
Name Space/FileSystem
ApplicaEon (DB)
4
Problem: Exis;ng MySQL ROW Compression
100%
20%
80%
0%
20%
40%
60%
80%
100%
Uncompressed Row Compressed
TransacEon Rate Compression Performance Penalty
5
ROW-‐Compression § Uncompressed data is stored in 16K pages
§ 16K pages are compressed into a fixed compressed page size of 1K, 2K, 4K, 8K
– Compressed page size is chosen at table creaEon
(CREATE TABLE .. KEY_BLOCK_SIZE=8/4/2)
§ Table updates appended to Page ModificaEon Log (mlog) at the end of the compressed (8K) page
§ When mlog gets full, page is recompressed
Uncompressed Data 16K
Compressed Data mlog 8K
Upd
ate
Insert
Insert
Compressed Data
mlog 8K
Insert
6
Row-‐Compression Insert Failures Split – Rebalance – Recompress
Compressed Data
8K
Upd
ate
Insert
Insert
Insert
>8K Compressed Data
Compressed Data mlog 8K
Compressed Data mlog 8K
Insert
+ Insert § Merge + Compress operaEon fails to fit
within compressed block size
§ Page is split abempEng to merge contents, triggers an abempt to rebalance the tree
7
Row-‐Compression Architectural Drawbacks § Memory
– Space: Uncompressed & Compressed pages stored in DRAM buffer pool
– Access: Updates are applied to both copies in memory
§ Capacity cap – Fixed compression page size -‐ sets a upper bound on compression
§ Outcome – Poor performance on standard benchmarks – Poor adopEon (Less than 1% of MySQL users use compression)
8
NVM Compression Architecture
9
NVM Compression High level approach
§ ApplicaEon operates on sparse address space which is always the size of uncompressed.
§ Compressed data block is wriben in place at same virtual address as the un-‐compressed. Leaving a hole, empty space in the remainder of space allocated.
§ FTL garbage collecEon naturally coalesces the addresses in physical space, allowing for re-‐provisioning of physical space.
10
NVM Primi;ves Primi;ves Details
Sparse Sparse address space, allows for mapping and allocaEon of blocks on FLASH only as required. Also termed as “Thin provisioning”.
PTRIMs Persistent TRIMs, unlike regular TRIMs which are only hint to garbage-‐collector, are persistent and enforced across crashes to unmap block that can be reallocated.
Atomics Atomic-‐write guarantees that no part of the buffer will be parEally wriben.
11
NVM-‐Compression (page-‐compression)
§ Only store uncompressed 16KB pages in memory
§ Update to storage results in compression.
§ Trim unused 512B sectors in compressed page
§ 16K uncompressed data is now compressed & “thin-‐provisioned” down to 3.5K on Flash
Compressed Data
16K Unused
16K = (32) 512B Sectors
3.5K = (7) 512B Sectors on Flash
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
512B
Uncompressed Data 16K
. . . . . . .
12
Flush opera;on Page Header
Data-‐page Header
Index Data
ROW Data start
end
Footer Page checksum
Garbage/ Unused
Compressed Page Header
Compressed Index Data
Garbage/ Unused
Compressed Page header
Compressed Index Data Compress
+ Add Header
TRIM unused region
13
NVM-‐Compression Architectural Benefits § Uses natural sparseness of the underlying storage to avoid packing of data
§ Compression is performed only when dirty page is flushed § Unused sectors are PTRIM(ed)/unmapped, for reuse § Decompression is done on page read, and only uncompressed data is available in buffer-‐pool pages
§ MulEple compression algorithms are pluggable
Modular design and simplified func;onality
14
Building Blocks Flash TranslaEon Layer
NVMFS
MySQL (DB)
15
System Interfaces
15
NVM-Compression uses existing system Interfaces
System Interface Func;onality
fallocate(offset, len) PreallocaEon and extending files/table space
fallocate(PUNCH_HOLE) Unmap/Punch hole operaEon. (issue Persistent TRIMs)
io_submit() AIO Transparent Atomic writes
16
New: Mul;-‐Threaded Flush Framework
17
New: Mul;-‐Threaded Flush Framework
Flushing Thread
enqueue wqe (per Page-‐pool flush) dequeue request
Compress + DIO (per Page) report compleEons
MT-‐worker threads
1 2
4 3
queue
§ Producer/consumer framework for servicing scalable/distributed operaEon, can be used for purposes other than NVM-‐Compression
§ Required by NVM-‐Compression, allows for scalable & efficient servicing of writes
CPU#0 CPU#1 CPU#n
NVM Compression uses MT-‐Flush framework to perform compression and PTRIM operaEon in parallel criEcal for individual page-‐flush latency and overall MySQL transacEonal throughput
18
NVMFS § Non Vola;le Memory FileSystem § It’s a POSIX compliant filesystem, designed by SanDisk § Designed to meet the needs of data intensive workloads
§ Strengths include – Direct I/O on large, data intensive workloads (DB workload) – Pre-‐allocaEon of large files is efficient – No fragmentaEon/degradaEon with aging of filesystem – Re-‐exports features/Primi;ves of the underlying flash
19
NVMFS – Elimina;ng Duplicate Logic
NaEve Flash TranslaEon Layer block allocaEon, mapping, recycling
ACID updates, logging/journaling, crash-‐recovery
NVMFS file metadata mgmt
Kernel block layer
kernel-‐space
user-‐space
Ext3 file metadata mgmt, block allocaEon, mapping, recycling, ACID updates, logging/journaling, crash-‐recovery
PrimiEve Interfaces
Applica;on
Linux VFS (virtual file system) abstracEon layer
20
MySQL already uses Transparent Atomics
NVM Primi;ves
21
Atomic Writes (no double write) Tradi;onal MySQL Writes MySQL with Atomic Writes
Page C Page
B
Page A
Buffer
DRAM Buffer
SSD (or HDD)
Database Server
Page C
Page B
Page A
Updates to pages A, B, C
1
Copied to memory buffer.
2
Writes to double-‐write buffer
3
Once acknowledged, commit to the database.
4
ioMemory Database
MySQL writes ONCE to database, commisng the transacEon with inherent atomicity through Atomic Writes API
3
Page C Page
B
Page A
Page C
Page B
Page A
Page C
Page B
Page A
DRAM Buffer
Database Server
Page C
Page B
Page A
Page C
Page B
Page A
Updates to pages A, B, C
1
Copied to memory buffer.
2
2 Writes 1 Atomic Write
Database
Database
22
Benchmarks
23
Benchmarks Used § LinkBench
§ Percona tool tpcc-‐mysql (TPC-‐c like) § Other Micro-‐benchmarks
24
Storage Savings
Row-‐comp Page-‐comp Delta-‐Uncomp% 49.0% 58.5%
44.0%
46.0%
48.0%
50.0%
52.0%
54.0%
56.0%
58.0%
60.0%
% im
provem
ent
LinkBench 10x workload Uncompressed vs. Table with ROW, NVM Compression
* The data show is using zlib(lz77), storage efficiency with lz4 are comparable delta ~3%
Cau;on: ResulEng compression raEo depends on Entropy of data along with the algorithm used
NVM Compression supports mulEple pluggable compression methods including lz77, lz4, lzo
25
LinkBench 99% OP latencies
0
10
20
30
40
50
60
ADD_NODE UPDATE_NODE DELETE_NODE GET_NODE ADD_LINK DELETE_LINK UPDATE_LINK COUNT_LINK
LinkBench 10x Mean OP latency (msec)
Uncomp/NVM-‐comp/Row-‐comp
Uncompressed NVM-‐Comp ROW-‐Comp
NVM Compression transacEon latency is 2x-‐5x beber compared to ROW Compression and is closed to Uncompressed
26
Benchmark: LinkBench
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
Uncomp Page-‐Comp Row-‐Comp
MariaDB 10.0.9/InnoDB 10x Workload (110GB, 50GB-‐buf-‐pool) Uncomp vs. Page-‐Comp vs. Row-‐Comp
NVM Compression transacEon throughput is 2.25x beber than ROW Compression and is <5% of Uncompressed
27
Scalability with Cores
42750 40751
18930
69134 67542
28259
Uncomp Page-‐Comp Row-‐Comp
LinkBench 10x workload (110GB, 50GB-‐buf-‐pool) Uncomp vs. Page-‐Comp vs. Row-‐Comp
24-‐[email protected] GHz 32-‐[email protected] GHz
42.7K 40.7K
18.9K
69.1K 67.5K
28.2K
NVM Compression transacEon throughput scales well with the core count (MT-‐Flush)
28
Benchmark: (TPCC-‐like) NVM-‐Compression
0
5000
10000
15000
20000
25000
30000
Time
130
260
390
520
650
780
910
1040
1170
1300
1430
1560
1690
1820
1950
2080
2210
2340
2470
2600
2730
2860
2990
3120
3250
3380
3510
New Order TX
TPC-‐C like workload MariaDB 10
1000 warehouses -‐ 75GB DRAM
MySQL uncompressed
MySQL compression
Fusion-‐io Compression
Time
NVM
NVM Compression performance is 7x beber than ROW Compression and close to Uncompressed workload perf
29
0
1
2
3
4
5
6
7
8
nvmfs xfs
Elap
sed Time
(normalize
d)
NVMFS Performance -‐ TRIM Handling
Micro-‐benchmark § Trim aver write § 16 KiB Direct Write
+ 4KiB TRIM Applica;ons § MySQL Page-‐compression
PTRIMs help with efficient garbage collecEon, reallocaEon of unused blocks and reduces WA
30
Flash Endurance Power/Cooling
31
Fewer Writes/OP
*Lower is beber
LinkBench 10x workload 50GB buf-‐pool Uncompressed vs. NVM-‐Compression
32
4x Bejer Flash Endurance § Compression
– 2x fewer writes to flash
§ Atomics – 2x fewer writes by disabling double-‐writes
§ Persistent TRIMs – Lower write amplificaEon benefits for flash
Usable for flash with fewer write cycles, like TLC/3BPC
33
Power/Cooling
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
% peak
uncompressed
LinkBench 10x workload 50GB buf-‐pool Uncompressed vs. NVM-‐Compression
PG-‐Wabs
Uncmp-‐Wab
Delta-‐Temp
Time
Fewer writes to Flash with NVM Compression improves savings on power usage and cooling
34
Summary NVM-‐Compression
– Design combines applicaEon level compression with flash awareness – Compression is performed by applicaEon at the point of page-‐flush – Leverages FTL primiEves for block management and garbage collecEon – System standard interfaces (PTRIMs/Atomic-‐write) – Pluggable compression libraries
Benchmark OLTP workload results – Storage saving 2x+ – Scalable & superior performance
35
Lessons learnt § Reducing writes can have dramaEc performance benefit – even if it comes at the cost of CPU based compression.
§ CapabiliEes embedded in flash FTLs, like high performance GC, can be used to replace complex ApplicaEon level block management very effecEvely.
§ File system support is required to maintain manageability. File systems passing and implemenEng primiEves efficiently is criEcal for performance.
36
Q & A
37
Thank You
38
Backup Slides
39
Announcements § Early access NVM-‐Compression soluEon
§ Early access NVMFS
§ Release by community Vendor Release Download
SkySQL MariaDB 10.0.9 hbp://bazaar.launchpad.net/~jplindst/maria/10.0-‐FusionIO
Percona Percona Server 5.6 hbp://code.launchpad.net/~gl-‐az/percona-‐server/5.6-‐pagecomp_mxlush
Oracle MySQL 5.7.4 DMR (Atomics) hbp://dev.mysql.com
MySQL Labs release (Compression) hbp://labs.mysql.com/
40
Read Index-‐Data-‐BLOCK (16K) Page Header
Data-‐page Header
Index Data ROW Data start
end
Footer Page checksum
Garbage/ Unused
UnMapped Region
(zero filled)
Compressed Page header
Compressed Index Data DeCompress
+ Fill User Buffer
41
How to NVM Compress
42
How to NVM-‐Compress § Setup MySQL (/etc/my.cnf) § Create Table opEons § Compression staEsEcs
43
MySQL Seong /etc/my.cnf I/O engines supported • Innodb • XtraDB Compression Protocols supported • LZ4 • LZ77 • LZMO
44
Create Table Sample # 1 (MariaDB 10.0.9)
Enable PAGE_COMPRESSION
Reserved Keywords to use for create table opEons • PAGE_COMPRESSION • PAGE_COMPRESSION_LEVEL
45
Reserved Keywords to use for create table opEons • DIRECTORY • ATOMICS_WRITES • PAGE_COMPRESS op;ons..
Create Table under specified directory
Enable Page-‐compression + Atomic writes
Create Table sample #2 (MariaDB 10.0.9)
46
NVM Compression: Sta;s;cs § Show status like “Innodb_page_compression_%”
§ Show status like “Innodb_n%” +-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+ | Variable_name | Value | +-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+ | Innodb_page_compression_saved | 6905823838720 | | Innodb_page_compression_trim_sect512 | 20665471315 | | Innodb_page_compression_trim_sect4096 | 1939752607 | +-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+ +-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+ | Variable_name | Value | +-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+ | Innodb_num_index_pages_written | 1092477800 | | Innodb_num_pages_page_compressed | 877050078 | | Innodb_num_page_compressed_trim_op | 448698818 | | Innodb_num_page_compressed_trim_op_saved | 429143865 | | Innodb_num_pages_page_decompressed | 873119717 | +-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
47
LinkBench Seong (1x vs. 10x)
1x workload
• Bump up the workload from default 1x (10Mil) to 10x using “maxid”
• Change runEme, using “max;me”
48
OK What Happened with Latency ?