Accelerate Finger Printing in Data Deduplication · 2019-12-21 · Finger printing algorithms in ZFS Finger printing algorithms: Flecher2/4: good performance, but bad collision rate

2016 Storage Developer Conference. © Intel Corporation. All Rights Reserved.

Accelerate Finger Printing in Data Deduplication

Xiaodong Liu & Qihua Dai Intel Corporation


INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: https://www-ssl.intel.com/content/www/us/en/design/resource-design-center.html

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel, Intel logo, Look Inside, Intel Inside, Intel Inside logo, the Look Inside Logo, Intel Atom, and Intel Xeon are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to change without notice.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/homepage.html.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Copyright © 2016, Intel Corporation. All rights reserved.

Legal Disclaimer

2

https://www-ssl.intel.com/content/www/us/en/design/resource-design-center.html

http://www.intel.com/content/www/us/en/homepage.html


Agenda

Data Deduplication in ZFS Multi-buffer Technique Based on IA Limitations of Multi-buffer Hash in Linux Kernel Accelerating finger printing in ZFS Data

Deduplication Performance Result of ZFS Data Deduplication

3


Data Deduplication in ZFS

Data chunking in block size(512B~1MB) and inline.

Finger printing algorithms: SHA256, Fletcher2/4

4

1010110010

CHUNK DATA

INCOMING

110110010

0010110010

0010

1010

1110

FINGER PRINTING

DATASET INDEXING

0011 1010

H B

1110 0010

C D

0111 1011

E F

0100 1010

g B

1110 0010

C D

0111 1011

E F

0110 1010

A B

1110 0010

C D

0111 1011

E F

DATA STORING


Data Deduplication in ZFS

5

Open

……

BP_INIT

CHECKSUM

DVA ALLOCATE

DONE

……

01101010

AB

11100010

CD

01111011

EF

01101010

AB

11100010

CD

01111011

EF

01101010

AB

11100010

CD

01111011

EF

1. data block checksum compute

2. dedup table select

3. dedup entry lookup

DDTs are loaded/stored with ZAP methods

Key = checksum + property

checksum proc

key

4. number of copies needed

?

DDT WRITE

ZIO pipeline stages

AVL tree

ZIO thread：


Finger printing algorithms in ZFS Finger printing algorithms: Flecher2/4: good

performance, but bad collision rate

SHA256: good collision rate, widely used, but low performance and heavy CPU cost.

6

2026

0

6508

205

F L E T C H E R 2 F L E T C H E R 4 S H A 2 5 6

PERFORMANCE HASH (MB/S)

All results collected by Intel Corporation. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations (http://www.intel.com/performance/resources/benchmark_limitations.htm).

http://www.intel.com/performance/resources/benchmark_limitations.htm


Multi-buffer Technique Based on IA

Two basic ways that processing multiple buffers in parallel SIMD approach Non-SIMD approach

7


SIMD approach Processing multiple buffers in parallel to reduce data dependency

limits Based on SIMD registers and instructions

8

SIMD register

X0 X1 X2 X3

Y0 Y1 Y2 Y3

Typical SIMD instruction

Parallel process independent buffers


9

Parallel utilization of the processor’s execution resources (AES-CBC-Encrypt with AES-NI)

Taking full advantage of execution unit resources – Function Stitching

Non-SIMD approach


A typical implementation of Multi-Buffer Hash in Intel® Intelligent Storage Acceleration Library (Intel® ISA-L):

10

Scheduler of ISA-L Multi-buffer Hash

INIT

Full

Flush

IDLE

1. Empty job queue 2. Getting 2 new jobs in job queue on hold

3.Processing the full job queue

4.Processing the partial job queue

Process

Process

submit

New job Partial finished jobFull finished job

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries


8.7 [VALUE]

6.5

14.5

0.0

2.0

4.0

6.0

8.0

10.0

12.0

14.0

16.0

0

10

20

30

40

50

60

70

SHA-1 SHA-256 SHA-512 MD5

100

MB/

s

MULTI - BUFFER PERFORMANCE G AIN

OpenSSL ISA-L ratio

Great Performance Gain

11

SIMD multi-buffer tech: up to 14.5x performance improvement from Intel® ISA-L multi-buffer hash

Parallel utilize CPU execution units: Intel® ISA-L Sha1 & murmur3 function stitching with 288bit digest has 6.9x higher throughput than OpenSSL SHA1

6.9

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

0

500

1000

1500

2000

2500

3000

3500

Ratio

Perf

orm

ance

(MB/

s)

OpenSSL SHA1 & Multi-Hash Sha1 + Murmur

F U N C T I O N S T I T C H I N G




Multi-buffer hash in Linux Kernel Crypto Framework (LKCF)

Intel® ISA-L Multi-buffer SHA1, SHA256, SHA512 are integrated The unified LKCF hash APIs Multi-buffer framework in LKCF:

The jobs in mcryptd_queue are assigned to the work daemon in workqueue For sha256 AVX2 version, work daemon is hold until 8 jobs submitted, or timeout

(usually 4ms)

12

Workqueue mcryptd_queue

Work daemon

CPU0cpu_workqueue

work

CPU0mcryptd_cpu_queue

Work daemon

CPU1cpu_workqueue


Work daemon

CPU2cpu_workqueue


Timeout,Flushed work

Full work

work

work

Hash requset



13

INIT

Full

Flush

IDLE4ms

1. Empty job queue

2. Getting the partial job queue on hold

3. Processing the full job queue

4. Processing the partial job queue

Process

Process

submit

New job Partial finished jobFull finished job

5. Processing 1 job

partial job queue (< 8 jobs) always holds until 4ms timeout and flush.

full job queue (= 8 jobs) gets processed immediately

The limitations of multi-buffer hash in LKCF


14

Multi-buffer sha256 never gets the better performance than single-buffer sha256 for <8 jobs.

The limitations of multi-buffer hash in LKCF

Intel® Xeon® Processor E5-2650v3 @ 2.3 GHz 1 Socket Single-buffer sha256

(single core) Multi-buffer sha256 in LKCF

(single core) 1 job 183MB/s 24.3MB/s

2 jobs 183MB/s 48.6MB/s 3 jobs 183MB/s 72.9MB/s 4 jobs 183MB/s 97.2MB/s 5 jobs 183MB/s 121.5MB/s 6 jobs 183MB/s 145.8MB/s 7 jobs 183MB/s 170.1MB/s 8 jobs 183MB/s 877MB/s




Finger Printing Accelerating in ZFS Data Deduplication SHA256 in ZFS data deduplication is traditional Single-buffer

Suitable for light workload High compute resource cost on heavy workload

Using Multi-buffer sha256 in LKCF for ZFS data deduplication Always has the timeout overhead for < 8 jobs (AVX2), and <16 jobs

(AVX512) Not suitable for file system

Our acceleration solution for finger printing accelerating for ZFS data deduplication Basically leverage multi-buffer and single buffer functions adaptively Break through advanced usage considerations.

15


Totally changed to asynchronize mode internally. Components:

Hash thread pool for the slave threads Hash request queues with master thread attached Different hash request queues for the different lengths of the job.

Consumer-producer system: Producer: ZIO threads Consumer: master and slave hash threads

16

Finger Printing Accelerating in ZFS Data Deduplication

……

……

Master Thread

Request Queues

Thread Pool

Slave ThreadsZIOthread

ZIOthread


17

• ZIO threads submit sha256 jobs to the request queues. • Once the jobs submitted, several threads (master and/or slave

threads) will fetch and process the jobs using single-buffer and/or multi-buffer sha256 according to the number of jobs

Finger Printing Accelerating in ZFS Data Deduplication

Fingerprint request

……

ZIOthread

ZIOthread

ZIOthread

ZIOthread

ZIOthread


18

The different length of jobs

APP

32KB

64KB 128KB 256KB

…… ……

ZIOthread


Users’ Different Preference

19

User A

User B

Heavy workload with lots of reqsI want high efficiency!

I want low latency!

Three job in queue

User A prefers each IO has low latency User B prefers high resource usage efficiency (low CPU

utilization)


20

Keep 1 thread in execution state and fetch jobs as many as posible

Keep more threads in execution state and fetch 1 job if possible

User A

User B Three job in queue

I want high efficiency!

I want low latency!

Parameter M to be used to adjust the system to get better latency or better efficiency.

M is the number of primary threads

Latency Or Efficiency


Tolerable waiting time for long length buffer.

21

Single buffer hash routine

Multi buffer hash routine

New requests in queue

A long processing time for new requests to wait

For longer block size, processing time is longer New requests in queue have to tolerate a high

waiting time.


Enable snoop mechanism and add parameter of segment size

22


New requests in queue

Segment_size = 128K

2. snoop after segment size data is processed

Multi buffer hash routine

Segment_size = 128K

3. transform to multi-buffer hash, snoop after segment size data is processed


1. snoop and single buffer process

Segment_size = 128K

Tolerable waiting time for long length buffer.


Performance Result of ZFS Data Deduplication Running Dedup workload by DEDISbench

23

With enough computation resource, the accelerated Sha256 in ZFS uses 1/3 CPU resource, and has same throughput. (Achieved disk boundary)

With 4 CPU cores, the accelerated Sha256 in ZFS gets 2.4X throughput and 3/4 CPU resource. (Achieved CPU boundary)

CPU usage(%) Thoughput(MB/s)

[CELLREF]

[CELLREF]

[CELLREF]

[CELLREF]

36 cores

Accelerated SHA256 in ZFS Existing SHA256 in ZFS

CPU usage(%) Thoughput(MB/s)

[CELLREF]

[CELLREF] [CELLREF]

[CELLREF]

4 Cores





Running FIO workload

24

• ZFS with the accelerated Sha256 gets 3X per core throughput

write

read

randwrite

randread

0.050.0

100.0150.0200.0250.0300.0350.0400.0450.0


275.9

113.3

440.8

132.0

251.3

77.1

333.7

123.3

Per Core Throughput(MB/s)

write read randwrite randread




Conclusion:

Great performance improvement: Saved Computation Resource x Throughput Gain >= 2.5X in

ZFS data deduplication

Our Design is a proper solution for ZFS to do finger printing in data deduplication

Plan to upstream this design to ZFS It’s a general framework to benefit other data

deduplication applications

25


Thank you

26


Appendix

27

Multi-buffer hash code: Intel® ISA-L(both user mode and kernel

mode) https://github.com/01org/isa-l https://github.com/01org/isa-l_crypto

LKCF*(only for kernel mode) Kernel_top_path/arch/x86/crypto/sha1_mb Kernel_top_path/arch/x86/crypto/sha256_mb Kernel_top_path/arch/x86/crypto/sha512_mb


https://github.com/01org/isa-l

https://github.com/01org/isa-l_crypto


Intel® ISA-L performance table

28

ISA-L Function

Next Generation Intel® Xeon® Processor (Codenamed Skylake) @ 2.0 GHz 1 Socket ISA-L OpenSSL 1.0.2g

Cycle/Byte Performance (lower is better)

Single Core Throughput (higher is better)

Cycle/Byte Performance (lower is better)

Single Core Throughput (higher is better)

Multihash SHA-1** 0.46 4.2 GB/s - - Multihash SHA-1 Murmur** 0.63 3.1 GB/s - - Multibuffer SHA-1** 0.50 3.9 GB/s 4.45 450 MB/s Multibuffer SHA-256** 0.93 2.1 GB/s 11.89 168 MB/s Multibuffer SHA-512** 1.15 1.7 GB/s 7.64 262 MB/s Multibuffer MD5** 0.34 5.8 GB/s 4.99 401 MB/s

BIOS Configuration

P-States: Disabled

Turbo: Disabled

Speed Step: Disabled

C-States: Disabled

Power Performance Tuning: Disabled

ENERGY_PERF_BIAS_CFG: PERF

Isochronous: Disabled

Memory Power Savings: Disabled

Up to 13X bandwidth

boost

Up to 15X bandwidth

boost

** ISA-L function uses AVX512 instructions




Performance test configuration for ZFS data deduplication

29

DEDISbench command: time DEDISbench -

g/etc/dedisbench/dist_highperf -w -c4 -d/pool/fs/ -b131072 -z -o./act_distrib -p -s256000 -f2560000

time DEDISbench -g/etc/dedisbench/dist_highperf -w -c10 -d/pool/fs/ -b131072 -z -o./act_distrib -p -s256000 -f2560000

Fio command: [global] runtime=100 time_based group_reporting end_fsync=1 directory=/pool/fs rw=write ? randwrite ? read ? randread bs=128k ioengine=libaio size=2G iodepth=64 numjobs=64 [zfstest] thread=1

ZFS version: SPL-0.6.5.6，ZFS-0.6.5.6 Linux version: Ubuntu 14.04.4 LTS, kernel

4.4.11 Test tool version: fio-2.2.10, DEDISbench-

v1.0.0 Hardware:

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz MEM: DDR4, 2133MHz, 16GB * 8 DISK: INTEL SSD S3500 480GB * 7

ZFS configuration: zpool create pool sdb sdc sdd sde sda sdg sdh -f zfs create pool/fs zfs set recordsize=128k pool/fs zfs set primarycache=metadata pool/fs zfs set secondarycache=metadata pool/fs zfs set checksum=sha256 pool/fs zfs set dedup=sha256 pool/fs

ZFS modification: Change zio_taskq_batch_pc from 75 to 100 for ZFS

original SHA256 performance test Change zio_taskqs[WRITE][ISSUE] from ZTI_BATCH

to ZTI_N(40) for ZFS sha256-mb performance test


30

Referenced paper:

Processing Multiple Buffers in Parallel to Increase Performance on Intel ® Architecture Processors https://www.baidu.com/link?url=W1Txipv0F-MEYbY31p9YkTDUuenJznKOzpBZi2ByRn1belmhPgodxMmE2EzZrL4uYDLkbwbyTQhSkIFrSn-EXoaGbn8K9LMKBc-FvWCFkqt5BQBS7Q0wDaQb8AxeNHbnR8Slc9FiGEST6dOk1sfEZ-eMAAx-M-G9uJaMmoBk3-C&wd=&eqid=904bfa1900030a870000000257c3bda2

Fast Cryptographic Computation on Intel® Architecture Processors Via Function Stitching https://www.baidu.com/link?url=idGx1TveIgzKkXsVVRJmgaFut08N1nL6UcvpyLvT-qne1mXcptrhU88xXhXeiXIzUodQxOPa9WV77QKHy1R1NQMRMz6Kvi5lHAsSh87-iRUv5WqYNUZ93opRWNI_BQacCj8l7d-dh80B6v6ZTeOpSEGT23SVoxMLse_wwg3fwOK&wd=&eqid=fdced91e0003730f0000000257c3bdd9

Multi-Hash A Family of Cryptographic Hash Algorithm Extensions https://www.baidu.com/link?url=UbNFRBE9pqrJARlXNNaInUTYS1LOZsHYqnxYJlbPXC56gozPVo_ok2c0fayCU_yj2SHWN0BLeOeLZisaGNvvhyj-3T86p7X5E85UcWBjUeEeNkK6_3hvChQvDiS7iPkxAoeTDOq4YxQX0BsLiy7tr_&wd=&eqid=b07ae58200036f170000000257c3bdfa

https://www.baidu.com/link?url=W1Txipv0F-MEYbY31p9YkTDUuenJznKOzpBZi2ByRn1belmhPgodxMmE2EzZrL4uYDLkbwbyTQhSkIFrSn-EXoaGbn8K9LMKBc-FvWCFkqt5BQBS7Q0wDaQb8AxeNHbnR8Slc9FiGEST6dOk1sfEZ-eMAAx-M-G9uJaMmoBk3-C&wd=&eqid=904bfa1900030a870000000257c3bda2
















https://www.baidu.com/link?url=idGx1TveIgzKkXsVVRJmgaFut08N1nL6UcvpyLvT-qne1mXcptrhU88xXhXeiXIzUodQxOPa9WV77QKHy1R1NQMRMz6Kvi5lHAsSh87-iRUv5WqYNUZ93opRWNI_BQacCj8l7d-dh80B6v6ZTeOpSEGT23SVoxMLse_wwg3fwOK&wd=&eqid=fdced91e0003730f0000000257c3bdd9








https://www.baidu.com/link?url=UbNFRBE9pqrJARlXNNaInUTYS1LOZsHYqnxYJlbPXC56gozPVo_ok2c0fayCU_yj2SHWN0BLeOeLZisaGNvvhyj-3T86p7X5E85UcWBjUeEeNkK6_3hvChQvDiS7iPkxAoeTDOq4YxQX0BsLiy7tr_&wd=&eqid=b07ae58200036f170000000257c3bdfa







Documents

Accelerate Finger Printing in Data Deduplication · 2019-12-21 · Finger printing algorithms in ZFS Finger printing algorithms: Flecher2/4: good performance, but bad collision rate