Leverage Similarity and Locality to Enhance Fingerprint Prefetching of Data Deduplication

Yongtao Zhou, Yuhui Deng, Junjie Xie

Department of Computer Science, Jinan University, Guangzhou, 510632, P. R. China

ICPADS 2014

Agenda

• Introduction

• Related work

• Motivation

• System overview

• Evaluation

Introduction

• IDC: 95% of the data in backup systems and 75% of the data across the digital world is redundant. This redundancy consumes IT resources and expensive network bandwidth.

• Data deduplication: eliminates redundant data by storing only one copy of the data

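[Figure: deduplication workflow. Files are split into data blocks by a chunk algorithm, each data block is hashed with MD5 to produce a fingerprint, and the fingerprints are indexed in a hash table or a B+ tree.]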

The fingerprints are too large to all be stored in memory
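The workflow on this slide can be illustrated with a minimal sketch. It is not the LessFS implementation: the function names `chunk_fixed` and `deduplicate` are hypothetical, the chunking is fixed-size, and a Python dict stands in for the hash table or B+ tree index.

```python
import hashlib


def chunk_fixed(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks (the last chunk may be shorter)."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def deduplicate(data: bytes, index: dict, chunk_size: int = 4096):
    """Store each unique chunk once and return the file recipe.

    `index` maps fingerprint -> chunk; it is an assumed in-memory
    stand-in for the fingerprint index of a real system.
    """
    recipe = []
    for chunk in chunk_fixed(data, chunk_size):
        fp = hashlib.md5(chunk).digest()  # 16-byte MD5 fingerprint
        if fp not in index:               # unseen fingerprint: store the chunk
            index[fp] = chunk
        recipe.append(fp)                 # duplicates only add a reference
    return recipe


# Three 4KB chunks, two of which are identical.
index = {}
data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
recipe = deduplicate(data, index)
restored = b"".join(index[fp] for fp in recipe)
```

Here the recipe holds three fingerprints while the index stores only two chunks, and joining the chunks named by the recipe restores the original data.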

Introduction

• Querying fingerprints incurs a disk bottleneck

The fingerprints are too large to be cached in memory

The cache hit ratio is very low (fingerprints lack temporal locality)

The IOPS of disk drives is limited

Example: 800TB of unique data, MD5 signatures, 8KB average chunk size

A large portion of the fingerprints has to be stored on disk drives
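The scale of this example can be checked with a short calculation. The sketch below assumes binary units (1TB = 1024^4 bytes, 1KB = 1024 bytes) and counts only the 16-byte MD5 digests, ignoring any per-entry index metadata; the function name is hypothetical.

```python
MD5_DIGEST_BYTES = 16  # an MD5 digest is 128 bits

KB = 1024
TB = 1024 ** 4


def fingerprint_index_bytes(unique_bytes, avg_chunk_bytes,
                            digest_bytes=MD5_DIGEST_BYTES):
    """Raw size of all fingerprints, excluding index metadata."""
    num_chunks = unique_bytes // avg_chunk_bytes
    return num_chunks * digest_bytes


size = fingerprint_index_bytes(800 * TB, 8 * KB)
print(size / TB)  # 1.5625
```

Under these assumptions the fingerprints alone occupy roughly 1.6TB, far more than the memory of a typical server.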

Locality based strategies: DDFS

• Locality: segments tend to reappear in the same or very similar sequences with other segments. This is because most data in a backup is carried over from the previous backup with only slight modifications.

[Figure: DDFS fingerprint lookup for a stream of segments, with a Bloom filter and an index shown in RAM.]

Poor deduplication performance when there is little or no locality in datasets

Prefetching!!!!

Similarity based method: Extreme Binning

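[Figure: Extreme Binning. Files are mapped to C1, C2, ..., Cn; the similarity index is held in RAM, and the bins Bin(C1), Bin(C2), ..., Bin(Cn) are held on disk.]

Fails to identify, and thus remove, significant amounts of redundant data when there is a lack of similarity among files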

Extreme Binning uses a two-level index structure: similarity characteristic values at the first level and bins at the second. It stores the similarity characteristic values in RAM. Extreme Binning only identifies redundant data within the same bin, even though neighbouring bins may contain identical data blocks. As a result, some redundant data blocks are stored more than once, which degrades the deduplication ratio. The deduplication ratio of Extreme Binning therefore relies heavily on the degree of similarity in the data streams.
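A minimal sketch of the two-level index described above follows. It assumes that the similarity characteristic value of a file is its minimum chunk fingerprint, as in the Extreme Binning paper; the class name `BinningStore` is hypothetical, and in-memory sets stand in for the on-disk bins.

```python
import hashlib


def fingerprints(chunks):
    """MD5 fingerprint of every chunk in a file."""
    return [hashlib.md5(c).digest() for c in chunks]


class BinningStore:
    def __init__(self):
        # Level 1 (RAM): similarity value -> bin.
        # Level 2 (disk in a real system): each bin is a set of fingerprints.
        self.primary = {}
        self.stored_chunks = 0

    def backup(self, chunks):
        """Deduplicate one file against a single bin; return new chunks stored."""
        fps = fingerprints(chunks)
        representative = min(fps)  # assumed similarity characteristic value
        bin_ = self.primary.setdefault(representative, set())
        new = 0
        for fp in fps:
            if fp not in bin_:     # only this bin is consulted
                bin_.add(fp)
                new += 1
        self.stored_chunks += new
        return new


store = BinningStore()
file_v1 = [b"alpha", b"beta", b"gamma"]
first = store.backup(file_v1)    # every chunk is new
second = store.backup(file_v1)   # same file, same bin: nothing new
```

Because `backup` consults only the bin selected by the representative value, a chunk that already exists in a different bin is stored again. That is the source of the missed redundancy noted on this slide.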

SSD based approach

• Fingerprint lookup hits a disk bottleneck: the IOPS of disk drives is limited

• Some studies alleviate the disk bottleneck by using SSDs: Dedupv1, ChunkStash

• SSDs are still very expensive compared with disk drives

• The performance of small random writes becomes a new bottleneck on SSDs

HDD VS SSD

Our approach

• A fingerprint prefetching approach that uses file similarity to enhance deduplication performance

• The locality of fingerprints is maintained by arranging the fingerprints in the order of the backup data stream

• The overheads of different similarity identification algorithms are investigated, and the impact of those algorithms on data deduplication is evaluated against previous studies (Extreme Binning, Silo, FPP)

• This approach does not impact the deduplication ratio

System architecture

[Figure: system architecture, with one part implemented in LessFS and one part implemented in Tokyo Cabinet.]

Storage structure for fingerprints

Loss of the locality of fingerprints

The locality of fingerprints is maintained by arranging the fingerprints in the order of the backup data stream
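The interplay of stream-ordered storage and similarity-driven prefetching can be sketched as follows. This is an illustration under stated assumptions, not the LessFS or Tokyo Cabinet code: the class `PrefetchingIndex` and all of its fields are hypothetical, the similarity signature is supplied by the caller (standing in for FPP, PAS or Simhash), and the cache is simplified to hold only the fingerprints prefetched for the current file.

```python
import hashlib


class PrefetchingIndex:
    def __init__(self):
        self.disk_index = set()  # stand-in for the full on-disk fingerprint index
        self.stream_log = {}     # file id -> fingerprints in backup-stream order
        self.similar = {}        # similarity signature -> most recent file id
        self.disk_lookups = 0    # random index lookups that reached the disk
        self.next_id = 0

    def backup(self, signature, chunks):
        """Deduplicate one file; return the number of new chunks stored."""
        # Prefetch: one sequential read of the similar file's fingerprints.
        cache = set()
        similar_id = self.similar.get(signature)
        if similar_id is not None:
            cache.update(self.stream_log[similar_id])

        fps = [hashlib.md5(c).digest() for c in chunks]
        new_chunks = 0
        for fp in fps:
            if fp in cache:              # hit: no disk access needed
                continue
            self.disk_lookups += 1       # miss: fall back to the full index
            if fp not in self.disk_index:
                self.disk_index.add(fp)
                new_chunks += 1
            cache.add(fp)

        # Record the fingerprints in stream order for future prefetching.
        self.stream_log[self.next_id] = fps
        self.similar[signature] = self.next_id
        self.next_id += 1
        return new_chunks


idx = PrefetchingIndex()
new_v1 = idx.backup("sigA", [b"a", b"b", b"c"])   # nothing to prefetch yet
lookups_v1 = idx.disk_lookups
new_v2 = idx.backup("sigA", [b"a", b"b", b"x"])   # prefetches a, b, c
lookups_v2 = idx.disk_lookups
new_v3 = idx.backup("sigB", [b"a", b"b", b"c"])   # no similar file found
```

In this sketch every cache miss still consults the full index, so duplicates are found even when no similar file is identified. Prefetching changes only how many lookups reach the disk, which is consistent with the claim that the approach does not affect the deduplication ratio.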

The process of fingerprints prefetching

Evaluation

• We implement a real prototype based on LessFS and Tokyo Cabinet

• Three similarity identification algorithms, FPP, PAS and Simhash, are implemented in the Similar File Identification Module

• Platform: Ubuntu operating system (kernel version 3.5.0-17), 1GB memory, 2.4GHz Intel(R) Xeon(R) CPU

• Following DDFS, we take four full backups to evaluate the system

• Four data sets, backup1, backup2, backup3 and backup4, are used for the evaluation. Their sizes are 10GB, 15GB, 20GB and 25GB, and they contain 3073, 4694, 6539 and 9910 files, respectively

• We choose a fixed-size chunking algorithm. The chunk sizes are 4KB, 8KB, 16KB, 32KB, 64KB and 128KB

FPP and PAS

Simhash

• Simhash is a member of the locality-sensitive hashing family

• Simhash has the property that the fingerprints of similar files differ in a small number of bit positions

• It runs in practice in the Google web search engine
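The bit-voting idea behind Simhash can be sketched in a few lines. This is a generic, unweighted version for illustration, not the implementation evaluated in these slides: the feature set (a list of tokens), the use of MD5 as the per-feature hash, and the function names are all assumptions.

```python
import hashlib


def simhash(tokens, bits=64):
    """Unweighted Simhash of a list of string tokens.

    `bits` must be a multiple of 8 and at most 128, because the
    per-token hash is taken from the leading bytes of an MD5 digest.
    """
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:                      # majority of votes set this bit
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc = "the quick brown fox jumps over the lazy dog".split()
fp = simhash(doc)
fp_reordered = simhash(list(reversed(doc)))
```

Two files would be treated as similar when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold is a tuning parameter that this sketch leaves open.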

Data sets

The file size distribution matches that of previous studies.

Deduplication ratio

• We measure the size of the unique data blocks produced with three different similarity identification algorithms (FPP, PAS and Simhash) over four full backups

• When the chunk size is 4KB, the unique data blocks total 14GB and the data deduplication ratio is 3.93 in all three cases

• The performance is the same as that of the baseline system LessFS.

Time overhead of fingerprint lookup

• $T_s$: the time of similarity detection

• $T_p$: the time of fingerprint prefetch

• $T_l$: the time of fingerprint lookup

• The overall overhead of fingerprint lookup: $T = T_s + T_p + T_l$

• For Base, $T_s = T_p = 0$, so $T = T_l$

Time overhead of fingerprint lookup

CPU utilization

Memory utilization

Conclusion

• We propose a fingerprint prefetching approach that preserves the locality of fingerprints in the order of the backup data stream and takes advantage of file similarity

• The proposed method effectively alleviates the disk bottleneck of fingerprint lookup with acceptable CPU, memory and storage overhead, thus improving the throughput of data deduplication

• The approach does not impact the data deduplication ratio

References

• SSD: http://en.wikipedia.org/wiki/Solid-state_drive

• HDD vs SSD: http://www.diffen.com/difference/HDD_vs_SSD

• D. Bhagwat, K. Eshghi, D. D. Long, and M. Lillibridge, “Extreme binning: Scalable, parallel deduplication for chunk-based file backup,” in Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009. MASCOTS’09. IEEE International Symposium on. IEEE, 2009, pp. 1–9.

• W. Xia, H. Jiang, D. Feng, and Y. Hua, “Silo: a similarity-locality based near-exact deduplication scheme with low ram overhead and high throughput,” in Proceedings of the 2011 USENIX conference on USENIX annual technical conference. USENIX Association, 2011, pp. 26–28.

• A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, “Min-wise independent permutations,” Journal of Computer and System Sciences, vol. 60, no. 3, pp. 630–659, 2000.

• Y. Zhou, Y. Deng, X. Chen, and J. Xie, “Identifying file similarity in large data sets by modulo file length,” in Proceedings of the 14th International Conference on Algorithms and Architectures for Parallel Processing. IEEE, 2014.

• D. Meister and A. Brinkmann, “dedupv1: Improving deduplication throughput using solid state drives (ssd),” in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010, pp. 1–6.

• B. Debnath, S. Sengupta, and J. Li, “Chunkstash: speeding up inline storage deduplication using flash memory,” in Proceedings of the 2010 USENIX conference on USENIX annual technical conference. USENIX Association, 2010, pp. 16–16.

• Y. Deng, “What is the future of disk drives, death or rebirth?” ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 23, 2011.

• B. Zhu, K. Li, and R. H. Patterson, “Avoiding the disk bottleneck in the data domain deduplication file system,” in FAST, vol. 8, 2008, pp. 1–14.

• J. Gantz and D. Reinsel, “The digital universe decade-are you ready,” IDC iView, 2010.

• S. Quinlan and S. Dorward, “Venti: A new approach to archival storage,” in FAST, vol. 2, 2002, pp. 89–101.

• M. Ruijter, “Lessfs,” http://www.lessfs.com/wordpress/.

• F. Labs, “Tokyo cabinet,” http://fallabs.com/tokyocabinet/.

• M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, “Sparse indexing: Large scale, inline deduplication using sampling and locality,” in FAST, vol. 9, 2009, pp. 111–123.

• G. S. Manku, A. Jain, and A. Das Sarma, “Detecting near-duplicates for web crawling,” in Proceedings of the 16th international conference on World Wide Web. ACM, 2007, pp. 141–150.

Thank you!

Questions?
