J Supercomput (2018) 74:2035–2085
https://doi.org/10.1007/s11227-017-2210-8
Data deduplication techniques for efficient cloud storage management: a systematic review
Ravneet Kaur1 · Inderveer Chana1 · Jhilik Bhattacharya1
Published online: 20 December 2017
© Springer Science+Business Media, LLC, part of Springer Nature 2017
Abstract The exponential growth of digital data in cloud storage systems is currently a critical issue, as the large amount of duplicate data in such systems exerts an extra load on them. Deduplication is an efficient technique that has gained attention in large-scale storage systems. Deduplication eliminates redundant data, improves storage utilization and reduces storage cost. This paper presents a broad methodical literature review of existing data deduplication techniques along with various existing taxonomies of deduplication techniques for cloud data storage. Furthermore, the paper investigates deduplication techniques based on text and multimedia data along with their corresponding taxonomies, as these techniques pose different challenges for duplicate data detection. This research work is useful for identifying deduplication techniques based on text, image and video data. It also discusses existing challenges and significant research directions in deduplication for future researchers, and the article concludes with a summary of valuable suggestions for future enhancements in deduplication.
Keywords Data deduplication · Data reduction · Storage systems · Cloud computing · Big data
B Ravneet Kaur
[email protected]

Inderveer Chana
[email protected]

Jhilik Bhattacharya
[email protected]
1 Computer Science and Engineering Department, Thapar Institute of Engineering and Technology (Deemed to be University), Bhadson Road, Patiala 147004, India
1 Introduction
With the advent of cloud computing in recent years, data volumes in the cloud are increasing significantly due to the continued growth of the internet and the adoption of smartphones and social networking platforms. In 2011, the International Data Corporation (IDC) reported that the volume of data created and copied in the world will reach 35 ZB by 2020 [1]. Enterprises are facing problems in storing and processing such large data volumes. In order to enhance reliability and availability and provide disaster recovery, data are generally duplicated across multiple storage locations. Most of these duplicated data exert an extra load on the storage system in terms of additional space and the bandwidth needed to transfer the duplicated data over the network. Efficient data storage management is therefore critical, and data deduplication is considered an enabling technology for efficient storage [2] of big data. Deduplication is a special data compression technique [2] that eliminates redundant data and reduces network transmission and storage space in cloud storage systems [3,4]. These techniques find the duplicate data, save only one copy of it [2] and strategically use logical pointers for the duplicated data [3,5]. Deduplication addresses the growing demand for storage capacity [6]. Many cloud storage providers like Amazon S3, Bitcasa and Microsoft Azure [7] and backup services such as Dropbox and Memopal employ data deduplication techniques [8] to improve storage efficiency.
Deduplication techniques are data type specific, and different techniques are employed on different types of data such as text, image and video. All three types of data have different storage formats and implicit characteristics, so deduplication techniques use different processes to find and remove duplicate information depending on the type of data. The type of data is therefore important for the development of deduplication techniques, and the format of the information is critical for reading, finding and matching it. Bit-level matching is required to find duplication in executable files. The techniques to check duplicates in text, image and video involve different processes due to the varied formats of the data.
A minimum number of data replicas, called the replication factor, is maintained in a large distributed storage system to achieve high data availability. Any duplicate data above the replication factor are removed to reduce storage requirements, storage cost, computation and energy. Due to these significant benefits to industry, deduplication techniques for large distributed storage systems have gained momentum in academia and industry. Still, these techniques face challenges regarding the efficiency and efficacy of data matching, and researchers in academia and industry are working to develop efficient distributed deduplication techniques.
Figure 1 represents data deduplication, in which duplicate segments of the same data are reduced to unique segments. The whole file is divided into fixed- or variable-size segments. With the deduplication process, only a single copy of each segment is stored [10], and pointers are used for duplicate segments. If the deduplication engine comes across a piece of data that is already stored somewhere in the storage system, it saves, in place of the data copy, a pointer that leads back to the original copy. This frees up blocks in the storage system and thus frees memory space.
Fig. 1 Data deduplication process
1.1 Motivation
After assessing the current research in deduplication techniques, a need was felt to comprehensively review the literature available on the subject. This section summarizes the motivations, contributions and novelty of this article.
(i) The role of deduplication techniques in improving the performance of a large storage system has been discussed. The necessity of a deduplication technique and its merits and demerits have also been studied.
(ii) The existing deduplication techniques have been categorized as storage based, point-of-application based and level based. Further, deduplication techniques have been classified based on text, image and video. The parameters of text deduplication techniques and their importance have been explained. A comparison of text-, image- and video-based deduplication and their classification based on various taxonomies are presented. This paper thus reviews the literature and presents a comprehensive view of deduplication techniques.
(iii) Future research directions in the field of deduplication have been highlighted for researchers in academia and industry.
This paper draws on quality journals, proceedings of various conferences and reports of many research centers. The article is organized into six sections. Section 2 describes the background, the evolution of deduplication, redundant data reduction techniques and their comparison, and the merits and demerits of deduplication techniques. Section 3 presents the review method, research questions and the research methodology used to select and review the previous research material, along with a framework for analysis and discussion of that material. Section 4 presents a generic deduplication process and a taxonomy of deduplication techniques; the techniques are further categorized based on text, image and video. Section 5 discusses open challenges and future research directions in the area of deduplication techniques. Section 6 concludes the review and provides recommendations for future research.
2 Background
With the impetuous growth of data, the term big data came into existence, used mainly to describe massive datasets [9]. Big data typically includes unstructured data that require more real-time analysis compared to traditional datasets. The frequency with which data are generated is crucial and a major challenge to handle [10]. This section first provides the necessary background on traditional data compression approaches and data deduplication approaches. Detailed studies of data deduplication have been conducted by Paulo and Pereira [5] and Xia et al. [11]; both surveys have extensively explored data deduplication techniques in storage systems. This survey summarizes existing text-, image- and video-based deduplication techniques and is an enhancement of the existing surveys.
2.1 Evolution of deduplication
In the early 1950s, data reduction techniques [11,12] were introduced, which can be categorized into lossless and lossy [12] data reduction techniques. Later, in the 1990s, a space-efficient approach, intelligent delta compression, was proposed to target the compression of very similar files or similar chunks. The term data deduplication was coined in the early 2000s to help large storage systems at a high granularity level [1]. Deduplication eliminates both inter-file- and intra-file-level redundancy over large datasets across multiple distributed storage servers, unlike traditional data compression techniques, which remove redundancy over a small group of files based on intra-file redundancy. Cryptographic hashes of each file or chunk are calculated to identify the duplicates. These techniques were also applied to multimedia content in 2008, where duplicate multimedia content is found by measuring the similarity of images or video frames using feature extraction and hashing techniques. These data deduplication techniques came into existence to solve the issue of increasing data size in storage systems [13]. Huffman coding and dictionary coding are traditional data compression techniques that work at the byte or string level [11–13], while deduplication techniques eliminate redundancy at the file or chunk level.
Redundant data reduction techniques came into existence in the 1950s [11], largely as lossless and lossy data compression techniques, followed by delta compression in the 1990s. Data deduplication techniques followed in 2000, and multimedia deduplication later. Figure 2 shows the evolution of deduplication techniques, which are discussed in the following sections.
Fig. 2 Evolution of deduplication: Data Reduction Techniques (1950), Delta Encoding Techniques (1990), Data Deduplication Techniques (2000), Deduplication Techniques on Multimedia Data (2011)
2.2 Redundant data reduction techniques
These redundant data reduction techniques have been developed to cope with the increasing amount of digital data and to identify redundancy from the byte to the string level and from the chunk to the file level. The organization of redundant data reduction techniques and their evolution are shown in Fig. 3.
Data compression is a bit-rate reduction approach that represents information in compact form. It reduces the required storage space by finding redundant data. Data compression is broadly classified into lossless and lossy compression techniques [11,12]. In lossless compression, the exact original data are reconstructed from the compressed data. Lossy compression reduces the data by discarding unnecessary information, as in JPEG image compression; it reconstructs an approximation of the original data. Video and audio data are compressed using lossy compression techniques [12]. This section presents the necessary background on redundant data reduction techniques, showing the evolution of traditional lossless data compression approaches, delta compression techniques and data deduplication approaches. A taxonomy of all the approaches and their evolution is presented.
2.2.1 Lossless data compression techniques
The term data compression was coined by Claude E. Shannon. Techniques like entropy encoding, run-length encoding and dictionary encoding are lossless data compression techniques [14]. A string of characters is represented by short bit sequences; a large amount of redundant data is found in such strings and removed from the data patterns.
(a) Byte level The early data compression techniques use entropy encoding to identify redundancy at the byte level. Huffman coding and arithmetic coding are types of entropy encoding used to represent frequently occurring patterns with fewer bits. Huffman coding, developed by David A. Huffman, uses a frequency-sorted binary tree to generate the optimal prefix code [12]. It replaces fixed-length codes with variable-length codes [12,15], so frequently used symbols receive shorter encodings [12] than less frequently used symbols. Arithmetic coding, developed by Elias in 1960 [16], encodes an entire message into a fixed floating point number.
Fig. 3 Organization of redundant data reduction techniques
A string of characters is represented using a fixed number of bits per character [12,15,16]. Frequently used characters are stored with fewer bits, and less frequently used characters with more bits. The entropy encoding-based approaches have limited scalability and are not efficient for large storage systems.
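To make the frequency-sorted binary tree concrete, the following minimal Python sketch (illustrative only, not part of the original paper) builds Huffman codes so that frequent symbols receive shorter prefix codes:

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # One weighted leaf per distinct symbol: [weight, [symbol, code]].
        heap = [[w, [sym, ""]] for sym, w in sorted(Counter(text).items())]
        heapq.heapify(heap)
        if len(heap) == 1:                      # degenerate case: one symbol only
            return {heap[0][1][0]: "0"}
        while len(heap) > 1:
            lo = heapq.heappop(heap)            # two least frequent subtrees
            hi = heapq.heappop(heap)
            for pair in lo[1:]:
                pair[1] = "0" + pair[1]         # 0-branch for the lighter subtree
            for pair in hi[1:]:
                pair[1] = "1" + pair[1]         # 1-branch for the heavier subtree
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        return dict(heap[0][1:])

    codes = huffman_codes("abracadabra")
    # 'a' occurs 5 times and gets a shorter code than 'c' or 'd' (1 occurrence each)
    assert len(codes["a"]) < len(codes["c"])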
(b) String level A string-level approach was proposed to search for and eliminate repeated strings. The two main approaches at this level are LZ77/LZ78 and LZW/LZO [11]. LZ77/LZ78 [13,14], proposed by Lempel and Ziv in the 1970s [11], is a dictionary-based approach that uses a sliding window to detect and eliminate repeated sets of strings. LZW/LZO, proposed by Terry Welch in the 1980s, is a variant of LZ compression that speeds up [11] or improves the compression process. The organization of redundant data reduction techniques is presented in Fig. 3.
2.2.2 Delta compression techniques
In the 1990s, delta compression came into existence to target the compression of similar files or chunks [11]. Its most widely used applications are remote synchronization and backup storage systems. It uses a byte-wise sliding window to find matched strings between similar chunks; the differences between the sequential file and the complete file are stored in the form of "deltas" or "diffs" [11,13]. The string level is one of the delta compression techniques.
String level Xdelta and Zdelta are delta compression techniques that use a byte-wise sliding window [13] to identify duplicate strings between a source chunk and a target chunk for delta calculation. However, this is a very time-consuming approach and is not scalable.
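The copy/insert idea behind delta encoding can be sketched in a few lines of Python using the standard difflib module. This is an illustration of the principle only; real tools such as Xdelta use byte-wise sliding windows and much faster matching, and the helper names here are hypothetical:

    from difflib import SequenceMatcher

    def make_delta(source, target):
        # Encode target as COPY (reuse a source range) / INSERT (literal bytes) ops.
        ops = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(
                None, source, target, autojunk=False).get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2 - i1))
            elif tag in ("replace", "insert"):
                ops.append(("insert", target[j1:j2]))
            # 'delete' ranges exist only in the source and need no op
        return ops

    def apply_delta(source, ops):
        parts = []
        for op in ops:
            if op[0] == "copy":
                _, start, length = op
                parts.append(source[start:start + length])
            else:
                parts.append(op[1])
        return b"".join(parts)

    old = b"the quick brown fox jumps over the lazy dog"
    new = b"the quick red fox jumps over the lazy cat"
    delta = make_delta(old, new)    # far smaller than storing `new` whole
    assert apply_delta(old, delta) == new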
2.2.3 Data deduplication techniques
Data deduplication was first proposed in 2000 to support global compression at coarse granularity [11]. The previous approaches are very time-consuming in identifying similar chunks and are not scalable, whereas data deduplication techniques can be applied at the file level or sub-file level. Deduplication compresses data by using fixed- or variable-size chunks: the hash values of these chunks are generated using cryptographic hash functions, and duplicates are detected by matching hash values.
(a) File-level deduplication These techniques are applied at the file level, and the file is considered a single unit. The system checks the backup file index to compare the attributes stored for the file [3]. If the same file exists, it adds a pointer to the existing file; otherwise, it updates and stores the index value. Only one instance of the file is saved, so this is also called single-instance storage. Whole-file hashing is simple to apply: file hashes are easy to generate and require relatively little processing power. However, a change of even one byte in a file triggers the generation of a different hash value that requires separate storage. This issue of file-level deduplication led to the introduction of block-level deduplication techniques; a minimal sketch of whole-file deduplication follows.
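A minimal Python sketch of single-instance storage, under the assumption that files are identified purely by a SHA-256 digest of their content (the helper names are illustrative, not from the paper):

    import hashlib

    def file_digest(path):
        # SHA-256 over the whole file identifies its content.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    def single_instance_store(paths):
        store = {}                      # digest -> the one stored instance
        index = {}                      # every path -> digest (a logical pointer)
        for path in paths:
            digest = file_digest(path)
            if digest not in store:     # first occurrence: keep one copy
                store[digest] = path
            index[path] = digest        # duplicates only reference the copy
        return store, index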
(b) Sub-file-level (block-level) deduplication In these techniques, a file is broken into multiple smaller blocks of fixed or variable size [9].
MD5, SHA-1, Rabin fingerprinting and similar hash algorithms are used to identify identical blocks. A unique block is written to disk and its index is updated; otherwise, a pointer is added to the original location of the matching data block. Block-level deduplication requires more processing power because the number of identifiers that need to be processed increases significantly. It is further categorized as fixed-length or variable-length deduplication.
• Fixed-length block deduplication Fixed-length approaches examine blocks of data with a predetermined length, dividing files into fixed-size blocks [9]. The system does not back up the same block of data twice. The main advantage of this approach is its simplicity; however, a single character inserted into the data shifts all following data by one byte, so all subsequent data blocks are backed up again. This shortcoming of fixed-length block approaches led to the innovation of variable-length techniques.
• Variable-length block deduplication Variable-length deduplication divides the file into variable-length data blocks, using different methods to determine the block length. This allows the boundaries of data blocks to "float" within the data stream, so that changes in one part of a block have no impact on the boundaries at other block locations [9]. The file is partitioned in a content-dependent manner, and a segment may be any number of bytes in length within a range. This provides greater granularity control and flexibility to insert data into a block (see the sketch after this list).
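A minimal Python sketch of content-defined chunking under simplifying assumptions: a toy polynomial rolling hash stands in for true Rabin fingerprinting, and all parameter values are illustrative:

    def cdc_chunks(data, window=48, mask=0x0FFF, min_len=2048, max_len=16384):
        # Cut a chunk where a rolling hash over the last `window` bytes matches
        # a bit pattern, so boundaries depend on content, not on byte offsets.
        MOD, PRIME = 1 << 61, 257
        top = pow(PRIME, window - 1, MOD)   # weight of the byte leaving the window
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            if i - start < window:
                h = (h * PRIME + byte) % MOD                  # filling the window
            else:
                gone = data[i - window]
                h = ((h - gone * top) * PRIME + byte) % MOD   # slide by one byte
            length = i - start + 1
            if (length >= min_len and (h & mask) == mask) or length >= max_len:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])                       # trailing chunk
        return chunks

Because each cut point depends only on the bytes inside the rolling window, inserting one byte early in the stream changes only the chunks near the edit, whereas with fixed-size blocks the same insertion would shift every later boundary.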
2.3 Comparison of lossless compression, delta compression and data deduplication
Table 1 compares the redundant data reduction techniques based on parameters like target data, granularity level, approaches, scalability, evolution and processing time. The table is adapted and enhanced from tabular data in [13].
Table 1 Comparison of redundant data reduction technologies

Parameter | Lossless data compression | Delta compression | Data deduplication
Target data | All data | Similar data | Duplicate data
Granularity | Byte or string level | String level | Chunk or file level
Approaches | Huffman coding, dictionary coding | KMP-based copy and insert | Content-defined chunking, hashing or fingerprinting
Scalability | Weak | Weak | Strong
Evolution (year) | 1950s | 1990s | 2000
Processing time | High (Huffman/dictionary coding) | Optimized by using Rabin–Karp string matching | Less, as file- or sub-file-level deduplication is applied
2.4 Merits and demerits of deduplication
Deduplication has significant advantages in storage systems, although the techniques require resources to deploy. This paper highlights the important merits and demerits of deduplication techniques.
2.4.1 Merits of deduplication
The following merits of deduplication are identified and presented below:
(i) Reduces storage space Deduplication assists in reducing the storage space required for backups, files or other data applications. As only a unique copy of the data is stored and duplicate copies are removed, it creates more free space to store more data.
(ii) Improves network bandwidth As unique copies are stored on disk and logical pointers are created for duplicate data, there is no need to transmit duplicate copies over the network. Deduplication thus helps in reducing network bandwidth requirements.
(iii) Reduces energy consumption Deduplication is a storage optimization technique that reduces storage and energy requirements. The reduced storage space requires less electricity and cooling; thus, it saves energy and reduces the load on system resources.
(iv) Reduces overall storage cost Deduplication yields significant savings in terms of time, space, network bandwidth, human resources and budget, leading to better efficiency and efficacy of the storage system.
2.4.2 Demerits of deduplication
(i) Impact on storage performance In primary storage systems, the fixed-size approach leads to multiple chunks stored at different memory locations. This causes fragmentation, which adversely impacts performance. A deduplication technique also requires additional resources like CPU, memory and bandwidth for its execution, and any inefficient technique impacts the performance of a large storage system.
(ii) Loss of data integrity Data blocks are indexed through hash values for faster lookup. Identical hashes can be generated for different data blocks due to hash collisions, which can cause loss of data integrity. Hash collisions must therefore be carefully addressed to avoid any loss of data and its integrity.
(iii) Backup appliance issues Data deduplication may require a separate hardware device to transfer and process data. Such a backup appliance may incur additional cost and impact storage performance.
(iv) Privacy and security Deduplication techniques have full access to the complete storage, which can be exploited to gain complete access to it. The security of deduplication techniques should be carefully designed to guard the system from such security breaches and the loss of private data.
(v) Reduced availability Data are duplicated to provide high availability in a large distributed storage system, so any reduction of such duplicate copies affects the availability of the storage system. A minimum number of copies needs to be retained to maintain high availability.
3 Review method
The systematic review of deduplication techniques in large-scale storage systems is based on the guidelines proposed by Kitchenham et al. [17,18]. This section presents the purpose of this research and the steps involved in conducting the review: creating a review framework, elaborating the methodology in depth, discussing the findings and exploring new challenges.
3.1 Planning of review
The procedure involves the list of research questions on deduplication techniques in Sect. 3.2, the search of different databases, and the identification and analysis of the existing techniques. The primary studies were searched exhaustively, either through electronic database queries or through manual searches of refereed journals and conference proceedings. The set was then refined by exclusion criteria and identification of primary studies, followed by data extraction and synthesis.
3.2 Research questions
This systematic review focuses on the identification and classification of the existing literature on deduplication techniques. The review questions were identified after discussion with the co-authors and are listed in Table 2. The main objective of this review is to present the consolidated latest research work on data deduplication concepts by answering the review questions.
3.3 Study selection procedure
The search process was refined with the help of search keywords. The research literature found was selected based on a detailed survey focusing on the identification and classification of the existing literature on data deduplication, multimedia deduplication, deduplication techniques and their various challenges. The research methodology adopted in this study is based on finding relevant research papers in different databases and then listing the questions that are to be addressed.
3.4 Sources of information
Scholarly online electronic databases were selected to find relevant information from a wide range of publications. The following electronic databases were searched for research articles.
Table 2 Review questions and motivation

Sr. no. | Review questions | Motivation
1 | What is data deduplication? Discuss its evolution and its merits and demerits. (a) Discuss the deduplication process. (b) How can we categorize deduplication techniques? | It will explore data deduplication, its evolution, advantages, disadvantages and various categories, and will further help researchers to gain a deep understanding of the complete deduplication process. The main aim of this review is to clearly understand the concept, current status, issues and future requirements of deduplication that improve storage efficiency
2 | What are the deduplication techniques applied on text data? Discuss the categorization of text deduplication techniques and their key findings. | It will discuss text-based deduplication techniques and their key findings. Text-based deduplication techniques are further categorized based on parameters
3 | What are the deduplication techniques applied on multimedia? Discuss the categorization of image and video deduplication techniques and their key findings. | It will explore multimedia-based deduplication techniques, which are further categorized as image based and video based. These techniques have posed different challenges than text-based deduplication techniques; the aim is to gain deep knowledge of multimedia deduplication techniques
4 | Discuss the various research opportunities in the field of deduplication | It aims to provide information about open research issues for prospective researchers working in the field of deduplication. The open research areas and sub-areas of deduplication are discussed
Table 3 Search strings (2005–2017)

Sr. no. | Keywords | Synonyms
1 | Deduplication | Data deduplication
2 | Deduplication storage | Data deduplication in storage system
3 | Data deduplication | Data deduplication in cloud
4 | Deduplication architecture | Architecture of deduplication
5 | Deduplication techniques | Techniques of deduplication
6 | Deduplication tools | Simulation tools in deduplication in cloud
7 | Deduplication evolution | Review of existing research in deduplication
8 | Deduplication analysis | Analysis of research gaps in deduplication
9 | Deduplication comparison | Comparison of existing research
10 | Need for deduplication | Practical benefits of deduplication on storage systems
11 | Image-based deduplication | Virtual images deduplication on cloud
12 | Images deduplication | Image and video deduplication
13 | Deduplication similar images | Exact or near-exact deduplication
14 | Deduplication duplicate images | Image fingerprinting

Content types for all rows: Journal, Conference, Workshop, Magazine and Transaction (2005–2017).
• Springer
• IEEE Xplore
• Science Direct
• ACM Digital Library
• Elsevier
• Google Scholar
3.4.1 Other sources
Apart from the above sources, books, technical reports and online literature relevant to this review are included. The central objective is to widen the scope of literature coverage and give comprehensiveness to this review. The following sources are categorized:
• Other review articles
• Books and technical reports
• Tools and other online sources
3.5 Search criteria
The initial search criteria involve the titles ("deduplication"), ("techniques of deduplication"), ("deduplication techniques applied in cloud computing") and ("deduplication based on images"). The keyword "deduplication" appears in almost all the searched abstracts. The lookup for relevant information is quite wide in terms of period, coverage and quantity, which makes this a time-consuming review method.
Table 3 presents the search strings based on keywords, synonyms and period of search. Papers were included from different journals, conferences and workshops. Although we deliberated in choosing the search strings to ensure broadness, some research papers still did not appear in the searches, for example because an article did not contain a search string in its title or abstract. The research community also uses the phrases "deduplication in storage system" and "deduplication of images in storage system"; an attempt was therefore made to identify such articles and include them by searching manually with these keywords.
3.6 Inclusion and exclusion criteria
To ensure the coherence of our search, the search process was carried out extensively, and the selection procedure is shown in Fig. 4.
Fig. 4 Systematic review technique
Several research papers were excluded because either the title was inappropriate or the abstract did not provide sufficient detail. As the search strings "deduplication" and "storage system" are widely used in other fields of research, the quantity of irrelevant articles was large. As shown in Fig. 4, the initial search returned over 754 papers; these were reduced to 354 articles based on relevant titles, 205 were left after reading the abstract and conclusion, and 180 were left based on the full text, i.e., those relevant for the literature review. Finally, 128 papers were selected by the principle of inclusion and exclusion shown in Fig. 4.
4 Data deduplication
Data are growing exponentially [19,20] in cloud storage services, and these data are duplicated on distributed storage systems for high reliability, availability and disaster recovery. The minimum number of replications of data, called the replication factor, is important to guard the system against disasters and to provide high availability. Any copies beyond the replication factor should be removed from the storage system; otherwise, duplicate data exert more pressure on it in terms of space and bandwidth. To reduce or control this data duplication, deduplication techniques are applied to make the storage system more efficient in terms of cost and utilization. The application of deduplication techniques depends upon the type of data, such as structured, unstructured and semi-structured; these data can further be classified as text, image and video.
Duplicate data affect storage performance, the efficiency of the storage system and network bandwidth [21]. This has led researchers to focus on the development of optimized deduplication techniques for storage systems, including the deletion of duplicated data and efficient data delivery. Deduplication can be defined as a technique that automatically removes duplicate data in storage systems. Data reduction through deduplication has been reported by Microsoft and NetApp. Microsoft conducted a study on the file system to estimate the balance in space savings between whole-file- and sub-file-level deduplication on 857 desktop Windows machines over a period of 4 weeks [22]; in that study, whole-file-level deduplication achieves 75% of the space savings, while block-level deduplication achieves 32% of the original requirements. Data deduplication has also been applied to a digital library, where duplicated metadata bibliographic records were identified using similarity functions on two real datasets [23]: the metadata records of two real digital libraries (BDB-Comp and DBLP) and article citation data from the Cora collection. The results show that the quality of metadata deduplication improves from 2 to 62% on the digital library dataset and from 7 to 188% on the article citation dataset. NetApp reports that deduplication can reduce 95% of duplicate data in storage systems [24]; experimental results show typical savings of 95% for backup, 72% for VMware, 30% for email and 35% for file services [24].
4.1 Generic deduplication process
The generic process of deduplication consists of data chunking, fingerprint calculation, index lookup [10,25] and the chunk store, as shown in Fig. 5.
Fig. 5 Steps of deduplication process
Index lookup plays an important role in finding duplicate chunks [25]. In Fig. 5, the file processed for deduplication is first broken down into fixed- or variable-size blocks referred to as objects. Data deduplication compares blocks and eliminates those with the same fingerprints; the unique blocks are stored and the index is updated. The four generic steps of the deduplication process are as follows.
• A hash value is first calculated for each chunk of data using a cryptographic hash function.
• A comparison is made between the hash values of the chunks and the existing hashes.
• Matching hash values identify a duplicate chunk, and the data are replaced with a logical pointer to the object already present in the database.
• Otherwise, the new chunk is added and the index is updated.
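These four steps can be condensed into a short Python sketch. It is an illustration under simplifying assumptions (fixed-size chunks, SHA-256 fingerprints and an in-memory dictionary as the chunk store), not an implementation from the reviewed literature:

    import hashlib

    class DedupStore:
        def __init__(self, chunk_size=4096):
            self.chunk_size = chunk_size
            self.index = {}                 # fingerprint -> stored chunk bytes

        def write(self, data):
            recipe = []                     # logical pointers rebuilding the file
            for i in range(0, len(data), self.chunk_size):
                chunk = data[i:i + self.chunk_size]
                fp = hashlib.sha256(chunk).hexdigest()   # fingerprint the chunk
                if fp not in self.index:                 # index lookup
                    self.index[fp] = chunk               # store only new chunks
                recipe.append(fp)
            return recipe

        def read(self, recipe):
            return b"".join(self.index[fp] for fp in recipe)

    store = DedupStore()
    r1 = store.write(b"A" * 8192)
    r2 = store.write(b"A" * 8192)     # a duplicate write stores nothing new
    assert len(store.index) == 1 and store.read(r2) == b"A" * 8192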
4.2 Classification of deduplication techniques
The existing deduplication techniques have been classified based on storage type (primary and secondary deduplication), point of application (source and target based), processing time (inline and post-process deduplication), level (local- and global-level deduplication) and cloud. Figure 6 presents a taxonomy of data deduplication based on four parameters: storage based, type based, timing based and level based.
They are further classified as follows [4]:
• Storage based—primary and secondary storage-based deduplication
• Type based—source- and target-level-based deduplication
• Timing based—inline- and post-process-based deduplication
• Level based—local deduplication and distributed/global deduplication
Fig. 6 Taxonomy of data deduplication techniques
4.2.1 Storage-based deduplication
Deduplication has been classified based on the type of storage: it is applied on either primary storage [26] or secondary storage [27]. Table 4 depicts storage-based deduplication techniques.
• Primary storage Primary storage-based deduplication runs on main memory or active storage that is directly accessible to the CPU, which continuously reads and executes instructions as required. It is mainly used for primary workloads, which are latency sensitive [3,26]. The in-memory data of mail servers are an example of primary storage.
• Secondary storage This is an auxiliary or external storage system [27] that does not have direct access to the CPU. It backs up primary storage data for data protection and recovery, and such systems are accessed only for retrieving old data and for data retention. Examples are storage archives, snapshots and backup storage.
4.2.2 Type-based deduplication
The deduplication process is executed either on the source side or on the target side. Based on these two options, deduplication is characterized as source-based or target-based deduplication.
• Source-based deduplication The complete deduplication is performed on the data at the source side before the data are transferred to the backup target [28]. Software installed on the servers uses the CPU and memory of the source side and checks for duplicates before transferring data to the backup server. This reduces the bandwidth, storage and time required to back up data; at the same time, it consumes additional CPU and I/O resources to detect the duplicates.
Table 4 Deduplication applications used in different storage systems

Deduplication categories | Significance | Deduplication advantages | Tools and techniques
Primary storage systems | Main or active storage systems, used for primary workloads. Examples: mail servers or user home directories | Reduce primary storage space and cost; make storage energy efficient | iDedup [32], dDedup, SDFS, Ocarina, Permabit, ZFS, POD [33], HPDedup, GHOST [34]
Secondary storage systems | Auxiliary storage systems, infrequently accessed. Examples: storage archives and backup storage | Reduce secondary/backup storage space and cost | Sparse indexing [35], DDFS [36], HydraStor [37], RevDedup [38], SiLo [39], DEDE
Virtual machine systems | Virtual storage [40] and processing through VMs | Virtual machine storage efficiency; reduced VM migration time [30] | Liquid [41], HOPE [31], VMware ESX [42], VMflock, DEDE [43]
Network systems | Distributed storage and caching for wide area network (WAN) storage optimization | Reduce time to store and process network storage on a WAN | SmartRE [44], EndRE [45]
SSD-based multimedia storage systems | Fast SSD storage and access of multimedia content | Reduce storage space and cost of solid-state devices (SSDs); make storage energy efficient; reduce multimedia size and its processing cost | ViDedup [46], Nitro [47], UQLIPS [48], CAFTL [49]
Cloud storage systems | Cloud storage providing access to private, public and hybrid users | Improve cloud storage efficiency and cost; reduce bandwidth utilization | Cumulus [50], NED [51], SAR [52], CABdedupe [53]
• Target-side deduplication Deduplication is performed on the targeted storage device [4], commonly on backup servers. Dedicated hardware deduplication appliances handle all deduplication functionality [28]. This improves storage utilization at the additional cost of the dedicated appliance. There is no overhead on the data source, so it is used for large storage systems; it requires additional network resources and is further classified as inline or post-process, as discussed below.
4.2.3 Timing-based deduplication
Timing-based deduplication refers to the time when the deduplication algorithm is applied; it places a constraint on when deduplication operations, such as searching for duplicates, are performed [5]. These operations can be done synchronously (in-band) or asynchronously (out-of-band). Timing-based deduplication is further categorized into inline deduplication and post-process deduplication.
• Inline deduplication Deduplication is performed at the source side, before the data are written to disk [28], so there is no need for additional disk space to hold and protect the data to be backed up. This increases efficiency, as the data are passed and processed only once, but inline deduplication requires additional computation.
• Post-process deduplication Deduplication is done after the backup data are written temporarily to a storage system, i.e., to disk. It is also called offline deduplication [28]. It is usually faster than inline deduplication, as it helps in reducing the backup time.
4.2.4 Level-based deduplication
Data deduplication can be categorized as local-level and global-level deduplication, as discussed below.
• Local-level deduplication Local data deduplication supports deduplication at the local area network (LAN) level. It is practical only within a single VM and detects replicas on a single node. It has a negative impact on performance, as it cannot completely remove all the duplicates [29]. It performs slightly better in multi-node deployments, as it can exploit parallelism and indexing with an increased number of nodes while maintaining data availability.
• Global-level deduplication Global-level deduplication, also known as common file elimination, is performed in a distributed environment, i.e., across multiple datasets. It is also known as multi-node deduplication and uses a cluster of multiple nodes that work together as a unit. Data sent to one node in the cluster are compared with previous data sent to that appliance and with the data sent to any other node in the cluster. The main goal is to apply deduplication on distributed storage that uses multiple storage servers. It eliminates redundant disk accesses and removes all
possible replicas within or across VMs. However, it incurs additional hashing overhead [29].
4.2.5 Cloud-based data deduplication on storage systems
Data deduplication is widely employed in cloud storage [30,31] environments, in backup [31] and archive storage systems, as it helps in reducing storage space requirements and storage cost. Deduplication reduces the internet bandwidth used over the network, i.e., the amount of data uploaded to the cloud, as only one physical copy is stored instead of duplicate copies. It improves the speed of cloud backup [31], resulting in faster and more efficient data protection operations.
Deduplication to cloud storage can be set up using direct deduplication to the cloud, deduplication to the cloud on secondary storage copies, or deduplication using a cloud gateway. Deduplication can be employed in storage systems ranging from primary and secondary storage to virtual machines [30] and cloud storage systems; private, public and hybrid cloud storage systems all exploit the advantages of deduplication. Table 4 lists deduplication applications used in different storage systems, along with their significance, advantages, and tools and techniques. It will help researchers to identify deduplication techniques for primary and secondary storage systems, virtual machine systems, network systems, SSD-based multimedia storage and cloud storage systems.
4.3 Classification of data deduplication techniques based on type of data
Data deduplication techniques are data type specific. Text, image and video are the three main types of data, and all three have different storage formats and implicit characteristics. The format of the stored information is critical for finding matching information; as a result, data type is an important parameter for the development of deduplication techniques. It is among the primary focus domains these days, and researchers are dedicating their time and attention to applying deduplication efficiently to remove duplicates and redundancy. Text, image and video data are highly redundant [54] on the Internet, and with the advent of social networking platforms, such redundancy has grown and exerts additional load on cloud storage systems [54]. Finding duplicate data on such a large heterogeneous platform is a challenging task for researchers in industry and academia. Figure 7 presents deduplication techniques based on type of data. Text data deduplication is further classified as file level and sub-file level. Multimedia deduplication techniques are categorized as image and video. Image-based deduplication is further categorized as exact image deduplication and near-exact deduplication; video deduplication is also called frame-based deduplication.
4.3.1 Text deduplication
In text-based deduplication, a byte-by-byte comparison is made to get an exact match of the text, and duplicates are thereby identified. It works at the file level and sub-file level.
Fig. 7 Classification of deduplication techniques based on type of data
File-level deduplication, also known as single-instance storage, works at the file level by eliminating duplicate files. Sub-file-level, also known as block-level, deduplication can use fixed-size or variable-size blocks. The fixed-length approach examines blocks of data with a predetermined length, dividing files into fixed-size blocks [9]; variable-length deduplication divides the file into variable-length data blocks.
4.3.2 Multimedia deduplication
Multimedia deduplication is further categorized as image-based and video-based deduplication. Image deduplication techniques are based on image detection techniques, which are classified as exact image detection and near-exact image detection. Exact image deduplication targets exact duplicate images without considering image transformations. Near-exact images are modified or copied versions of an original image, produced by cropping, modification, scaling, adding noise, compression, rotation, etc. The techniques to find such duplicates differ in approach and accuracy.
Video deduplication techniques are frame-based deduplication techniques. A video is first converted into frames, and these frames are represented by visual features [55]. Based on the visual features or descriptors, hash values are generated from each keyframe of the video, and the visual features are used to detect duplicate frame sequences between a queried video and the video library [55].
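As a simplified illustration of frame-based matching (not a method from the reviewed papers; real systems use descriptors such as SIFT on keyframes), the following Python sketch computes a perceptual "average hash" per frame and compares frames by Hamming distance:

    def average_hash(frame):
        # `frame` is assumed to be a 2-D list of grayscale pixel values
        # (e.g., a downscaled 8x8 keyframe); one bit per pixel is set
        # depending on whether the pixel is brighter than the frame mean.
        pixels = [p for row in frame for p in row]
        mean = sum(pixels) / len(pixels)
        return sum(1 << i for i, p in enumerate(pixels) if p > mean)

    def hamming(h1, h2):
        # Number of differing bits between two hashes.
        return bin(h1 ^ h2).count("1")

    def near_duplicate(frame_a, frame_b, threshold=5):
        # Keyframes whose hashes differ in only a few bits are near-duplicates,
        # so a queried video can be matched against a library keyframe by keyframe.
        return hamming(average_hash(frame_a), average_hash(frame_b)) <= threshold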
4.4 Comparison of text-based deduplication and image-based deduplication
Text-based deduplication is compared with image-based deduplication based on partitioning methods, indexing/hashing techniques and lookup methods, storage format, matching technique and accuracy. Table 5 lists these parameters and their comparative description under text-based and image-based deduplication.
Table 5 Comparison between text-based deduplication and image-based deduplication

Parameters | Text-based deduplication | Image-based deduplication
Partition method | Data partitioning is done in the form of fixed- or variable-size chunks | Image preprocessing is done and features are extracted
Techniques used | Cryptographic hash functions are used to calculate hashes of the chunks | Features are extracted using feature extraction techniques such as SIFT, SURF and FAST; hashing techniques are then applied to the image features
Index lookup | Exact matching for index lookup to detect duplicate files or chunks in the storage systems | Approximate matching for index lookup to match similar images
Storage | Only one copy of a file is stored and duplicates are removed | Centroid selection is done; the centroid image is stored and near-exact image transformations are stored in the form of transformation matrices
Matching technique | Byte-by-byte comparison to get an exact match of text and find duplicates | The number of identical elements in the features extracted from images is compared to detect duplicate images
Accuracy | Complete matching is done | Exact or near-exact images are detected
As video is converted into frames to employ image-based deduplication techniques, both being based on visual content, the comparison of text-based and image-based deduplication in Table 5 covers video as well. Table 5 is adapted and enhanced from tabular data in [56].
4.5 Text-based deduplication techniques
Various works have been reported on text-based deduplication techniques for different storage systems, such as secondary storage systems and backups. Several authors have discussed text-based deduplication techniques, which are broadly categorized based on granularity, locality, indexing, security and cloud. Figure 8 presents a taxonomy of text-based deduplication techniques.
Fig. 8 Taxonomy of text-based deduplication techniques
(a) Taxonomy of text deduplication based on granularity Granularity is the method of dividing data into chunks and is a fundamental factor in removing duplicates. The chunks may be of fixed or variable size, and the file and sub-file are the two levels of granularity. In file-level granularity, deduplication is performed on the file; this is also known as object granularity. Sub-file-level granularity divides a file into fixed- or variable-size chunks or blocks [5]. Variable chunks at the sub-file level provide better matching efficiency, and finer chunk granularity improves the efficiency and precision of detecting duplicates [57], at the cost of additional computation and memory. Table 6 lists the taxonomy of text deduplication techniques based on granularity proposed in research articles, their descriptions and key findings.
(b) Taxonomy of text deduplication based on locality Storage systems exploit locality in caching strategies and on-disk layout [5]. Two types of locality are employed in storage: temporal locality and spatial locality. With temporal locality, chunks referenced recently at a particular memory location are expected to be referenced at the same location in the near future [5], so duplicate chunks appear several times within a short time. With spatial locality, chunks with nearby addresses are referenced close together in time: if a particular memory location is referenced at a particular time, it is very likely that nearby memory locations will be referenced in the near future. Table 7 depicts the taxonomy of text deduplication techniques based on locality, their descriptions and key findings suggested in research articles.
(c) Taxonomy of text deduplication based on indexing Indexing provides an efficient data structure for looking up duplicated data [5]. To search for exact duplicates, hashing is used to summarize content, which leads to the identification of signatures. Hash computation needs additional CPU resources and hash collision avoidance methods; collisions can be avoided by comparing the contents of two chunks with identical signatures. Rabin fingerprinting [68] is another technique used to compare the similarity of two chunks. Table 8 depicts the taxonomy of text deduplication techniques based on indexing, their descriptions and key findings.
(d) Taxonomy of text deduplication based on security Security is a key aspect of cloud storage systems. A deduplication technique has the authorization and authentication to access the complete storage system, so it requires a security framework to thwart attackers and prevent security breaches. In cloud storage systems, security is essential for data sharing, maintaining data confidentiality and integrity, preventing data leakage and offsite data storage facilities [75]. Table 9 presents the taxonomy of text deduplication techniques based on security, their descriptions and important findings; a sketch of convergent encryption, a recurring building block in these schemes, follows this list.
(e) Taxonomy of text deduplication based on cloud Data deduplication is widely employed in cloud-based storage systems to improve storage efficiency and storage cost. Network bandwidth, high throughput, computational overhead, deduplication efficiency and low energy consumption are the key challenges in applying data deduplication to cloud-based storage services. Table 10 shows different cloud-based text deduplication techniques suggested in different articles.
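Convergent encryption appears in several of the schemes in Tables 9 and 10 (e.g., Storer et al. [79]). The following minimal Python sketch shows the core idea under illustrative assumptions: it uses AES-GCM from the third-party cryptography package, with a key and nonce both derived deterministically from the chunk content:

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def convergent_encrypt(chunk):
        # The key is derived from the chunk content itself, so identical
        # plaintext chunks always produce identical ciphertexts and remain
        # deduplicable after encryption. The deterministic nonce is purely
        # for illustration of the convergent property.
        key = hashlib.sha256(chunk).digest()
        nonce = hashlib.sha256(b"nonce|" + chunk).digest()[:12]
        ciphertext = AESGCM(key).encrypt(nonce, chunk, None)
        locator = hashlib.sha256(ciphertext).hexdigest()   # index/lookup tag
        return locator, nonce, ciphertext

    loc1, _, ct1 = convergent_encrypt(b"same chunk")
    loc2, _, ct2 = convergent_encrypt(b"same chunk")
    assert ct1 == ct2 and loc1 == loc2   # the server can dedup the ciphertexts

Because two users storing the same chunk produce the same ciphertext, the server can deduplicate without ever seeing the plaintext, which is why convergent encryption is the usual starting point for secure deduplication schemes.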
Table 6 Taxonomy of text deduplication based on granularity

Author(s) | Technique | Description | Findings
Li et al. [38] | Reverse deduplication (RevDedup) | Hybrid inline and out-of-line deduplication technique; optimizes reads to the latest backups | Improves storage efficiency; high restore throughput for new backups
Lai et al. [51] | Near-exact defragmentation (NED) scheme | Identifies and rewrites fragments in cloud backup; defragmentation based on segment reference analysis | Improves restore performance on cloud backup
Wang et al. [57] | Sliding blocking algorithm with backtracking sub-blocks (SBBS) | Chunk-level duplication; the rsync rolling checksum algorithm is used for the weak hash check to match sliding blocks, the Adler-32 checksum for the weak hash check when backtracking sub-blocks, and the MD5 hash algorithm for the strong hash check | Improves duplicate detection precision; efficiently detects duplicate data in sub-blocks
Bobbarjung et al. [58] | Fingerdiff, supporting flexible variable chunks | Object partitioning technique; dynamically chooses a partitioning strategy; merges consecutive duplicate chunks into bigger ones | Improves storage and bandwidth utilization
Kruus et al. [59] | Content-defined chunking | Bimodal algorithms; chunk size varies dynamically; re-chunks the unique but duplicate-adjacent chunks | Achieves a reasonable duplicate elimination ratio
Lim et al. [60] | Duplicate-eliminated Flash File System (DeFFS) | Variable-sized blocks increase flexibility; non-overlapping duplicate chunking algorithm | Reduces duplicate writes of data and prolongs flash memory life cycles
Kaczmarczyk et al. [61] | Context-based rewriting (CBR) | Deduplication based on stream and disk context; rewrites highly fragmented duplicates | Improves restore performance and bandwidth
Wildani et al. [62] | Heuristically Arranged Non-Backup Inline Deduplication System (HANDS) | Scalable chunk-based deduplication; N-Neighborhood Partitioning (NNP) method groups correlated segments into the index cache | Reduces in-memory index storage; dynamically pre-fetches fingerprints from disk into the memory cache
Nam et al. [63] | CFL-SD (Cache-aware Chunk Fragmentation Level and Selective Deduplication) | Selectively deduplicates the input chunks based on chunk fragmentation level | Better read performance at a reasonable cost in write performance
Park et al. [64] | Lookahead read cache | Novel dedupe storage read cache design for a backup application; exploits future data chunk access patterns | Fast read performance
Xia et al. [65] | Deduplication-Aware Resemblance detection and Elimination scheme (DARE) | Exploits duplicate adjacency (DupAdj) for efficient resemblance detection in backups; improves the super-feature approach to enhance resemblance detection efficiency | Less computational and indexing overhead; high throughput
Fu et al. [66] | History-aware rewriting algorithm (HAR), cache-aware filter (CAF) | Reduces the fragmentation issue by exploiting cache knowledge; defragmentation by exploiting historical information of backup systems | Improves restore performance; reduces garbage collection overhead
Table 7 Taxonomy of text deduplication based on locality

Authors | Technique | Description | Findings
Srinivasan et al. [32] | iDedup | Spatial and temporal locality; selectively deduplicates sequences of disk blocks | Minimizes extra I/O seeks; reduces memory and CPU consumption
Lillibridge et al. [35] | Sparse indexing | Sampling and a sparse index to find similar segments; exploits the inherent locality within backup streams | Reduces in-memory dedup metadata size; better memory consumption; only a few seeks required
Zhu et al. [36] | Data Domain deduplication file system (DDFS) | Exploits an in-memory Bloom filter to identify new segments; stream-informed segment-oriented metadata pre-fetch and locality-preserved caching maintain the locality of fingerprints | High cache hit ratio by maintaining locality of fingerprints
Xia et al. [39] | Similarity-locality (SiLo)-based indexing | Exploits a similarity- and locality-based stateless algorithm to distribute and parallelize the data chunks to several backup nodes | High throughput; load balancing; low RAM overhead; improved index scalability
Fu et al. [67] | Scalable inline cluster-based deduplication | Exploits data locality and similarity using a hand-printing technique; based on a local stateful routing algorithm | High global deduplication effectiveness; high parallel deduplication throughput; low RAM usage in each node
Table 8 Taxonomy of text deduplication based on indexing

Authors | Technique | Description | Findings
Bhagwat et al. [69] | Extreme binning, distributed file backup system | Exploits file similarity; splits the chunk index into two tiers; file allocation through a stateless routing algorithm; uses a file representative index as the primary index | Reasonable throughput; scalable to multiple nodes; alleviates the disk bottleneck problem
Yang et al. [70] | ChunkFarm, post-process deduplication | Cluster of backup servers; index lookup in-batch (ILB) and index update in-batch (IUB) hash algorithms for fingerprint lookup exploit the memory cache | High write throughput and scalability
Min et al. [71] | Context-aware chunking | LRU-based index partitioning for efficient fingerprint lookup; Incremental Modulo-K (INC-K) for efficient chunking | Efficient lookup of fingerprints; reduces the computational overhead of signature generation
Guo and Efstathopoulos [72] | Single-node deduplication system | Progressive sampled indexing for fine-grained indexing; a grouped mark-and-sweep mechanism deals with the chunk garbage collection issue; minimizes disk seeks | Improves single-node scalability; optimizes throughput
Barreto et al. [73] | Hash challenges | Redundant chunks are identified by exchanging substantially less metadata; no additional complexity | Reduces communication overheads
Christen [74] | Indexing techniques for record linkage and deduplication | Survey of 12 variations of six indexing techniques using real and synthetic datasets; estimates good candidate record pairs; heuristic approaches split the database records into blocks | Scalability and performance were examined on various datasets
Table 9 Taxonomy of text deduplication based on security

Authors | Technique | Description | Findings
Harnik et al. [76] | Cross-user deduplication | Hybrid approach that sometimes turns off cross-user deduplication | Reduced risk of data leakage; preserves privacy of user data
Li et al. [77] | Dekey, ramp secret sharing scheme (RSSS) | Dekey uses RSSS to distribute CE keys to multiple servers for efficient CE key management | Incurs limited encoding/decoding overhead; preserves security and confidentiality of data
Liu et al. [78] | Proxy re-encryption | Enables different trust relations among cloud storage components; different users decrypt the shared deduplicated chunks and access the same data | Improves protection of user privacy
Storer et al. [79] | Convergent encryption | Encryption keys are generated from chunk data, so identical chunks encrypt to the same ciphertext. Each file is encrypted using a unique key. Asymmetric key pairs are used to manage the keys for security purposes | Space-efficient secure deduplication for single-server distributed storage systems
Li et al. [80] | Secret sharing scheme | Secret splitting technique protects data confidentiality; supports file- and block-level deduplication in distributed storage systems | Efficient deduplication with high reliability, data confidentiality and integrity, limited overhead
Vishalakshi et al. [81] | ClouDedup using convergent encryption | Block-level deduplication on encrypted files; a metadata manager (MM) takes care of the actual deduplication and key management operations | Secure and efficient cloud storage service
Bibawe and Baviscar [82] | Hybrid cloud architecture | Supports authorized duplicate check using a private cloud server; deduplication is done on encrypted data at the CSP | Low storage cost; flexibly supports access control on encrypted data; the proposed model is secure from insider and outsider attacks
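Several of the schemes in Table 9 [77,79,81] build on convergent encryption, where the encryption key is derived from the chunk contents itself, so identical plaintext chunks always yield identical ciphertexts and remain deduplicable. The following minimal Python sketch illustrates the idea; it is not the construction of any particular cited system, and the deterministic nonce derivation is an assumption made purely to keep the example self-contained (the `cryptography` package is assumed to be installed).

```python
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(chunk: bytes) -> tuple:
    """Encrypt a chunk under a key derived from its own contents."""
    key = hashlib.sha256(chunk).digest()        # convergent key = H(chunk)
    nonce = hashlib.sha256(key).digest()[:12]   # deterministic nonce (sketch only)
    ciphertext = AESGCM(key).encrypt(nonce, chunk, None)
    return key, ciphertext

# Identical chunks produce identical ciphertexts, so the storage server can
# deduplicate by hashing ciphertexts without ever learning the plaintext.
k1, c1 = convergent_encrypt(b"same chunk")
k2, c2 = convergent_encrypt(b"same chunk")
assert c1 == c2
```

Only a party that holds the plaintext can recompute the key and decrypt; schemes such as Dekey [77] then focus on managing these per-chunk convergent keys reliably.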
Table 10 Taxonomy of text deduplication based on cloud

Authors | Technique | Description | Findings
Zhao et al. [41] | Liquid | Scalable deduplication distributed file system for virtual machine images; fast VM deployment with peer-to-peer (P2P) data transfer | Avoids additional disk operations; good IO performance; low storage consumption
Lai et al. [51] | Near-exact defragmentation (NED) scheme | Identifies and rewrites fragments in cloud backup; defragmentation based on segment reference analysis | Improves restore performance on cloud backup
Mao et al. [52] | SAR, an SSD (solid-state drive)-assisted read scheme | Exploits SSDs by storing unique data chunks with high reference count; absorbs random reads to hard disks | Accelerates read performance significantly
Fu et al. [75] | Application-aware local-global (ALG) source deduplication | Intelligent chunking method to minimize computational overhead; hash functions based on application awareness; alleviates the disk lookup bottleneck | Optimizes lookup performance and storage cost; high deduplication efficiency and throughput
Wu et al. [83] | Deduplication-assisted primary storage systems in cloud-of-clouds (DAC) | Data reduction and data distribution approach to store data blocks across multiple cloud storage providers; a replication scheme is used to store highly referenced data blocks and an erasure scheme to store other data blocks | Improves storage efficiency and network bandwidth
Wang et al. [84] | I-sieve, based on iSCSI protocol | Designs novel index tables at block level; multi-level cache using solid-state drive | Reduces RAM consumption and optimizes lookup performance for small storage systems
Leesakul et al. [85] | Dynamic data deduplication | Whole-file hashing technique; maintains storage efficiency and quality of service for cloud storage systems | Handles the scalability issue; improves performance
Sun et al. [86] | Data deduplication over engineering-oriented cloud systems (DeDu) | Runs on commodity hardware; HDFS is used for the mass storage system and HBase for the fast indexing system | Fast indexing; efficient for data-intensive engineering applications
Neelaveni et al. [87] | File classifier-based linear indexing deduplication (FC-LID) | FC-LID uses linear hashing with representative group (LHRG) to develop an index system that overcomes the disk bottleneck problem | Less computational overhead and more efficient
Li et al. [88] | Convergent encryption in secure deduplication system | Secure deduplication in a cloud environment using encrypted keyword search; convergent keys are checked for integrity purposes | Privacy and security maintained in the cloud environment
Shin et al. [89] | Secure data deduplication for cloud storage | Survey of existing secure deduplication techniques based on cryptographic and security protocol solutions; the key design decisions for secure deduplication are data granularity, deduplication location, duplicate check boundary and system architecture | Identifies various security threats with regard to data confidentiality, integrity and availability in cloud storage for better efficiency
Pokale et al. [90] | DelayDedupe | Load balancing technique for file server and cloud storage server; chunks that are not frequently accessed are delayed for the deduplication process, and hot chunks are deduplicated first to reduce the response time | Effectively reduces the response time and balances the load of storage nodes, thus achieving better availability of data
Fig. 9 Evolution of text-based deduplication techniques
Fig. 10 Research contribution of five broad categories
Table 10 presents the taxonomy of text deduplication techniques based on cloud, along with their description and important findings.
Table 11 Parameters of deduplication techniques

Technique | Fixed-level chunking | Variable-level chunking | Spatial locality | Temporal locality | Full indexing | Partial indexing | Sparse indexing | Inline method | Offline method
Extreme binning [69] | Yes | No | No | No | No | No | Yes | Yes | No
DDFS [36] | No | Yes | Yes | No | Yes | No | No | Yes | No
ChunkStash [91] | No | Yes | Yes | No | Yes | No | No | Yes | No
Sparse indexing [35] | No | Yes | Yes | No | No | No | Yes | Yes | No
Guo and Efstathopoulos [72] | Yes | No | Yes | No | No | Yes | No | Yes | No
SiLo [39] | No | Yes | Yes | No | Yes | No | No | Yes | No
Σ-Dedupe [67] | No | Yes | Yes | No | Yes | No | No | Yes | No
Dong et al. [92] | No | Yes | Yes | No | Yes | No | No | Yes | No
ChunkFarm [70] | No | Yes | No | No | Yes | No | No | No | Yes
iDedup [32] | Yes | No | Yes | Yes | Yes | No | No | Yes | No
CBR [61] | Yes | No | Yes | Yes | Yes | No | No | Yes | No
Apart from the classification and discussion of text-based deduplication techniques, this paper presents the evolution of text-based deduplication techniques and helps researchers to find techniques in their sub-area. Figure 9 presents the evolution of major text-based deduplication techniques classified into five broad categories.
The articles on text-based deduplication are classified into five broad categories: granularity, locality, indexing, security and cloud. Figure 10 represents the percentage of research articles contributed in these five categories. Granularity has been a prime research focus of these articles as it has implicit performance enhancement capabilities. The growth of data in cloud has been the second major focus of researchers, followed by security, indexing and locality.
Exhaustive research has been done in defining the parameters of deduplication techniques according to the various taxonomies defined in Sect. 4, as listed in Table 11. Table 11 presents the parameters of deduplication techniques. The parameters of the selected techniques are fixed-level chunking, variable-level chunking, spatial locality, temporal locality, full indexing, partial indexing, sparse indexing, and the inline and offline methods based on timing. This table will help researchers to instantly compare deduplication techniques based on these parameters. The sketch below illustrates the variable-level (content-defined) chunking that most of these systems rely on.
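Variable-level chunking places chunk boundaries where a rolling hash of a small sliding window matches a predefined pattern, so an insertion early in a file only shifts nearby boundaries instead of all of them, which is the weakness of fixed-level chunking. The following Python sketch is a minimal illustration of this idea under assumed parameters (a 48-byte window and an average chunk size of about 8 KB); production systems typically use Rabin fingerprinting [68] rather than the simple polynomial rolling hash shown here.

```python
def cdc_chunks(data: bytes, window: int = 48, mask: int = (1 << 13) - 1,
               min_size: int = 2048, max_size: int = 65536) -> list:
    """Split data into content-defined chunks using a rolling hash."""
    B, M = 257, (1 << 61) - 1        # hash base and a Mersenne-prime modulus
    Bw = pow(B, window, M)           # factor for the byte leaving the window
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        h = (h * B + data[i]) % M    # slide the new byte into the hash
        if i - start >= window:      # drop the byte that left the window
            h = (h - data[i - window] * Bw) % M
        size = i - start + 1
        # cut when the hash matches the mask, respecting min/max chunk sizes
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # final partial chunk
    return chunks
```

Each chunk is then fingerprinted (e.g., with SHA-1 or SHA-256) and looked up in the fingerprint index; only chunks with unseen fingerprints are written to storage.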
4.6 Multimedia-based deduplication techniques
Recent developments and advancements in image and video retrieval systems have produced many methods to identify or extract information (features) from images or videos. Also, with the advent of the Internet, smartphones and social networking sites, large amounts of images and videos are being shared by users across the world.
Fig. 11 Image deduplication techniques
Table 12 Multimedia-based deduplication techniques

Authors | Technique | Description | Findings
Chen et al. [56] | Haar wavelet, B+ tree | Gray block features from images are used to construct a B+ tree index. Further, the finer-granular Haar wavelet algorithm is used to extract the edge information of the images for accuracy optimization. Centroid selection is performed and, finally, duplicate images are detected | Higher deduplication rate, but scalability is a major challenge
Ramaiah et al. [94] | Content-based image retrieval (CBIR), histogram refinement | Extracts features using histogram refinement to eliminate duplicate ration cards based on family photographs at district level; a K-means-based clustering algorithm is used to speed up the deduplication process | Deduplication process is dependent on human intervention
Zargar et al. [95] | Content-based image retrieval (CBIR), block truncation coding (BTC) | Photo-based deduplication process to find duplicate electricity bills in a large-scale database. Photographs are divided into different blocks using BTC; images of the same size are put in the same cluster | Better space utilization
Hua et al. [96] | Principal component analysis (PCA)–scale-invariant feature transform (SIFT), difference-of-Gaussians (DoG) | SmartEye, in-network coarse-grained deduplication. DoG and PCA-SIFT are used for feature extraction of images, and these features are hashed into a space-efficient Bloom filter. Locality-sensitive hashing (LSH) identifies similar images based on correlated features | Obtains energy savings and improves bandwidth efficiency
Li et al. [97] | Secure perceptual similarity deduplication scheme (SPSD) | To detect similarity between duplicate images, a perceptual hash algorithm is used to generate signatures of images | Achieves high deduplication and storage and bandwidth savings
Deshmukh et al. [98] | MapReduce technique | Fast duplicate image identification systems using MapReduce and Pearson correlation techniques | Reduces the time required to detect duplicate images; improves efficiency and reliability of the system
Fatema et al. [99] | Set partitioning in hierarchical trees (SPIHT) | The SPIHT algorithm compresses the image and uses partial encryption to protect an image from the cloud service provider (CSP). Unique hashes are generated based on SPIHT wavelet coefficients to perform secure image deduplication | No extra computational overhead for image encryption
Zheng et al. [100] | Scalable video coding (SVC) | Implemented an encrypted cloud media center that supports secure video deduplication and host-encrypted video coding techniques | Scalability of SVC is an issue
Yang et al. [101] | Local-difference-pattern (LDP), LSH (video) | Images or video frames are represented by a local-feature-based framework; LSH provides the indexing structure to detect near-exact images and videos | Less processing time and low storage overhead
Shen et al. [48] | Near-duplicate video clip (NDVC) detection | UQLIPS is a fast and robust NDVC detection system based on visual content using the bounded coordinate system (BCS) and frame symbolization (FRAS); a K-nearest neighbor algorithm is applied for similarity search | High accuracy and fast enough to support real-time search
Naturel et al. [102] | Fast shot-based method using discrete cosine transform (DCT) | Detects duplicate sequences from video shots in television broadcasts. Video is segmented into shots (frames) and a frame signature is computed for exact retrieval | Detects duplicate shots in very little time
Katiyar et al. [46] | ViDedup (2011) | An application-aware deduplication system for compressing videos that detects visual redundancy in content rather than at byte level | Scalability is a major challenge to handle
Li et al. [103] | Video deduplication with privacy preserving | Video frames are divided into fixed-size blocks, encrypted and uploaded to the cloud; identical blocks on the server side are deduplicated | Saves a lot of storage space
Flickr has around 6 billion images [93], and about 0.25 billion images are uploaded to Facebook daily. Most of the images uploaded are either modified, forwarded or copied [93], which leads to a large amount of duplicate or near-duplicate images on the web. These duplicate images waste costly space in the storage system [54]. An efficient technique is therefore needed to remove these duplicates from the storage system. Image deduplication is one such technique, which helps in the removal of duplicate images from the storage system. Different image deduplication techniques have been discussed based on their characteristics, such as image feature extraction, image hashing algorithms for indexing and distance measures to detect similarity between images or videos. Further, we explain the different feature extraction techniques used to extract image features and analyze exact or near-exact image duplicates. Figure 11 defines the technique applied to detect exact or near-exact image duplicates. The technique remains the same for exact and near-exact image deduplication; the only difference lies in the storage of the image transformation for near-exact images.
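To make the pipeline of Fig. 11 concrete, the following Python sketch computes a simple perceptual fingerprint (a difference hash) and compares two images by Hamming distance. It is an illustrative stand-in for the perceptual hashing used, e.g., in SPSD [97], not the algorithm of any cited system; the Pillow package and the file names are assumptions.

```python
from PIL import Image  # assumes the Pillow package is installed

def dhash(path: str, hash_size: int = 8) -> int:
    """Difference hash: a 64-bit perceptual fingerprint of an image."""
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    px = list(img.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = px[row * (hash_size + 1) + col]
            right = px[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | int(left > right)  # 1 if brightness drops
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bit positions between two fingerprints."""
    return bin(a ^ b).count("1")

# Resized, recompressed or slightly edited copies keep a similar fingerprint,
# so a small Hamming distance flags near-duplicates (the threshold is tunable).
near_dup = hamming(dhash("a.jpg"), dhash("b.jpg")) <= 5
```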
4.6.1 Multimedia-based deduplication techniques
Table 12 discusses the findings of existing image and video deduplication techniques and their key features, covering exact image deduplication, exact video deduplication, near-duplicate video detection and security-based image deduplication. Deduplication is applied on images to detect similar images and on videos to detect similar frames, using parameters such as feature extraction algorithms (e.g., SIFT, SURF, PCA-SIFT and BRISK), hashing algorithms to generate hashes of the features extracted from images, and distance measures to check the similarity between two images against some threshold; a sketch of such feature-based matching follows below.
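As an illustration of the feature-extraction and distance-measure pipeline just described, the sketch below matches two images with BRISK [120] binary descriptors and a Hamming-distance matcher using OpenCV. The ratio-test value and the decision threshold are assumptions for the example, not values from any cited paper, and the `opencv-python` package is assumed.

```python
import cv2  # assumes the opencv-python package

def brisk_similarity(path_a: str, path_b: str) -> float:
    """Fraction of BRISK descriptor matches that pass the ratio test."""
    brisk = cv2.BRISK_create()
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    _, des_a = brisk.detectAndCompute(img_a, None)
    _, des_b = brisk.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0.0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)  # Hamming for binary descriptors
    pairs = matcher.knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) / max(len(des_a), 1)

# Images whose similarity exceeds a tuned threshold (e.g., 0.25) would be
# treated as near-duplicates and deduplicated.
```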
Multimedia-based deduplication techniques face challenges in scalability, accuracy and performance. Feature extraction techniques like SIFT, SURF, PCA-SIFT and BRISK need further enhancements to achieve scalability, accuracy and performance. So there is a need for an autonomic, scalable and efficient multimedia deduplication technique.
The articles are classified into three broad categories: image-based deduplication, video-based deduplication and secure image deduplication techniques. Figure 12
Fig. 12 Research contribution of image (50%), video (42%) and secure image (8%) deduplication techniques
Table 13 Techniques to detect exact or near-exact images

Authors | Technique | Description | Findings
Velmurugan et al. [104] | Speeded up robust features (SURF) and K-dimensional (Kd) tree | Features are extracted using the SURF algorithm; a Kd-tree with the Best Bin First (BBF) algorithm is used for indexing and finding similarity between features | Improves average precision in detecting matched images
Li et al. [105] | SURF and dense descriptor applied to wide-baseline stereo (DAISY) | SURF features and the DAISY descriptor are combined for image matching | Improves matching accuracy on rotation invariance, but not good for large-scale image variation
Lei et al. [106] | Kd-trees | Proposed a novel indexing structure, i.e., a cluster of uniform randomized trees, for fast near-duplicate image detection | Improves detection efficiency and search space
Dong et al. [107] | Scale-invariant feature transform (SIFT) | Proposed an entropy-based filtering method for extracting SIFT features and a query expansion method to detect near duplicates | Scalable using Hadoop commodity servers
Ke et al. [108] | DoG, PCA-SIFT and LSH | Proposed a system using DoG and PCA-SIFT to detect near duplicates; LSH is used for efficient similarity search | High computational overhead when the database grows
Thomee et al. [109] | Content-based image detection | A comparative study on detecting duplicates using content-based duplicate image detection; to assess feasibility, large-scale images are taken from the web | Comparative analysis based on descriptor size, description time and matching time
Foo et al. [110] | SICO (Similar Image Collator), PCA-SIFT | SICO, a novel system using PCA-SIFT to extract features, LSH for indexing these features, and hash-based probabilistic counting to detect near-duplicate images | Effective and efficient method
Chum et al. [111] | SIFT and k-means clustering | Proposed a min-hash algorithm using a similarity measure to detect near-duplicate images | No extra computational cost and improves search efficiency
Li et al. [112] | SIFT, k-means clustering and LSH | SIFT features are extracted and clustered into several clusters using the k-means algorithm; LSH is applied for indexing, and histogram distance is used to detect near-duplicate images | Effective method to detect near duplicates
Seo et al. [113] | Radon transform, Hamming distance | Proposed an image fingerprinting method using the Radon transform and perceptual hashing for multimedia content; Hamming distance is used for fingerprint matching | Highly robust against affine transformation
Yu et al. [114] | SIFT technique | Proposed a SIFT-based algorithm and geometry isomorphic relationship to detect homology between images | Robust to image embedding
Gavrielides et al. [115] | Histogram technique, different quantization methods | Image fingerprinting method to extract robust and unique image descriptors by using color-based descriptors | Better performance of the system using color descriptors
Nikolaidis et al. [116] | R-trees, linear discriminant analysis (LDA) | Proposed a color-based descriptor method of image and video fingerprinting using R-trees and LDA; frame-based voting is also applied for video fingerprinting | Efficient for digital rights management of images and video
Nian [117] | Local-based binary representation (LBR), binary pattern, histogram | Proposed LBR, a compact image representation method using a binary vector and histogram for online near-duplicate image detection | Robust to image variations; low computational speed and better performance
Srinivasan et al. [118] | Fourier–Mellin transform (FMT) | Presented an image fingerprinting technique using the Fourier–Mellin transform to detect near-duplicate images | Fast, accurate and highly scalable technique
Yao et al. [119] | Contextual descriptors | Proposed a new contextual descriptor that encodes the spatial relations of the context and measures the contextual similarity of images | Better performance; effective and efficient method for large-scale image retrieval
Leutenegger et al. [120] | BRISK | Proposed a fast novel method to detect keypoints in a continuous scale space, with a binary-string descriptor to match similar images using Hamming distance | Quality matches in less time; applicable to hard real-time constrained tasks
Chen et al. [121] | SIFT | Proposed a fast image retrieval method that uses binarized SIFT features and hashing; Hamming distance is calculated for matching similar images | Low complexity and fast retrieval time
Huang et al. [122] | Image relational graph (IRG) | Proposed a novel method to detect near-duplicate images using local and global features; a PageRank graph model-based link analysis algorithm is used to analyze the contextual relationship between images | Effective method to detect near-duplicate images
Wang et al. [123] | MapReduce | Combines both local and global image descriptors for large-scale duplicate discovery; global features discover seed image clusters and local features merge the clusters to identify near duplicates | Effective method for large-scale image data in terms of high accuracy and recall
Zhao et al. [124] | Harris and SIFT | Proposed an improved Harris corner detector and SIFT using the nearest neighbor algorithm for feature matching | High matching accuracy and increased matching speed
Lu et al. [125] | Discrete cosine transform (DCT), Harris corner detector | Proposed mesh-based robust image hashing using a DCT-based hash extraction method; a hash database is constructed for error-resilient and fast searching | Improves image hashing resistance to geometric distortions
Lei et al. [126] | Radon transform, discrete Fourier transform (DFT) | Proposed a hashing method for image authentication based on the Radon transform and DFT | Robust to content-preserving operations
Hua et al. [127] | Features from accelerated segment test (FAST), DoG, PCA-SIFT, LSH, Bloom filter | A novel near-real-time method based on DoG and PCA-SIFT to detect image features; a Bloom filter is used for compact representation of these features, and LSH is used to detect similar images using the correlation property | Reduces the processing latency of parallel queries
Fig. 13 Evolution of image deduplication techniques
presents the percentage contribution of the research articles in these three categories. Exact image deduplication has been the prime focus of these articles, followed by video duplicate detection.
4.6.2 Techniques to detect exact or near-exact images
Image fingerprinting is a prerequisite for matching similar images. Table 13 surveys the image feature extraction and hashing techniques used in image processing to detect exact or near-exact images. These techniques detect duplicate images and their slight variations. This study brings out findings that will help researchers in developing an efficient deduplication technique for near-exact images and video. Table 13 depicts the techniques to detect exact and near-exact images, their description and their key findings.
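Many of the systems in Table 13 rely on LSH to avoid comparing every fingerprint against every other. The sketch below shows the classic banding idea for Hamming-space fingerprints (such as 64-bit perceptual hashes): fingerprints are bucketed by each of several bit bands, and only images sharing a bucket become candidate pairs. The band count is an assumed tuning parameter, and this illustrates the general LSH principle rather than the indexing of any specific cited system.

```python
from collections import defaultdict

def lsh_candidate_pairs(fingerprints: dict, bands: int = 4, bits: int = 64) -> set:
    """Bucket fingerprints by bit bands; pairs sharing any bucket are candidates."""
    band_bits = bits // bands
    buckets = defaultdict(list)
    for name, fp in fingerprints.items():
        for b in range(bands):
            # extract band b of the fingerprint as the bucket key
            key = (b, (fp >> (b * band_bits)) & ((1 << band_bits) - 1))
            buckets[key].append(name)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs  # verify candidates with a full Hamming-distance check

# With 4 bands, two 64-bit fingerprints that differ in fewer than 4 bits are
# guaranteed (by pigeonhole) to collide in at least one band.
```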
Apart from the classification of multimedia-based deduplication techniques and their discussion, this paper presents the evolution of multimedia-based deduplication techniques and helps researchers to find techniques in their sub-area. Figure 13 represents the evolution of major multimedia-based deduplication techniques classified into image
Table 14 Deduplication tools and technologies
Tool Description
StorReduce of Amazon Deduplication tool for big data from Amazon
StorSimple of Microsoft Azure Block-level deduplication for cloud storage system from Microsoft
Avamar of EMC Variable-length deduplication from EMC
Quantum Inline dedupe for backup
SSET of NetApp Space Savings Estimation Tool (SSET) from NetApp
Data Domain of EMC Inline deduplication tool from EMC
WinPure Standalone software to clean up databases
Sepaton Sepaton DeltaStor deduplication software
Netrics Netrics helps to clean up projects and duplicate records
Revinetix Revinetix works at the file level on backup
Exagrid Deduplication tool for backup
CommVault Deduplication tool for backup for heterogeneous storage infrastructure
deduplication, exact or near-exact image detection and video deduplication techniques. Video deduplication techniques are based on exact video deduplication and near-exact video duplicate detection. Figure 13 shows the major contributions made in different years by various researchers and the techniques they followed.
Deduplication in video and images came into existence in 2011 [46,94]. It has been identified that video deduplication techniques are applied on video frames: the frames are treated as images, and image deduplication techniques are applied to the video data. The article [48] published in 2007 detects near-duplicate video frames; it is simply a near-duplicate detection technique. Near-duplicate video detection research articles and video deduplication research articles are cited in Fig. 13. A frame-level sketch of this reuse of image techniques on video follows below.
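Since video deduplication commonly reduces to fingerprinting sampled frames, the following Python sketch samples every step-th frame of a video and computes the same difference-hash fingerprint used above for still images. The sampling step is an assumed parameter and the sketch does not reproduce any specific cited system (the `opencv-python` package is assumed).

```python
import cv2  # assumes the opencv-python package

def frame_signatures(video_path: str, step: int = 30) -> list:
    """Fingerprint every `step`-th frame so image-level deduplication
    techniques can be reused on video data."""
    sigs, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            small = cv2.resize(gray, (9, 8))  # 9x8 grid for a 64-bit dHash
            bits = 0
            for r in range(8):
                for c in range(8):
                    bits = (bits << 1) | int(small[r, c] > small[r, c + 1])
            sigs.append(bits)
        idx += 1
    cap.release()
    return sigs

# Two clips sharing many frame signatures (under a small Hamming distance)
# are candidate duplicates, mirroring the shot-signature idea of [102].
```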
This research article discussed various taxonomies of deduplication techniques. Apart from this, the paper has identified deduplication tools provided by various storage enterprises. Table 14 lists these deduplication tools with a brief description of each.
5 Discussion
This survey paper presents recent research work related to deduplication techniques and is supplementary to the previous surveys. Deduplication techniques have been explored in detail and presented based on various taxonomies. An effort has been made to address the research issues that still remain unresolved in deduplication techniques for large distributed storage systems.
In the recent past, many deduplication techniques for storage systems have gained attention and new techniques have evolved. Work on optimizing chunk size or granularity, efficient detection of duplicate data in a cloud storage system, privacy, security, and performance enhancement through indexing in deduplication techniques is expected to remain the focus area in coming years. Work on the scalability of distributed deduplication, fragmentation, disk bottleneck, I/O latency and performance enhancement
in duplicate detection of multimedia data constitutes an emerging field of study. A scalable, robust, efficient and distributed deduplication technique is required for a cloud storage system.
This survey presents a review of 128 research articles in a systematic and categorized manner. It also presents the latest research work on text-based and multimedia-based deduplication techniques. The challenges and future directions for next-generation efficient deduplication techniques for big data in cloud have also been discussed for researchers in industry and academia.
5.1 Open challenges and future research directions
Based on the various issues discussed in the existing literature, some of the challenges in deduplication techniques have been identified and are discussed below.
(i) Exact or near-exact image deduplication for large-scale distributed storage systems Most social media images are either slightly modified or exact copies of an original image. The large number of duplicate or near-duplicate images requires huge storage and impacts the performance and cost of a storage system. Exact or near-exact image duplicate detection in a large distributed storage system is an open research challenge, as such detection requires additional CPU, memory and bandwidth. Therefore, an efficient, real-time deduplication technique for exact and near-exact images is a major challenge in distributed image storage systems.
(ii) Storing the transformation of near-exact images in a large-scale storage system Near-exact images are modified versions of an original image; therefore, it is not advisable to store the near-exact images themselves. Only the transformation of a near-exact image needs to be stored, and the image is reconstructed on online application calls. At present, storing the transformations of near-exact images in a large distributed storage system is a huge challenge in itself.
(iii) Performance issue for distributed storage deduplication In distributed deduplication techniques, deduplication is done on a distributed storage system using either inline or offline approaches. The lookup for duplicate chunks is a resource-intensive task in a large distributed storage system; these lookups increase the write latency of chunks [32]. Applying the deduplication technique at write time reduces the write performance in terms of chunks per second (the first sketch after this list illustrates this write path on a single node). Offline deduplication has been widely applied, running as a background service, but it requires extra temporary storage and increases the I/O bandwidth [5]. Inline distributed deduplication and the optimal use of resources in offline deduplication remain open challenges, and the problem is even more complex in a large scalable storage system.
(iv) Optimization technique for chunk size A file is divided into small chunks whose size varies from 4 KB to 256 MB. These chunk details are indexed and cached for better performance. Smaller chunks save space, yet they generate large hash tables. On the other hand, choosing a large chunk size decreases the number of hash entries but increases the wastage of storage and takes a long time to compare. So, it
makes the process of deduplication even more resource intensive. Variable-size chunking can also degrade performance, as it generates large index structures. The choice of chunk size or granularity is an open problem; an efficient method is needed to calculate the optimum chunk size, which would improve the overall performance of the system.
(v) Disk bottleneck problem The data deduplication technique is mostly applied on disk-based [2] secondary storage systems. Although storage systems are expanding, disk I/O operations [2] still have performance issues [67]. To increase the data streaming rate, a file is divided into chunks that are distributed across multiple nodes in a distributed environment; storing chunks on distributed nodes helps to overcome the disk bottleneck problem. Novel data distribution techniques are evolving that will trigger the requirement of new deduplication techniques.
(vi) Throughput and latency A file is broken into smaller chunks, and the metadata of each chunk are indexed and kept in memory for the best possible performance. Every new incoming chunk is checked against a large list of chunk indices, so the number of disk I/O operations is large. Thus, fingerprint indexing has become a bottleneck in efficient deduplication systems, which has a negative impact on throughput [128] and increases the latency of write operations (the second sketch after this list shows the Bloom-filter summary commonly used to cut these index lookups). To meet the requirements of increasing dataset sizes and deduplication scalability, a system should use parallel and multiple streams on distributed storage nodes, which will increase the system throughput. So, work can be done to optimize deduplication throughput and latency in distributed storage systems.
(vii) Fragmentation issue Data deduplication causes fragmentation on disk, which reduces the performance of read operations [32,61]. It increases the lookup time for sequential reads of the same data, and extra disk I/O is required to access on-disk metadata. Deduplication thus results in data fragmentation that needs to be addressed carefully. Fragmentation is also a major issue for long retention periods and can reduce the locality of reference. Better handling of the fragmentation inherent in disk-based data deduplication is required.
(viii) Scalability and performance of deduplication The main challenges in deduplication techniques are scalability and performance. Each chunk is compared with every other chunk in a large-scale storage system, and if a match is found, the duplicate is deleted as per the replication policy. However, complete matching of chunks becomes difficult as the system grows. A centralized index has its own issues and is a bottleneck for throughput, scalability and availability (the third sketch after this list shows a stateless rule for partitioning the index across nodes). As storage requirements grow rapidly, applying efficient distributed deduplication techniques on large-scale distributed storage systems poses a great challenge.
(ix) Privacy and security In large distributed storage systems, both data and metadata are distributed to achieve scalability and availability. In such a distributed system, a security framework is required before distributed deduplication techniques can be employed, to guard against theft and attacks and to adhere to the regulatory compliances for privacy and security.
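For challenge (iii), the following toy Python sketch shows the single-node core of an inline write path: each chunk is fingerprinted before it is written, and duplicates are replaced by logical pointers. It is a deliberately simplified illustration (an in-memory index, no persistence or concurrency), not the design of any cited system.

```python
import hashlib

class InlineDedupStore:
    """Toy inline deduplication: fingerprint on the write path, store once."""

    def __init__(self):
        self.index = {}    # fingerprint -> chunk id (logical pointer target)
        self.chunks = []   # simulated physical chunk store

    def write(self, chunk: bytes) -> int:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp in self.index:
            return self.index[fp]      # duplicate: only a pointer is kept
        self.chunks.append(chunk)      # unique: pay the write cost inline
        self.index[fp] = len(self.chunks) - 1
        return self.index[fp]

store = InlineDedupStore()
assert store.write(b"block A") == store.write(b"block A")  # second write is free
```

The index lookup on every write is exactly where the write latency discussed in [32] comes from; offline schemes defer it at the price of temporary storage and extra I/O bandwidth.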
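For challenge (vi), a compact in-memory summary can answer "definitely new" without touching the on-disk index, in the spirit of the summary vector of DDFS [36]. The sketch below is a generic Bloom filter with assumed sizing parameters, not the exact structure used in [36].

```python
import hashlib

class BloomFilter:
    """Probabilistic set: 'no' answers are exact, 'yes' answers may be false."""

    def __init__(self, size_bits: int = 1 << 20, hashes: int = 4):
        self.size, self.k = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):  # derive k bit positions per item
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# On the write path: if might_contain(fingerprint) is False, the chunk is
# certainly new and the expensive on-disk index lookup is skipped entirely.
```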
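For challenge (viii), one way to avoid a centralized index is a stateless routing rule that every client computes identically, in the spirit of the stateless routing used by Extreme Binning [69]. The rendezvous (highest-random-weight) hashing shown below is one such rule, chosen here for illustration rather than taken from any cited system; the node names and fingerprint are hypothetical.

```python
import hashlib

def route_fingerprint(fingerprint: str, nodes: list) -> str:
    """Map a fingerprint to one node; each node owns a disjoint index slice."""
    def weight(node: str) -> bytes:
        return hashlib.sha256(f"{node}:{fingerprint}".encode()).digest()
    return max(nodes, key=weight)  # the node with the highest hash weight wins

nodes = ["node-1", "node-2", "node-3"]           # hypothetical backup nodes
owner = route_fingerprint("ab12cd34", nodes)     # hypothetical fingerprint
# Adding or removing a node remaps only about 1/n of the fingerprints, so the
# distributed index can grow without a central coordinator.
```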
6 Conclusion
Data deduplication and cloud computing have emerged as the hottest trends in today's scenario. The growing demand in cloud computing for efficient storage of digital data in large storage systems has led to a rise in the demand for data deduplication. The huge data load on storage systems is sharpening the focus on the development of novel techniques to remove duplicate data. With the evolution of data deduplication and its various techniques, cloud computing has the potential to remove unnecessary duplicates. Deduplication is an important technique for reducing storage cost, bandwidth and energy consumption.
This research article presented a methodical survey on data deduplication techniques that comprehensively analyzes and reviews them. We have also studied a few recent surveys encompassing similar topics. Existing surveys in this field have focused on deduplication techniques based on storage types, whereas the primary focus of this survey is to explore text- and multimedia-based deduplication techniques. Based on the analysis, deduplication poses many challenges that are still to be addressed in text- and multimedia-based deduplication.
For improving the storage in cloud storage systems, the following research challenges need to be solved:
1. There is a need to develop an efficient inline data deduplication technique with optimum use of resources in a cloud storage system.
2. To make data deduplication cost effective and energy efficient in terms of space, there is a need to develop an efficient deduplication technique with optimal use of CPU, memory and network resources.
3. To resolve the issue of fingerprint indexing in memory, the system should parallelize backup streams to multiple nodes for efficient deduplication.
4. Disk I/O bottleneck is one of the important issues in storage systems that affect performance. To resolve this issue, distributed multi-node deduplication techniques need to be evolved.
5. Security and privacy are also very necessary in the cloud environment to meet compliance requirements.
Based on the literature survey of deduplication techniques, it has been observed that distributed deduplication on cloud storage systems is a promising field of research. The gaps analyzed above necessitate devising techniques that reduce storage space, bandwidth, the number of disks used, energy consumption costs and heat emissions. In future, work on a scalable, robust, efficient and distributed deduplication technique for cloud storage systems will remain in focus.
Acknowledgements This research was supported by the Department of Science and Technology, Government of India under the WOS (Women Scientists Scheme) sponsored research project entitled “Distributed Data Deduplication Technique for efficient Cloud Based Storage System” under File No: SR/WOS-A/ET-119/2016.
References
1. Gu M, Li X, Cao Y (2014) Optical storage arrays: a perspective for future big data storage. Light Sci Appl 3(5):e177. https://doi.org/10.1038/lsa.2014.58
2. Tian Y, Khan SM, Jiménez DA, Loh GH (2014) Last-level cache deduplication. In: Proceedings of the 28th ACM International Conference on Supercomputing, pp 53–62. https://doi.org/10.1145/2597652.2597655
3. Hovhannisyan H, Qi W, Lu K, Yang R, Wang J (2016) Whispers in the cloud storage: a novel cross-user deduplication-based covert channel design. Peer-to-Peer Networking and Applications, pp 1–10. https://doi.org/10.1007/s12083-016-0483-y
4. Mandagere N, Zhou P, Smith MA, Uttamchandani S (2008) Demystifying data deduplication. In: Proceedings of the ACM/IFIP/USENIX Middleware'08 Conference Companion, pp 12–17. https://doi.org/10.1145/1462735.1462739
5. Paulo J, Pereira J (2014) A survey and classification of storage deduplication systems. ACM Comput Surv (CSUR) 47(1):1–30. https://doi.org/10.1145/2611778
6. Mao B, Jiang H, Wu S, Fu Y, Tian L (2014) Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans Storage (TOS) 10(2). https://doi.org/10.1145/2512348
7. Di Pietro R, Sorniotti A (2016) Proof of ownership for deduplication systems: a secure, scalable, and efficient solution. Comput Commun 82:71–82. https://doi.org/10.1016/j.comcom.2016.01.011
8. Wang J, Chen X (2016) Efficient and secure storage for outsourced data: a survey. Data Sci Eng 1(3):178–188. https://doi.org/10.1007/s41019-016-0018-9
9. Chen CP, Zhang CY (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347. https://doi.org/10.1016/j.ins.2014.01.015
10. Venish A, Sankar KS (2015) Framework of data deduplication: a survey. Indian J Sci Technol. https://doi.org/10.17485/ijst/2015/v8i26/80754
11. Xia W, Jiang H, Feng D, Douglis F, Shilane P, Hua Y, Fu M, Zhang Y, Zhou Y (2016) A comprehensive study of the past, present and future of data deduplication. Proc IEEE 104(9):1681–1710. https://doi.org/10.1109/JPROC.2016.2571298
12. Maan AJ (2013) Analysis and comparison of algorithms for lossless data compression. Int J Inf Comput Technol 3(3):139–46
13. Xia W, Jiang H, Feng D, Tian L, Fu M, Zhou Y (2014) Ddelta: a deduplication-inspired fast delta compression approach. Perform Eval 79:258–272. https://doi.org/10.1016/j.peva.2014.07.016
14. Shanmugasundaram S, Lourdusamy R (2011) A comparative study of text compression algorithms. Int J Wisdom Based Comput 1(3):68–76
15. Bhadade US, Trivedi AI (2011) Lossless text compression using dictionaries. Int J Comput Appl 13(8):27–34
16. Witten IH, Neal RM, Cleary JG (1987) Arithmetic coding for data compression. Commun ACM 30(6):520–40. https://doi.org/10.1145/214762.214771
17. Brereton P, Kitchenham BA, Budgen D, Turner M, Khalil M (2007) Lessons from applying the systematic literature review process within the software engineering domain. J Syst Softw 80(4):571–83. https://doi.org/10.1016/j.jss.2006.07.009
18. Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S (2009) Systematic literature reviews in software engineering—a systematic literature review. Inf Softw Technol 51(1):7–15. https://doi.org/10.1016/j.infsof.2008.09.009
19. Gantz J, Reinsel D (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. In: IDC iView: IDC Analyze the Future, pp 1–6. http://www.emc.com/collateral/analyst-reports/idc-digital-universe-united-states.pdf
20. Reed DA, Dongarra J (2015) Exascale computing and big data. Commun ACM 58(7):56–68. https://doi.org/10.1145/2699414
21. Barreto J, Ferreira P (2009) Efficient locally trackable deduplication in replicated systems. In: Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware. Springer-Verlag New York, Inc., USA, p 6
22. Meyer DT, Bolosky WJ (2012) A study of practical deduplication. ACM Trans Storage (TOS). https://doi.org/10.1145/2078861.2078864
23. Borges EN, de Carvalho MG, Galante R, Gonçalves MA, Laender AH (2011) An unsupervised heuristic-based approach for bibliographic metadata deduplication. Inf Process Manag 47(5):706–718. https://doi.org/10.1016/j.ipm.2011.01.009
24. Alvarez C (2011) NetApp deduplication for FAS and V-Series deployment and implementation guide. Technical Report TR-3505
25. Xu J, Zhang W, Zhang Z, Wang T, Huang T (2016) Clustering-based acceleration for virtual machine image deduplication in the cloud environment. J Syst Softw 121:144–156. https://doi.org/10.1016/j.jss.2016.02.021
26. Paulo J, Pereira J (2014) Distributed exact deduplication for primary storage infrastructures. In: Magoutis K, Pietzuch P (eds) Distributed applications and interoperable systems, DAIS 2014. LNCS, vol 8460. Springer, Heidelberg. https://doi.org/10.1007/978-3-662-43352-2_5
27. Banu AF, Chandrasekar C (2012) A survey on deduplication methods. Int J Comput Trends Technol 3(3):364–368
28. He Q, Li Z, Zhang X (2010) Data deduplication techniques. IEEE Int Conf Future Inf Technol Manag Eng (FITME) 1:430–433. https://doi.org/10.1109/FITME.2010.5656539
29. Zhou R, Liu M, Li T (2013) Characterizing the efficiency of data deduplication for big data storage management. In: IEEE International Symposium on Workload Characterization (IISWC), pp 98–108. https://doi.org/10.1109/IISWC.2013.6704674
30. Ahmad RW, Gani A, Ab. Hamid SH et al (2015) Virtual machine migration in cloud data centers: a review, taxonomy, and open research issues. J Supercomput 71(7):2473–2515. https://doi.org/10.1007/s11227-015-1400-5
31. Hu Y, Li C, Liu L, Li T (2016) Hope: enabling efficient service orchestration in software-defined data centers. In: Proceedings of the 2016 International Conference on Supercomputing, p 10. ACM. https://doi.org/10.1145/2925426.2926257
32. Srinivasan K, Bisson T, Goodson GR, Voruganti K (2012) iDedup: latency-aware, inline data deduplication for primary storage. In: Proceedings of the USENIX Conference on File and Storage Technologies, vol 12, pp 24–24
33. Mao B, Jiang H, Wu S, Tian L (2016) Leveraging data deduplication to improve the performance of primary storage systems in the cloud. IEEE Trans Comput 65(6):1775–1788. https://doi.org/10.1109/TC.2015.2455979
34. Kim C, Park KW, Park KH (2012) GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system. In: Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores. ACM, pp 17–26. https://doi.org/10.1145/2141702.2141705
35. Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezis G, Camble P (2009) Sparse indexing: large scale, inline deduplication using sampling and locality. In: Proceedings of the 7th USENIX Conference on File and Storage Technologies, vol 9, pp 111–123
36. Zhu B, Li K, Patterson RH (2008) Avoiding the disk bottleneck in the data domain deduplication file system. Proc USENIX Conf File Storage Technol 8:1–14
37. Dubnicki C, Gryz L, Heldt L, Kaczmarczyk M, Kilian W, Strzelczak P, Szczepkowski J, Ungureanu C, Welnicki M (2009) HYDRAstor: a scalable secondary storage. In: 7th USENIX Conference on File and Storage Technologies (FAST), vol 9, pp 197–210
38. Li YK, Xu M, Ng CH, Lee PP (2015) Efficient hybrid inline and out-of-line deduplication for backup storage. ACM Trans Storage (TOS) 11(1):1–21. https://doi.org/10.1145/2641572
39. Xia W, Jiang H, Feng D, Hua Y (2015) Similarity and locality based indexing for high performance data deduplication. IEEE Trans Comput 64(4):1162–1176. https://doi.org/10.1109/TC.2014.2308181
40. Ng CH, Ma M, Wong TY, Lee PP, Lui J (2011) Live deduplication storage of virtual machine images in an open-source cloud. In: Proceedings of the 12th International Middleware Conference. International Federation for Information Processing, pp 80–99
41. Zhao X, Zhang Y, Wu Y, Chen K, Jiang J, Li K (2013) Liquid: a scalable deduplication file system for virtual machine images. IEEE Trans Parallel Distrib Syst 25(5):1257–1266. https://doi.org/10.1109/TPDS.2013.173
42. Waldspurger CA (2002) Memory resource management in VMware ESX server. In: ACM Proceedings of the 5th Symposium on Operating Systems Design and Implementation, SIGOPS, vol 36(SI), pp 181–194. https://doi.org/10.1145/844128.844146
43. Clements AT, Ahmad I, Vilayannur M, Li J (2009) Decentralized deduplication in SAN cluster file systems. In: USENIX Annual Technical Conference, pp 101–114
44. Anand A, Sekar V, Akella A (2009) SmartRE: an architecture for coordinated network-wide redundancy elimination. ACM SIGCOMM Comput Commun Rev 39(4):87–98. https://doi.org/10.1145/1594977.1592580
45. Agarwal B, Akella A, Anand A, Balachandran A, Chitnis P, Muthukrishnan C, Ramjee R, Varghese G (2010) EndRE: an end-system redundancy elimination service for enterprises. In: NSDI, pp 419–432
46. Katiyar A, Weissman JB (2011) ViDeDup: an application-aware framework for video de-duplication. In: Proceedings of the 3rd USENIX Conference on Hot Topics in Storage and File Systems (HotStorage), pp 1–5
47. Li C, Shilane P, Douglis F, Shim H, Smaldone S, Wallace G (2014) Nitro: a capacity-optimized SSD cache for primary storage. In: USENIX Annual Technical Conference, pp 501–512
48. Shen HT, Zhou X, Huang Z, Shao J, Zhou X (2007) UQLIPS: a real-time near-duplicate video clip detection system. In: Proceedings of the 33rd International Conference on Very Large Data Bases. VLDB Endowment, pp 1374–1377
49. Chen F, Luo T, Zhang X (2011) CAFTL: a content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In: Proceedings of 9th USENIX Conference on File Storage Technology (FAST), vol 11, pp 77–90
50. Vrable M, Savage S, Voelker GM (2009) Cumulus: filesystem backup to the cloud. ACM Trans Storage (TOS) 5(4):1–14. https://doi.org/10.1145/1629080.1629084
51. Lai R, Hua Y, Feng D, Xia W, Fu M, Yang Y (2014) A near-exact defragmentation scheme to improve restore performance for cloud backup systems. In: Sun X et al (eds) Algorithms and architectures for parallel processing. LNCS, vol 8630. Springer, Cham, pp 457–471. https://doi.org/10.1007/978-3-319-11197-1_35
52. Mao B, Jiang H, Wu S, Fu Y, Tian L (2014) Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans Storage. https://doi.org/10.1145/2512348
53. Tan Y, Jiang H, Feng D, Tian L, Yan Z (2011) CABdedupe: a causality-based deduplication performance booster for cloud backup services. In: Parallel and Distributed Processing Symposium (IPDPS), IEEE International, pp 1266–1277
54. Yusof NBT, Ismail A, Majid NAA (2016) Deduplication image middleware detection comparison in standalone cloud database. Int J Adv Comput Sci Technol (IJACST) 5(3):12–18
55. Nie Z, Hua Y, Feng D, Li Q, Sun Y (2014) Efficient storage support for real-time near-duplicate video retrieval. In: Sun X et al (eds) Algorithms and architectures for parallel processing, ICA3PP. LNCS, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_24
56. Chen M, Wang S, Tian L (2013) A high-precision duplicate image deduplication approach. J Comput 8(11):2768–2775. https://doi.org/10.4304/jcp.8.11.2768-2775
57. Wang G, Chen S, Lin M, Liu X (2014) SBBS: a sliding blocking algorithm with backtracking sub-blocks for duplicate data detection. Expert Syst Appl 41(5):2415–2423. https://doi.org/10.1016/j.eswa.2013.09.040
58. Bobbarjung DR, Jagannathan S, Dubnicki C (2006) Improving duplicate elimination in storage systems. ACM Trans Storage (TOS) 2(4):424–48. https://doi.org/10.1145/1210596.1210599
59. Kruus E, Ungureanu C, Dubnicki C (2010) Bimodal content defined chunking for backup streams. In: Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pp 239–252
60. Lim SH (2011) DeFFS: duplication-eliminated flash file system. Comput Electr Eng 37(6):1122–1136. https://doi.org/10.1016/j.compeleceng.2011.06.007
61. Kaczmarczyk M, Barczynski M, Kilian W, Dubnicki C (2012) Reducing impact of data fragmentation caused by in-line deduplication. In: Proceedings of the 5th Annual International Systems and Storage Conference. ACM, pp 1–12. https://doi.org/10.1145/2367589.2367600
62. Wildani A, Miller EL, Rodeh O (2013) HANDS: a heuristically arranged non-backup in-line deduplication system. In: IEEE 29th International Conference on Data Engineering (ICDE), pp 446–457. https://doi.org/10.1109/ICDE.2013.6544846
63. Nam YJ, Park D, Du DH (2012) Assuring demanded read performance of data deduplication storage with backup datasets. In: IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp 201–208. https://doi.org/10.1109/MASCOTS.2012.32
64. Park D, Fan Z, Nam YJ, Du DH (2017) A lookahead read cache: improving read performance for deduplication backup storage. J Comput Sci Technol 32(1):26–40. https://doi.org/10.1007/s11390-017-1680-8
65. Xia W, Jiang H, Feng D, Tian L (2016) DARE: a deduplication-aware resemblance detection and elimination scheme for data reduction with low overheads. IEEE Trans Comput 65(6):1692–1705. https://doi.org/10.1109/TC.2015.2456015
66. Fu M, Feng D, Hua Y, He X, Chen Z, Liu J, Xia W, Huang F, Liu Q (2016) Reducing fragmentation for in-line deduplication backup storage via exploiting backup history and cache knowledge. IEEE Trans Parallel Distrib Syst 27(3):855–868. https://doi.org/10.1109/TPDS.2015.2410781
67. Fu Y, Jiang H, Xiao N (2012) A scalable inline cluster deduplication framework for big data protection. In: Narasimhan P, Triantafillou P (eds) Middleware. IFIP International Federation for Information Processing, LNCS, vol 7662. Springer, Berlin, pp 354–373
68. Rabin MO (1981) Fingerprinting by random polynomials. Harvard Aiken Computational Laboratory TR-15-81. http://cr.yp.to/bib/entries.html
69. Bhagwat D, Eshghi K, Long DD, Lillibridge M (2009) Extreme binning: scalable, parallel deduplication for chunk-based file backup. In: Proceedings of IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Computer Society, Washington, DC, vol 9, pp 1–9. https://doi.org/10.1109/MASCOT.2009.5366623
70. Yang TM, Feng D, Niu ZY, Wan YP (2010) Scalable high performance de-duplication backup via hash join. J Zhejiang Univ Sci C 11(5):315–327. https://doi.org/10.1631/jzus.C0910445
71. Min J, Yoon D, Won Y (2011) Efficient deduplication techniques for modern backup operation. IEEE Trans Comput 60(6):824–840. https://doi.org/10.1109/TC.2010.263
72. Guo F, Efstathopoulos P (2011) Building a high-performance deduplication system. In: Proceedings of USENIX Annual Technical Conference
73. Barreto J, Veiga L, Ferreira P (2012) Hash challenges: stretching the limits of compare-by-hash in distributed data deduplication. Inf Process Lett 112(10):380–385. https://doi.org/10.1016/j.ipl.2012.01.012
74. Christen P (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555. https://doi.org/10.1109/TKDE.2011.127
75. Fu Y, Jiang H, Xiao N, Tian L, Liu F, Xu L (2014) Application-aware local-global source deduplication for cloud backup services of personal storage. IEEE Trans Parallel Distrib Syst 25(5):1155–1165. https://doi.org/10.1109/TPDS.2013.167
76. Harnik D, Pinkas B, Shulman-Peleg A (2010) Side channels in cloud services: deduplication in cloud storage. IEEE Secur Priv 8(6):40–47. https://doi.org/10.1109/MSP.2010.187
77. Li J, Chen X, Li M, Li J, Lee PP, Lou W (2014) Secure deduplication with efficient and reliable convergent key management. IEEE Trans Parallel Distrib Syst 25(6):1615–1625. https://doi.org/10.1109/TPDS.2013.284
78. Liu C, Liu X, Wan L (2013) Policy-based de-duplication in secure cloud storage. In: Yuan Y, Wu X, Lu Y (eds) Trustworthy computing and services. ISCTCS Communications in Computer and Information Science, vol 320. Springer, Berlin, pp 250–262. https://doi.org/10.1007/978-3-642-35795-4_32
79. Storer MW, Greenan K, Long DD, Miller EL (2008) Secure data deduplication. In: Proceedings of the 4th ACM International Workshop on Storage Security and Survivability, pp 1–10. https://doi.org/10.1145/1456469.14
80. Li J, Chen X, Huang X, Tang S, Xiang Y, Hassan MM, Alelaiwi A (2015) Secure distributed deduplication systems with improved reliability. IEEE Trans Comput 64(12):3569–3579. https://doi.org/10.1109/TC.2015.2401017
81. Vishalakshi NS, Sridevi S (2017) Survey on secure de-duplication with encrypted data for cloud storage. Int J Adv Res Sci Eng Technol 4(1):3111–3117
82. Bibawe CB, Baviscar V (2017) Secure authorized deduplication for data reduction with low overheads in hybrid cloud. Int J Innov Res Comput Commun Eng 5(2):1797–1804. https://doi.org/10.15680/IJIRCCE.2017.0502105
83. Wu S, Li KC, Mao B, Liao M (2016) DAC: improving storage availability with deduplication-assisted cloud-of-clouds. Future Gener Comput Syst 74:190–198. https://doi.org/10.1016/j.future.2016.02.001
84. Wang J, Zhao Z, Xu Z, Zhang H, Li L, Guo Y (2015) I-sieve: an inline high performance deduplication system used in cloud storage. Tsinghua Sci Technol 20(1):17–27. https://doi.org/10.1109/TST.2015.7040510
85. Leesakul W, Townend P, Xu J (2014) Dynamic data deduplication in cloud storage. In: IEEE 8th International Symposium on Service Oriented System Engineering, pp 320–325. https://doi.org/10.1109/SOSE.2014.46
86. Sun Z, Shen J, Yong J (2013) A novel approach to data deduplication over the engineering-oriented cloud systems. Integr Comput Aided Eng 20(1):45–57. https://doi.org/10.3233/ICA-120418
87. Neelaveni P, Vijayalakshmi M (2016) FC-LID: file classifier based linear indexing for deduplication in cloud backup services. In: Bjørner N, Prasad S, Parida L (eds) Distributed computing and internet technology. LNCS, vol 9581. Springer, Cham, pp 213–222. https://doi.org/10.1007/978-3-319-28034-9_28
88. Li J, Chen X, Xhafa F, Barolli L (2015) Secure deduplication storage systems supporting keyword search. J Comput Syst Sci 81(8):1532–1541. https://doi.org/10.1016/j.jcss.2014.12.026
89. Shin Y, Koo D, Hur J (2017) A survey of secure data deduplication schemes for cloud storage systems. ACM Comput Surv (CSUR) 49(4):1–38. https://doi.org/10.1145/3017428
90. Pokale MS, Dhok S, Kasbe V, Joshi G, Shinde N (2017) Data deduplication and load balancing techniques on cloud systems. Int J Adv Res Comput Commun Eng 6(3):878–883. https://doi.org/10.17148/IJARCCE.2017.63205
91. Debnath BK, Sengupta S, Li J (2010) ChunkStash: speeding up inline storage deduplication using flash memory. In: Proceedings of USENIX Annual Technical Conference (ATC), pp 1–16
92. Dong W, Douglis F, Li K, Patterson RH, Reddy S, Shilane P (2011) Tradeoffs in scalable data routing for deduplication clusters. In: Proceedings of USENIX Conference on File and Storage Technologies (FAST), vol 11, pp 15–29
93. Li J, Qian X, Li Q, Zhao Y, Wang L, Tang YY (2015) Mining near duplicate image groups. Multimed Tools Appl 74(2):655–669
94. Ramaiah NP, Mohan CK (2011) De-duplication of photograph images using histogram refinement. In: Recent Advances in Intelligent Computational Systems (RAICS), IEEE, pp 391–395. https://doi.org/10.1109/RAICS.2011.6069341
95. Zargar AJ, Singh N, Rathee G, Singh AK (2015) Image data-deduplication using the block truncation coding technique. In: Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), International Conference on IEEE, pp 154–158. https://doi.org/10.1109/ABLAZE.2015.7154986
96. Hua Y, He W, Liu X, Feng D (2015) SmartEye: real-time and efficient cloud image sharing for disaster environments. In: IEEE Conference on Computer Communications (INFOCOM), pp 1616–1624. https://doi.org/10.1109/INFOCOM.2015.7218541
97. Li X, Li J, Huang F (2016) A secure cloud storage system supporting privacy-preserving fuzzy deduplication. Soft Comput 20(4):1437–1448. https://doi.org/10.1007/s00500-015-1596-6
98. Deshmukh AS, Lambhate PD (2016) A methodological survey on mapreduce for identification of duplicate images. Int J Sci Res (IJSR) 5(1):206–210
99. Rashid F, Miri A, Woungang I (2016) Secure image deduplication through image compression. J Inf Secur Appl 27:54–64. https://doi.org/10.1016/j.jisa.2015.11.003
100. Zheng Y, Yuan X, Wang X, Jiang J, Wang C, Gui X (2015) Enabling encrypted cloud media center with secure deduplication. In: Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, pp 63–72. https://doi.org/10.1145/2714576.271462
101. Yang X, Zhu Q, Cheng KT (2009) Near-duplicate detection for images and videos. In: Proceedings of the First ACM Workshop on Large-Scale Multimedia Retrieval and Mining, pp 73–80. https://doi.org/10.1145/1631058.1631073
102. Naturel X, Gros P (2005) A fast shot matching strategy for detecting duplicate sequences in a television stream. In: ACM Proceedings of the 2nd International Workshop on Computer Vision Meets Databases, pp 21–27. https://doi.org/10.1145/1160939.1160947
103. Li X, Lin J, Li J, Jin B (2016) A video deduplication scheme with privacy preservation in IoT. In: International Symposium on Computational Intelligence and Intelligent Systems. Communications in Computer and Information Science, vol 575. Springer, Singapore, pp 409–417. https://doi.org/10.1007/978-981-10-0356-1_43
104. Velmurugan K, Baboo LD (2011) Content-based image retrieval using SURF and colour moments. Global J Comput Sci Technol 11(10)
105. Li L (2014) Image matching algorithm based on feature-point and DAISY descriptor. J Multim 9(6):829–834. https://doi.org/10.4304/jmm.9.6.829-834
106. Lei Y, Qiu G, Zheng L, Huang J (2014) Fast near-duplicate image detection using uniform randomized trees. ACM Trans Multim Comput Commun Appl (TOMM) 10(4):1–15. https://doi.org/10.1145/2602186
123