J Supercomput (2018) 74:2035–2085
https://doi.org/10.1007/s11227-017-2210-8
Data deduplication techniques for efficient cloud storage management: a systematic review
Ravneet Kaur1 · Inderveer Chana1 · Jhilik Bhattacharya1
Published online: 20 December 2017
© Springer Science+Business Media, LLC, part of Springer Nature 2017
Abstract The exponential growth of digital data in cloud storage systems is currently a critical issue, as the large amount of duplicate data in such systems exerts an extra load on them. Deduplication is an efficient technique that has gained attention in large-scale storage systems. Deduplication eliminates redundant data, improves storage utilization and reduces storage cost. This paper presents a broad methodical literature review of existing data deduplication techniques along with various existing taxonomies of deduplication techniques for cloud data storage. Furthermore, the paper investigates deduplication techniques based on text and multimedia data along with their corresponding taxonomies, as these techniques pose different challenges for duplicate data detection. This research work is useful for identifying deduplication techniques based on text, image and video data. It also discusses existing challenges and significant research directions in deduplication for future researchers, and the article concludes with a summary of valuable suggestions for future enhancements in deduplication.
Keywords Data deduplication · Data reduction · Storage systems · Cloud computing · Big data
B Ravneet Kaur
[email protected]

Inderveer Chana
[email protected]

Jhilik Bhattacharya
[email protected]
1 Computer Science and Engineering Department, Thapar Institute of Engineering and Technology (Deemed to be University), Bhadson Road, Patiala 147004, India
1 Introduction
With the advent of cloud computing in recent years, data volumes in the cloud are increasing significantly due to the continued growth of the internet and the adoption of smartphones and social networking platforms. In 2011, the International Data Corporation (IDC) reported that the volume of data created and copied in the world will reach 35 ZB by 2020 [1]. Enterprises are facing problems in storing and processing such large data volumes. In order to enhance reliability and availability and provide disaster recovery, data are generally duplicated across multiple storage locations. Most of these duplicated data exert an extra load on the storage system in terms of additional space and the bandwidth needed to transfer the duplicated data over the network. Efficient data storage management is therefore critical, and data deduplication is considered an enabling technology for efficient storage [2] of big data. Deduplication is a special data compression technique [2] that eliminates redundant data and reduces network transmission and storage space in cloud storage systems [3,4]. These techniques find the duplicate data, save only one copy of it [2] and strategically use logical pointers for the duplicated data [3,5]. Deduplication addresses the growing demand for storage capacity [6]. Many cloud storage providers like Amazon S3, Bitcasa and Microsoft Azure [7] and backup services such as Dropbox and Memopal employ data deduplication techniques [8] to improve storage efficiency.
Deduplication techniques are data type specific, and different techniques are employed on different types of data such as text, image and video. All three types of data have different storage formats and implicit characteristics, so deduplication techniques use different processes to find and remove duplicate information depending on the type of data. The type of data is therefore important for the development of deduplication techniques, and the format of the information is critical for reading, finding and matching it. Bit-level matching is required to find duplication in executable files. The techniques to check duplicates in text, image and video involve different processes due to the varied formats of the data.
A minimum number of data replicas, called the replication factor, is maintained in a large distributed storage system to achieve high data availability. Any duplicate data above the replication factor are removed to reduce storage requirements, storage cost, computation and energy. Due to these significant benefits to industry, deduplication techniques for large distributed storage systems have gained momentum in academia and industry. Still, these techniques face challenges regarding the efficiency and efficacy of data matching, and researchers in academia and industry are working to develop efficient distributed deduplication techniques.
Figure 1 represents data deduplication, in which duplicate segments of the same data are reduced to unique segments. The whole file is divided into fixed- or variable-size segments. With the deduplication process, only a single copy of each segment is stored [10], and pointers are used for duplicate segments. If the deduplication engine comes across a piece of data that is already stored somewhere in the storage system, it saves, in place of the data copy, a pointer that leads back to the original copy. This frees up blocks in the storage system and thus frees memory space.
Fig. 1 Data deduplication process
1.1 Motivation
After assessing the current research in deduplication techniques, a need was felt to comprehensively review the literature available on the subject. This section summarizes the motivations, contributions and novelty of this article.
(i) The role of deduplication techniques in improving the performance of a large storage system has been discussed. The necessity of a deduplication technique and its merits and demerits have also been studied.
(ii) The existing deduplication techniques have been categorized as storage based, point-of-application based and level based. Further, deduplication techniques have been classified based on text, image and video. The parameters of text deduplication techniques and their importance have been explained. A comparison of text-, image- and video-based deduplication and their classification based on various taxonomies are presented. This paper thus reviews the literature and presents a comprehensive view of deduplication techniques.
(iii) Future research directions in the field of deduplication have been highlighted for researchers in academia and industry.
This paper draws on quality journals, proceedings of various conferences and reports of many research centers. The article is organized into six sections. Section 2 describes the background, the evolution of deduplication, redundant data reduction techniques and their comparison, and the merits and demerits of deduplication techniques. Section 3 presents the review method, research questions and the research methodology used to select and review the previous research material, along with a framework for analysis and discussion of that material. Section 4 presents a generic deduplication process and a taxonomy of deduplication techniques; the techniques are further categorized based on text, image and video. Section 5 discusses open challenges and future research directions in the area of deduplication techniques. Section 6 concludes the review and provides recommendations for future research.
2 Background
With the impetuous growth of data, the term big data came into existence, used mainly to describe massive datasets [9]. Big data typically includes unstructured data that require more real-time analysis compared to traditional datasets. The frequency with which data are generated is crucial and a major challenge to handle [10]. This section first provides the necessary background on traditional data compression approaches and data deduplication approaches. Detailed studies of data deduplication have been conducted by Paulo and Pereira [5] and Xia et al. [11]; both surveys have extensively explored data deduplication techniques in storage systems. This survey summarizes existing text-, image- and video-based deduplication techniques and is an enhancement of the existing surveys.
2.1 Evolution of deduplication
In the early 1950s, data reduction techniques [11,12] were introduced, which can be categorized into lossless and lossy [12] data reduction techniques. Later, in the 1990s, a space-efficient approach, intelligent delta compression, was proposed to target the compression of very similar files or similar chunks. The term data deduplication was coined in the early 2000s to help large storage systems at a high granularity level [1]. Deduplication eliminates both inter-file- and intra-file-level redundancy over large datasets across multiple distributed storage servers, unlike traditional data compression techniques, which remove redundancy over a small group of files based on intra-file redundancy. Cryptographic hashes of each file or chunk are calculated to identify the duplicates. These techniques were also applied to multimedia content in 2008, where duplicate multimedia content is found by measuring the similarity of images or video frames using feature extraction and hashing techniques. These data deduplication techniques came into existence to solve the issue of increasing data size in storage systems [13]. Huffman coding and dictionary coding are traditional data compression techniques that work at the byte or string level [11–13], while deduplication techniques eliminate redundancy at the file or chunk level.
Redundant data reduction techniques came into existence in the 1950s [11], largely as lossless and lossy data compression techniques, followed by delta compression in the 1990s. Data deduplication techniques followed in 2000, and multimedia deduplication later. Figure 2 shows the evolution of deduplication techniques, which are discussed in the following sections.
Fig. 2 Evolution of deduplication: Data Reduction Techniques (1950), Delta Encoding Techniques (1990), Data Deduplication Techniques (2000), Deduplication Techniques on Multimedia Data (2011)
2.2 Redundant data reduction techniques
These redundant data reduction techniques have been developed to cope with the increasing amount of digital data and to identify redundancy from the byte to the string level and from the chunk to the file level. The organization of redundant data reduction techniques and their evolution are shown in Fig. 3.
Data compression is a bit-rate reduction approach that represents information in compact form. It reduces the required storage space by finding redundant data. Data compression is broadly classified into lossless and lossy compression techniques [11,12]. In lossless compression, the exact original data are reconstructed from the compressed data. Lossy compression reduces the data by discarding unnecessary information, as in JPEG image compression; it reconstructs an approximation of the original data. Video and audio data are compressed using lossy compression techniques [12]. This section presents the necessary background on redundant data reduction techniques, showing the evolution of traditional lossless data compression approaches, delta compression techniques and data deduplication approaches. A taxonomy of all the approaches and their evolution is presented.
2.2.1 Lossless data compression techniques
The term data compression was coined by Claude E. Shannon. Techniques like entropy encoding, run-length encoding and dictionary encoding are lossless data compression techniques [14]. A string of characters is represented by short bit sequences; a large amount of redundant data is found in such strings and removed from the data patterns.
(a) Byte level The early data compression techniques use entropy encoding to identify redundancy at the byte level. Huffman coding and arithmetic coding are types of entropy encoding used to represent frequently occurring patterns with fewer bits. Huffman coding, developed by David A. Huffman, uses a frequency-sorted binary tree to generate the optimal prefix code [12]. It replaces fixed-length codes with variable-length codes [12,15], so frequently used symbols receive shorter encodings [12] than less frequently used symbols. Arithmetic coding, developed by Elias in 1960 [16], encodes an entire message into a fixed floating point number.
Fig. 3 Organization of redundant data reduction techniques
A string of characters is represented using a fixed number of bits per character [12,15,16]. Frequently used characters are stored with fewer bits, and less frequently used characters with more bits. The entropy encoding-based approaches have limited scalability and are not efficient for large storage systems.
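To make the frequency-sorted binary tree concrete, the following minimal Python sketch (illustrative only, not part of the original paper) builds Huffman codes so that frequent symbols receive shorter prefix codes:

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # One weighted leaf per distinct symbol: [weight, [symbol, code]].
        heap = [[w, [sym, ""]] for sym, w in sorted(Counter(text).items())]
        heapq.heapify(heap)
        if len(heap) == 1:                      # degenerate case: one symbol only
            return {heap[0][1][0]: "0"}
        while len(heap) > 1:
            lo = heapq.heappop(heap)            # two least frequent subtrees
            hi = heapq.heappop(heap)
            for pair in lo[1:]:
                pair[1] = "0" + pair[1]         # 0-branch for the lighter subtree
            for pair in hi[1:]:
                pair[1] = "1" + pair[1]         # 1-branch for the heavier subtree
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        return dict(heap[0][1:])

    codes = huffman_codes("abracadabra")
    # 'a' occurs 5 times and gets a shorter code than 'c' or 'd' (1 occurrence each)
    assert len(codes["a"]) < len(codes["c"])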
(b) String level A string-level approach was proposed to search for and eliminate repeated strings. The two main approaches at this level are LZ77/LZ78 and LZW/LZO [11]. LZ77/LZ78 [13,14], proposed by Lempel and Ziv in the 1970s [11], is a dictionary-based approach that uses a sliding window to detect and eliminate repeated sets of strings. LZW/LZO, proposed by Terry Welch in the 1980s, is a variant of LZ compression that speeds up [11] or improves the compression process. The organization of redundant data reduction techniques is presented in Fig. 3.
2.2.2 Delta compression techniques
In the 1990s, delta compression came into existence to target the compression of similar files or chunks [11]. Its most widely used applications are remote synchronization and backup storage systems. It uses a byte-wise sliding window to find matched strings between similar chunks; the differences between the sequential file and the complete file are stored in the form of "deltas" or "diffs" [11,13]. The string level is one of the delta compression techniques.
String level Xdelta and Zdelta are delta compression techniques that use a byte-wise sliding window [13] to identify duplicate strings between a source chunk and a target chunk for delta calculation. However, this is a very time-consuming approach and is not scalable.
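The copy/insert idea behind delta encoding can be sketched in a few lines of Python using the standard difflib module. This is an illustration of the principle only; real tools such as Xdelta use byte-wise sliding windows and much faster matching, and the helper names here are hypothetical:

    from difflib import SequenceMatcher

    def make_delta(source, target):
        # Encode target as COPY (reuse a source range) / INSERT (literal bytes) ops.
        ops = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(
                None, source, target, autojunk=False).get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2 - i1))
            elif tag in ("replace", "insert"):
                ops.append(("insert", target[j1:j2]))
            # 'delete' ranges exist only in the source and need no op
        return ops

    def apply_delta(source, ops):
        parts = []
        for op in ops:
            if op[0] == "copy":
                _, start, length = op
                parts.append(source[start:start + length])
            else:
                parts.append(op[1])
        return b"".join(parts)

    old = b"the quick brown fox jumps over the lazy dog"
    new = b"the quick red fox jumps over the lazy cat"
    delta = make_delta(old, new)    # far smaller than storing `new` whole
    assert apply_delta(old, delta) == new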
2.2.3 Data deduplication techniques
Data deduplication was first proposed in 2000 to support global compression at coarse granularity [11]. The previous approaches are very time-consuming in identifying similar chunks and are not scalable, whereas data deduplication techniques can be applied at the file level or sub-file level. Deduplication compresses data by using fixed- or variable-size chunks: the hash values of these chunks are generated using cryptographic hash functions, and duplicates are detected by matching hash values.
(a) File-level deduplication These techniques are applied at the file level, and the file is considered a single unit. The system checks the backup file index to compare the attributes stored for the file [3]. If the same file exists, it adds a pointer to the existing file; otherwise, it updates and stores the index value. Only one instance of the file is saved, so this is also called single-instance storage. Whole-file hashing is simple to apply: file hashes are easy to generate and require relatively little processing power. However, a change of even one byte in a file triggers the generation of a different hash value that requires separate storage. This issue of file-level deduplication led to the introduction of block-level deduplication techniques; a minimal sketch of whole-file deduplication follows.
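A minimal Python sketch of single-instance storage, under the assumption that files are identified purely by a SHA-256 digest of their content (the helper names are illustrative, not from the paper):

    import hashlib

    def file_digest(path):
        # SHA-256 over the whole file identifies its content.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    def single_instance_store(paths):
        store = {}                      # digest -> the one stored instance
        index = {}                      # every path -> digest (a logical pointer)
        for path in paths:
            digest = file_digest(path)
            if digest not in store:     # first occurrence: keep one copy
                store[digest] = path
            index[path] = digest        # duplicates only reference the copy
        return store, index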
(b) Sub-file-level (block-level) deduplication In these techniques, a file is broken into multiple smaller blocks of fixed or variable size [9].
MD5, SHA-1, Rabin fingerprinting and similar hash algorithms are used to identify identical blocks. A unique block is written to disk and its index is updated; otherwise, a pointer is added to the original location of the matching data block. Block-level deduplication requires more processing power because the number of identifiers that need to be processed increases significantly. It is further categorized as fixed-length or variable-length deduplication.
• Fixed-length block deduplication Fixed-length approaches examine blocks of data with a predetermined length, dividing files into fixed-size blocks [9]. The system does not back up the same block of data twice. The main advantage of this approach is its simplicity; however, a single character inserted into the data shifts all following data by one byte, so all subsequent data blocks are backed up again. This shortcoming of fixed-length block approaches led to the innovation of variable-length techniques.
• Variable-length block deduplication Variable-length deduplication divides the file into variable-length data blocks, using different methods to determine the block length. This allows the boundaries of data blocks to "float" within the data stream, so that changes in one part of a block have no impact on the boundaries at other block locations [9]. The file is partitioned in a content-dependent manner, and a segment may be any number of bytes in length within a range. This provides greater granularity control and flexibility to insert data into a block (see the sketch after this list).
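A minimal Python sketch of content-defined chunking under simplifying assumptions: a toy polynomial rolling hash stands in for true Rabin fingerprinting, and all parameter values are illustrative:

    def cdc_chunks(data, window=48, mask=0x0FFF, min_len=2048, max_len=16384):
        # Cut a chunk where a rolling hash over the last `window` bytes matches
        # a bit pattern, so boundaries depend on content, not on byte offsets.
        MOD, PRIME = 1 << 61, 257
        top = pow(PRIME, window - 1, MOD)   # weight of the byte leaving the window
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            if i - start < window:
                h = (h * PRIME + byte) % MOD                  # filling the window
            else:
                gone = data[i - window]
                h = ((h - gone * top) * PRIME + byte) % MOD   # slide by one byte
            length = i - start + 1
            if (length >= min_len and (h & mask) == mask) or length >= max_len:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])                       # trailing chunk
        return chunks

Because each cut point depends only on the bytes inside the rolling window, inserting one byte early in the stream changes only the chunks near the edit, whereas with fixed-size blocks the same insertion would shift every later boundary.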
2.3 Comparison of lossless compression, delta compression and data deduplication
Table 1 compares the redundant data reduction techniques based on parameters like target data, granularity level, approaches, scalability, evolution and processing time. The table is adapted and enhanced from tabular data in [13].
Table 1 Comparison of redundant data reduction technologies

Parameter | Lossless data compression | Delta compression | Data deduplication
Target data | All data | Similar data | Duplicate data
Granularity | Byte or string level | String level | Chunk or file level
Approaches | Huffman coding, dictionary coding | KMP-based copy and insert | Content-defined chunking, hashing or fingerprinting
Scalability | Weak | Weak | Strong
Evolution (year) | 1950s | 1990s | 2000
Processing time | High (Huffman/dictionary coding) | Optimized by using Rabin–Karp string matching | Less, as file- or sub-file-level deduplication is applied
2.4 Merits and demerits of deduplication
Deduplication has significant advantages in storage systems, although the techniques require resources to deploy. This paper highlights the important merits and demerits of deduplication techniques.
2.4.1 Merits of deduplication
The following merits of deduplication are identified and presented below:
(i) Reduces storage space Deduplication assists in reducing the storage space required for backups, files or other data applications. As only a unique copy of the data is stored and duplicate copies are removed, it creates more free space to store more data.
(ii) Improves network bandwidth As unique copies are stored on disk and logical pointers are created for duplicate data, there is no need to transmit duplicate copies over the network. Deduplication thus helps in reducing network bandwidth requirements.
(iii) Reduces energy consumption Deduplication is a storage optimization technique that reduces storage and energy requirements. The reduced storage space requires less electricity and cooling; thus, it saves energy and reduces the load on system resources.
(iv) Reduces overall storage cost Deduplication yields significant savings in terms of time, space, network bandwidth, human resources and budget, leading to better efficiency and efficacy of the storage system.
2.4.2 Demerits of deduplication
(i) Impact on storage performance In primary storage systems, the fixed-size approach leads to multiple chunks stored at different memory locations. This causes fragmentation, which adversely impacts performance. A deduplication technique also requires additional resources like CPU, memory and bandwidth for its execution, and any inefficient technique impacts the performance of a large storage system.
(ii) Loss of data integrity Data blocks are indexed through hash values for faster lookup. Identical hashes can be generated for different data blocks due to hash collisions, which can cause loss of data integrity. Hash collisions must therefore be carefully addressed to avoid any loss of data and its integrity.
(iii) Backup appliance issues Data deduplication may require a separate hardware device to transfer and process data. Such a backup appliance may incur additional cost and impact storage performance.
(iv) Privacy and security Deduplication techniques have full access to the complete storage, which can be exploited to gain complete access to it. The security of deduplication techniques should be carefully designed to guard the system from such security breaches and the loss of private data.
(v) Reduced availability Data are duplicated to provide high availability in a large distributed storage system, so any reduction of such duplicate copies affects the availability of the storage system. A minimum number of copies needs to be retained to maintain high availability.
3 Review method
The systematic review of deduplication techniques in large-scale storage systems is based on the guidelines proposed by Kitchenham et al. [17,18]. This section presents the purpose of this research and the steps involved in conducting the review: creating a review framework, elaborating the methodology in depth, discussing the findings and exploring new challenges.
3.1 Planning of review
The procedure involves the list of research questions on deduplication techniques in Sect. 3.2, the search of different databases, and the identification and analysis of the existing techniques. The primary studies were searched exhaustively, either through electronic database queries or through manual searches of refereed journals and conference proceedings. The set was then refined by exclusion criteria and identification of primary studies, followed by data extraction and synthesis.
3.2 Research questions
This systematic review focuses on the identification and classification of the existing literature on deduplication techniques. The review questions were identified after discussion with the co-authors and are listed in Table 2. The main objective of this review is to present the consolidated latest research work on data deduplication concepts by answering the review questions.
3.3 Study selection procedure
The search process was refined with the help of search keywords. The research literature found was selected based on a detailed survey focusing on the identification and classification of the existing literature on data deduplication, multimedia deduplication, deduplication techniques and their various challenges. The research methodology adopted in this study is based on finding relevant research papers in different databases and then listing the questions that are to be addressed.
3.4 Sources of information
Scholarly online electronic databases were selected to find relevant information from a wide range of publications. The following electronic databases were searched for research articles.
Table 2 Review questions and motivation

Sr. no. | Review questions | Motivation
1 | What is data deduplication? Discuss its evolution and its merits and demerits. (a) Discuss the deduplication process. (b) How can we categorize deduplication techniques? | It will explore data deduplication, its evolution, advantages, disadvantages and various categories, and will further help researchers to gain a deep understanding of the complete deduplication process. The main aim of this review is to clearly understand the concept, current status, issues and future requirements of deduplication that improve storage efficiency
2 | What are the deduplication techniques applied on text data? Discuss the categorization of text deduplication techniques and their key findings. | It will discuss text-based deduplication techniques and their key findings. Text-based deduplication techniques are further categorized based on parameters
3 | What are the deduplication techniques applied on multimedia? Discuss the categorization of image and video deduplication techniques and their key findings. | It will explore multimedia-based deduplication techniques, which are further categorized as image based and video based. These techniques have posed different challenges than text-based deduplication techniques; the aim is to gain deep knowledge of multimedia deduplication techniques
4 | Discuss the various research opportunities in the field of deduplication | It aims to provide information about open research issues for prospective researchers working in the field of deduplication. The open research areas and sub-areas of deduplication are discussed
Table 3 Search strings (2005–2017)

Sr. no. | Keywords | Synonyms
1 | Deduplication | Data deduplication
2 | Deduplication storage | Data deduplication in storage system
3 | Data deduplication | Data deduplication in cloud
4 | Deduplication architecture | Architecture of deduplication
5 | Deduplication techniques | Techniques of deduplication
6 | Deduplication tools | Simulation tools in deduplication in cloud
7 | Deduplication evolution | Review of existing research in deduplication
8 | Deduplication analysis | Analysis of research gaps in deduplication
9 | Deduplication comparison | Comparison of existing research
10 | Need for deduplication | Practical benefits of deduplication on storage systems
11 | Image-based deduplication | Virtual images deduplication on cloud
12 | Images deduplication | Image and video deduplication
13 | Deduplication similar images | Exact or near-exact deduplication
14 | Deduplication duplicate images | Image fingerprinting

Content types for all rows: Journal, Conference, Workshop, Magazine and Transaction (2005–2017).
• Springer
• IEEE Xplore
• Science Direct
• ACM Digital Library
• Elsevier
• Google Scholar
3.4.1 Other sources
Apart from the above sources, books, technical reports and online literature relevant to this review are included. The central objective is to widen the scope of literature coverage and give comprehensiveness to this review. The following sources are categorized:
• Other review articles
• Books and technical reports
• Tools and other online sources
3.5 Search criteria
The initial search criteria involve the titles ("deduplication"), ("techniques of deduplication"), ("deduplication techniques applied in cloud computing") and ("deduplication based on images"). The keyword "deduplication" appears in almost all the searched abstracts. The lookup for relevant information is quite wide in terms of period, coverage and quantity, which makes this a time-consuming review method.
Table 3 presents the search strings based on keywords, synonyms and period of search. Papers were included from different journals, conferences and workshops. Although we deliberated in choosing the search strings to ensure broadness, some research papers still did not appear in the searches, for example because an article did not contain a search string in its title or abstract. The research community also uses the phrases "deduplication in storage system" and "deduplication of images in storage system"; an attempt was therefore made to identify such articles and include them by searching manually with these keywords.
3.6 Inclusion and exclusion criteria
To ensure the coherence of our search, the search process was carried out extensively, and the selection procedure is shown in Fig. 4.
Fig. 4 Systematic review technique
Several research papers were excluded because either the title was inappropriate or the abstract did not provide sufficient detail. As the search strings "deduplication" and "storage system" are widely used in other fields of research, the quantity of irrelevant articles was large. As shown in Fig. 4, the initial search returned over 754 papers; these were reduced to 354 articles based on relevant titles, 205 were left after reading the abstract and conclusion, and 180 were left based on the full text, i.e., those relevant for the literature review. Finally, 128 papers were selected by the principle of inclusion and exclusion shown in Fig. 4.
4 Data deduplication
Data are growing exponentially [19,20] in cloud storage services, and these data are duplicated on distributed storage systems for high reliability, availability and disaster recovery. The minimum number of replications of data, called the replication factor, is important to guard the system against disasters and to provide high availability. Any copies beyond the replication factor should be removed from the storage system; otherwise, duplicate data exert more pressure on it in terms of space and bandwidth. To reduce or control this data duplication, deduplication techniques are applied to make the storage system more efficient in terms of cost and utilization. The application of deduplication techniques depends upon the type of data, such as structured, unstructured and semi-structured; these data can further be classified as text, image and video.
Duplicate data affect storage performance, the efficiency of the storage system and network bandwidth [21]. This has led researchers to focus on the development of optimized deduplication techniques for storage systems, including the deletion of duplicated data and efficient data delivery. Deduplication can be defined as a technique that automatically removes duplicate data in storage systems. Data reduction through deduplication has been reported by Microsoft and NetApp. Microsoft conducted a study on the file system to estimate the balance in space savings between whole-file- and sub-file-level deduplication on 857 desktop Windows machines over a period of 4 weeks [22]; in that study, whole-file-level deduplication achieves 75% of the space savings, while block-level deduplication achieves 32% of the original requirements. Data deduplication has also been applied to a digital library, where duplicated metadata bibliographic records were identified using similarity functions on two real datasets [23]: the metadata records of two real digital libraries (BDB-Comp and DBLP) and article citation data from the Cora collection. The results show that the quality of metadata deduplication improves from 2 to 62% on the digital library dataset and from 7 to 188% on the article citation dataset. NetApp reports that deduplication can reduce 95% of duplicate data in storage systems [24]; experimental results show typical savings of 95% for backup, 72% for VMware, 30% for email and 35% for file services [24].
4.1 Generic deduplication process
The generic process of deduplication consists of data chunking, fingerprint calculation, index lookup [10,25] and the chunk store, as shown in Fig. 5.
Fig. 5 Steps of deduplication process
Index lookup plays an important role in finding duplicate chunks [25]. In Fig. 5, the file processed for deduplication is first broken down into fixed- or variable-size blocks referred to as objects. Data deduplication compares blocks and eliminates those with the same fingerprints; the unique blocks are stored and the index is updated. The four generic steps of the deduplication process are as follows.
• A hash value is first calculated for each chunk of data using a cryptographic hash function.
• A comparison is made between the hash values of the chunks and the existing hashes.
• Matching hash values identify a duplicate chunk, and the data are replaced with a logical pointer to the object already present in the database.
• Otherwise, the new chunk is added and the index is updated.
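These four steps can be condensed into a short Python sketch. It is an illustration under simplifying assumptions (fixed-size chunks, SHA-256 fingerprints and an in-memory dictionary as the chunk store), not an implementation from the reviewed literature:

    import hashlib

    class DedupStore:
        def __init__(self, chunk_size=4096):
            self.chunk_size = chunk_size
            self.index = {}                 # fingerprint -> stored chunk bytes

        def write(self, data):
            recipe = []                     # logical pointers rebuilding the file
            for i in range(0, len(data), self.chunk_size):
                chunk = data[i:i + self.chunk_size]
                fp = hashlib.sha256(chunk).hexdigest()   # fingerprint the chunk
                if fp not in self.index:                 # index lookup
                    self.index[fp] = chunk               # store only new chunks
                recipe.append(fp)
            return recipe

        def read(self, recipe):
            return b"".join(self.index[fp] for fp in recipe)

    store = DedupStore()
    r1 = store.write(b"A" * 8192)
    r2 = store.write(b"A" * 8192)     # a duplicate write stores nothing new
    assert len(store.index) == 1 and store.read(r2) == b"A" * 8192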
4.2 Classification of deduplication techniques
The existing deduplication techniques have been classified based on storage type (primary and secondary deduplication), point of application (source and target based), processing time (inline and post-process deduplication), level (local- and global-level deduplication) and cloud. Figure 6 presents a taxonomy of data deduplication based on four parameters: storage based, type based, timing based and level based.
They are further classified as follows [4]:
• Storage based—primary and secondary storage-based deduplication
• Type based—source- and target-level-based deduplication
• Timing based—inline- and post-process-based deduplication
• Level based—local deduplication and distributed/global deduplication
Fig. 6 Taxonomy of data deduplication techniques
4.2.1 Storage-based deduplication
Deduplication has been classified based on the type of storage: it is applied on either primary storage [26] or secondary storage [27]. Table 4 depicts storage-based deduplication techniques.
• Primary storage Primary storage-based deduplication runs on main memory or active storage that is directly accessible to the CPU, which continuously reads and executes instructions as required. It is mainly used for primary workloads, which are latency sensitive [3,26]. The in-memory data of mail servers are an example of primary storage.
• Secondary storage This is an auxiliary or external storage system [27] that does not have direct access to the CPU. It backs up primary storage data for data protection and recovery, and such systems are accessed only for retrieving old data and for data retention. Examples are storage archives, snapshots and backup storage.
4.2.2 Type-based deduplication
The deduplication process is executed either on the source side or on the target side. Based on these two options, deduplication is characterized as source-based or target-based deduplication.
• Source-based deduplication The complete deduplication is performed on the data at the source side before the data are transferred to the backup target [28]. Software installed on the servers uses the CPU and memory of the source side and checks for duplicates before transferring data to the backup server. This reduces the bandwidth, storage and time required to back up data; at the same time, it consumes additional CPU and I/O resources to detect the duplicates.
Table 4 Deduplication applications used in different storage systems

Deduplication categories | Significance | Deduplication advantages | Tools and techniques
Primary storage systems | Main or active storage systems, used for primary workloads. Examples: mail servers or user home directories | Reduce primary storage space and cost; make storage energy efficient | iDedup [32], dDedup, SDFS, Ocarina, Permabit, ZFS, POD [33], HPDedup, GHOST [34]
Secondary storage systems | Auxiliary storage systems, infrequently accessed. Examples: storage archives and backup storage | Reduce secondary/backup storage space and cost | Sparse indexing [35], DDFS [36], HydraStor [37], RevDedup [38], SiLo [39], DEDE
Virtual machine systems | Virtual storage [40] and processing through VMs | Virtual machine storage efficiency; reduced VM migration time [30] | Liquid [41], HOPE [31], VMware ESX [42], VMflock, DEDE [43]
Network systems | Distributed storage and caching for wide area network (WAN) storage optimization | Reduce time to store and process network storage on a WAN | SmartRE [44], EndRE [45]
SSD-based multimedia storage systems | Fast SSD storage and access of multimedia content | Reduce storage space and cost of solid-state devices (SSDs); make storage energy efficient; reduce multimedia size and its processing cost | ViDedup [46], Nitro [47], UQLIPS [48], CAFTL [49]
Cloud storage systems | Cloud storage providing access to private, public and hybrid users | Improve cloud storage efficiency and cost; reduce bandwidth utilization | Cumulus [50], NED [51], SAR [52], CABdedupe [53]
• Target-side deduplication Deduplication is performed on the targeted storage device [4], commonly on backup servers. Dedicated hardware deduplication appliances handle all deduplication functionality [28]. This improves storage utilization at the additional cost of the dedicated appliance. There is no overhead on the data source, so it is used for large storage systems; it requires additional network resources and is further classified as inline or post-process, as discussed below.
4.2.3 Timing-based deduplication
Timing-based deduplication refers to the time when the deduplication algorithm is applied; it places a constraint on when deduplication operations, such as searching for duplicates, are performed [5]. These operations can be done synchronously (in-band) or asynchronously (out-of-band). Timing-based deduplication is further categorized into inline deduplication and post-process deduplication.
• Inline deduplication Deduplication is performed at the source side, before the data are written to disk [28], so there is no need for additional disk space to hold and protect the data to be backed up. This increases efficiency, as the data are passed and processed only once, but inline deduplication requires additional computation.
• Post-process deduplication Deduplication is done after the backup data are written temporarily to a storage system, i.e., to disk. It is also called offline deduplication [28]. It is usually faster than inline deduplication, as it helps in reducing the backup time.
4.2.4 Level-based deduplication
Data deduplication can be categorized as local-level and global-level deduplication, as discussed below.
• Local-level deduplication Local data deduplication supports deduplication at the local area network (LAN) level. It is practical only within a single VM and detects replicas on a single node. It has a negative impact on performance, as it cannot completely remove all the duplicates [29]. It performs slightly better in multi-node deployments, as it can exploit parallelism and indexing with an increased number of nodes while maintaining data availability.
• Global-level deduplication Global-level deduplication, also known as common file elimination, is performed in a distributed environment, i.e., across multiple datasets. It is also known as multi-node deduplication and uses a cluster of multiple nodes that work together as a unit. Data sent to one node in the cluster are compared with previous data sent to that appliance and with the data sent to any other node in the cluster. The main goal is to apply deduplication on distributed storage that uses multiple storage servers. It eliminates redundant disk accesses and removes all
possible replicas within or across VMs. However, it incurs additional hashing overhead [29].
4.2.5 Cloud-based data deduplication on storage systems
Data deduplication is widely employed in cloud storage [30,31] environments, in backup [31] and archive storage systems, as it helps in reducing storage space requirements and storage cost. Deduplication reduces the internet bandwidth used over the network, i.e., the amount of data uploaded to the cloud, as only one physical copy is stored instead of duplicate copies. It improves the speed of cloud backup [31], resulting in faster and more efficient data protection operations.
Deduplication to cloud storage can be set up using direct deduplication to the cloud, deduplication to the cloud on secondary storage copies, or deduplication using a cloud gateway. Deduplication can be employed in storage systems ranging from primary and secondary storage to virtual machines [30] and cloud storage systems; private, public and hybrid cloud storage systems all exploit the advantages of deduplication. Table 4 lists deduplication applications used in different storage systems, along with their significance, advantages, and tools and techniques. It will help researchers to identify deduplication techniques for primary and secondary storage systems, virtual machine systems, network systems, SSD-based multimedia storage and cloud storage systems.
4.3 Classification of data deduplication techniques based on type of data
Data deduplication techniques are data type specific. Text, image and video are the three main types of data, and all three have different storage formats and implicit characteristics. The format of the stored information is critical for finding matching information; as a result, data type is an important parameter for the development of deduplication techniques. It is among the primary focus domains these days, and researchers are dedicating their time and attention to applying deduplication efficiently to remove duplicates and redundancy. Text, image and video data are highly redundant [54] on the Internet, and with the advent of social networking platforms, such redundancy has grown and exerts additional load on cloud storage systems [54]. Finding duplicate data on such a large heterogeneous platform is a challenging task for researchers in industry and academia. Figure 7 presents deduplication techniques based on type of data. Text data deduplication is further classified as file level and sub-file level. Multimedia deduplication techniques are categorized as image and video. Image-based deduplication is further categorized as exact image deduplication and near-exact deduplication; video deduplication is also called frame-based deduplication.
4.3.1 Text deduplication
In text-based deduplication, a byte-by-byte comparison is made to get an exact match of the text, and duplicates are thereby identified. It works at the file level and sub-file level.
Fig. 7 Classification of deduplication techniques based on type of data
File-level deduplication, also known as single-instance storage, works at the file level by eliminating duplicate files. Sub-file-level, also known as block-level, deduplication can use fixed-size or variable-size blocks. The fixed-length approach examines blocks of data with a predetermined length, dividing files into fixed-size blocks [9]; variable-length deduplication divides the file into variable-length data blocks.
4.3.2 Multimedia deduplication
Multimedia deduplication is further categorized as image-based and video-based deduplication. Image deduplication techniques are based on image detection techniques, which are classified as exact image detection and near-exact image detection. Exact image deduplication targets exact duplicate images without considering image transformations. Near-exact images are modified or copied versions of an original image, produced by cropping, modification, scaling, adding noise, compression, rotation, etc. The techniques to find such duplicates differ in approach and accuracy.
Video deduplication techniques are frame-based deduplication techniques. A video is first converted into frames, and these frames are represented by visual features [55]. Based on the visual features or descriptors, hash values are generated from each keyframe of the video, and the visual features are used to detect duplicate frame sequences between a queried video and the video library [55].
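As a simplified illustration of frame-based matching (not a method from the reviewed papers; real systems use descriptors such as SIFT on keyframes), the following Python sketch computes a perceptual "average hash" per frame and compares frames by Hamming distance:

    def average_hash(frame):
        # `frame` is assumed to be a 2-D list of grayscale pixel values
        # (e.g., a downscaled 8x8 keyframe); one bit per pixel is set
        # depending on whether the pixel is brighter than the frame mean.
        pixels = [p for row in frame for p in row]
        mean = sum(pixels) / len(pixels)
        return sum(1 << i for i, p in enumerate(pixels) if p > mean)

    def hamming(h1, h2):
        # Number of differing bits between two hashes.
        return bin(h1 ^ h2).count("1")

    def near_duplicate(frame_a, frame_b, threshold=5):
        # Keyframes whose hashes differ in only a few bits are near-duplicates,
        # so a queried video can be matched against a library keyframe by keyframe.
        return hamming(average_hash(frame_a), average_hash(frame_b)) <= threshold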
4.4 Comparison of text-based deduplication and image-based deduplication
Text-based deduplication is compared with image-based deduplication based on partitioning methods, indexing/hashing techniques and lookup methods, storage format, matching technique and accuracy. Table 5 lists these parameters and their comparative description under text-based and image-based deduplication.
Table 5 Comparison between text-based deduplication and image-based deduplication

Parameters | Text-based deduplication | Image-based deduplication
Partition method | Data partitioning is done in the form of fixed- or variable-size chunks | Image preprocessing is done and features are extracted
Techniques used | Cryptographic hash functions are used to calculate hashes of the chunks | Features are extracted using feature extraction techniques such as SIFT, SURF and FAST; hashing techniques are then applied to the image features
Index lookup | Exact matching for index lookup to detect duplicate files or chunks in the storage systems | Approximate matching for index lookup to match similar images
Storage | Only one copy of a file is stored and duplicates are removed | Centroid selection is done; the centroid image is stored and near-exact image transformations are stored in the form of transformation matrices
Matching technique | Byte-by-byte comparison to get an exact match of text and find duplicates | The number of identical elements in the features extracted from images is compared to detect duplicate images
Accuracy | Complete matching is done | Exact or near-exact images are detected
As video is converted into frames to employ image-based deduplication techniques, both being based on visual content, the comparison of text-based and image-based deduplication in Table 5 covers video as well. Table 5 is adapted and enhanced from tabular data in [56].
4.5 Text-based deduplication techniques
Various works have been reported on text-based deduplication techniques for different storage systems, such as secondary storage systems and backups. Several authors have discussed text-based deduplication techniques, which are broadly categorized based on granularity, locality, indexing, security and cloud. Figure 8 presents a taxonomy of text-based deduplication techniques.
Fig. 8 Taxonomy of text-based deduplication techniques
(a) Taxonomy of text deduplication based on granularity Granularity is the method of dividing data into chunks and is a fundamental factor in removing duplicates. The chunks may be of fixed or variable size, and the file and sub-file are the two levels of granularity. In file-level granularity, deduplication is performed on the file; this is also known as object granularity. Sub-file-level granularity divides a file into fixed- or variable-size chunks or blocks [5]. Variable chunks at the sub-file level provide better matching efficiency, and finer chunk granularity improves the efficiency and precision of detecting duplicates [57], at the cost of additional computation and memory. Table 6 lists the taxonomy of text deduplication techniques based on granularity proposed in research articles, their descriptions and key findings.
(b) Taxonomy of text deduplication based on locality Storage systems exploit locality in caching strategies and on-disk layout [5]. Two types of locality are employed in storage: temporal locality and spatial locality. With temporal locality, chunks referenced recently at a particular memory location are expected to be referenced at the same location in the near future [5], so duplicate chunks appear several times within a short time. With spatial locality, chunks with nearby addresses are referenced close together in time: if a particular memory location is referenced at a particular time, it is very likely that nearby memory locations will be referenced in the near future. Table 7 depicts the taxonomy of text deduplication techniques based on locality, their descriptions and key findings suggested in research articles.
(c) Taxonomy of text deduplication based on indexing Indexing provides an efficient data structure for looking up duplicated data [5]. To search for exact duplicates, hashing is used to summarize content, which leads to the identification of signatures. Hash computation needs additional CPU resources and hash collision avoidance methods; collisions can be avoided by comparing the contents of two chunks with identical signatures. Rabin fingerprinting [68] is another technique used to compare the similarity of two chunks. Table 8 depicts the taxonomy of text deduplication techniques based on indexing, their descriptions and key findings.
(d) Taxonomy of text deduplication based on security Security is a key aspect of cloud storage systems. A deduplication technique has the authorization and authentication to access the complete storage system, so it requires a security framework to thwart attackers and prevent security breaches. In cloud storage systems, security is essential for data sharing, maintaining data confidentiality and integrity, preventing data leakage and offsite data storage facilities [75]. Table 9 presents the taxonomy of text deduplication techniques based on security, their descriptions and important findings; a sketch of convergent encryption, a recurring building block in these schemes, follows this list.
(e) Taxonomy of text deduplication based on cloud Data deduplication is widely employed in cloud-based storage systems to improve storage efficiency and storage cost. Network bandwidth, high throughput, computational overhead, deduplication efficiency and low energy consumption are the key challenges in applying data deduplication to cloud-based storage services. Table 10 shows different cloud-based text deduplication techniques suggested in different articles.
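Convergent encryption appears in several of the schemes in Tables 9 and 10 (e.g., Storer et al. [79]). The following minimal Python sketch shows the core idea under illustrative assumptions: it uses AES-GCM from the third-party cryptography package, with a key and nonce both derived deterministically from the chunk content:

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def convergent_encrypt(chunk):
        # The key is derived from the chunk content itself, so identical
        # plaintext chunks always produce identical ciphertexts and remain
        # deduplicable after encryption. The deterministic nonce is purely
        # for illustration of the convergent property.
        key = hashlib.sha256(chunk).digest()
        nonce = hashlib.sha256(b"nonce|" + chunk).digest()[:12]
        ciphertext = AESGCM(key).encrypt(nonce, chunk, None)
        locator = hashlib.sha256(ciphertext).hexdigest()   # index/lookup tag
        return locator, nonce, ciphertext

    loc1, _, ct1 = convergent_encrypt(b"same chunk")
    loc2, _, ct2 = convergent_encrypt(b"same chunk")
    assert ct1 == ct2 and loc1 == loc2   # the server can dedup the ciphertexts

Because two users storing the same chunk produce the same ciphertext, the server can deduplicate without ever seeing the plaintext, which is why convergent encryption is the usual starting point for secure deduplication schemes.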
Table 6 Taxonomy of text deduplication based on granularity

Author(s) | Technique | Description | Findings
Li et al. [38] | Reverse deduplication (RevDedup) | Hybrid inline and out-of-line deduplication technique; optimizes reads to the latest backups | Improves storage efficiency; high restore throughput for new backups
Lai et al. [51] | Near-exact defragmentation (NED) scheme | Identifies and rewrites fragments in cloud backup; defragmentation based on segment reference analysis | Improves restore performance on cloud backup
Wang et al. [57] | Sliding blocking algorithm with backtracking sub-blocks (SBBS) | Chunk-level duplication; the rsync rolling checksum algorithm is used for the weak hash check to match sliding blocks, the Adler-32 checksum for the weak hash check when backtracking sub-blocks, and the MD5 hash algorithm for the strong hash check | Improves duplicate detection precision; efficiently detects duplicate data in sub-blocks
Bobbarjung et al. [58] | Fingerdiff, supporting flexible variable chunks | Object partitioning technique; dynamically chooses a partitioning strategy; merges consecutive duplicate chunks into bigger ones | Improves storage and bandwidth utilization
Kruus et al. [59] | Content-defined chunking | Bimodal algorithms; chunk size varies dynamically; re-chunks the unique but duplicate-adjacent chunks | Achieves a reasonable duplicate elimination ratio
Lim et al. [60] | Duplicate-eliminated Flash File System (DeFFS) | Variable-sized blocks increase flexibility; non-overlapping duplicate chunking algorithm | Reduces duplicate writes of data and prolongs flash memory life cycles
Kaczmarczyk et al. [61] | Context-based rewriting (CBR) | Deduplication based on stream and disk context; rewrites highly fragmented duplicates | Improves restore performance and bandwidth
Wildani et al. [62] | Heuristically Arranged Non-Backup Inline Deduplication System (HANDS) | Scalable chunk-based deduplication; N-Neighborhood Partitioning (NNP) method groups correlated segments into the index cache | Reduces in-memory index storage; dynamically pre-fetches fingerprints from disk into the memory cache
Nam et al. [63] | CFL-SD (Cache-aware Chunk Fragmentation Level and Selective Deduplication) | Selectively deduplicates the input chunks based on chunk fragmentation level | Better read performance at a reasonable cost in write performance
Park et al. [64] | Lookahead read cache | Novel dedupe storage read cache design for a backup application; exploits future data chunk access patterns | Fast read performance
Xia et al. [65] | Deduplication-Aware Resemblance detection and Elimination scheme (DARE) | Exploits duplicate adjacency (DupAdj) for efficient resemblance detection in backups; improves the super-feature approach to enhance resemblance detection efficiency | Less computational and indexing overhead; high throughput
Fu et al. [66] | History-aware rewriting algorithm (HAR), cache-aware filter (CAF) | Reduces the fragmentation issue by exploiting cache knowledge; defragmentation by exploiting historical information of backup systems | Improves restore performance; reduces garbage collection overhead
Table 7 Taxonomy of text deduplication based on locality

Authors | Technique | Description | Findings
Srinivasan et al. [32] | iDedup | Spatial and temporal locality; selectively deduplicates sequences of disk blocks | Minimizes extra I/O seeks; reduces memory and CPU consumption
Lillibridge et al. [35] | Sparse indexing | Sampling and a sparse index to find similar segments; exploits the inherent locality within backup streams | Reduces in-memory dedup metadata size; better memory consumption; only a few seeks required
Zhu et al. [36] | Data Domain deduplication file system (DDFS) | Exploits an in-memory Bloom filter to identify new segments; stream-informed segment-oriented metadata pre-fetch and locality-preserved caching maintain the locality of fingerprints | High cache hit ratio by maintaining locality of fingerprints
Xia et al. [39] | Similarity-locality (SiLo)-based indexing | Exploits a similarity- and locality-based stateless algorithm to distribute and parallelize the data chunks to several backup nodes | High throughput; load balancing; low RAM overhead; improved index scalability
Fu et al. [67] | Scalable inline cluster-based deduplication | Exploits data locality and similarity using a hand-printing technique; based on a local stateful routing algorithm | High global deduplication effectiveness; high parallel deduplication throughput; low RAM usage in each node
Table 8 Taxonomy of text deduplication based on indexing

Authors | Technique | Description | Findings
Bhagwat et al. [69] | Extreme binning, distributed file backup system | Exploits file similarity; splits the chunk index into two tiers; file allocation through a stateless routing algorithm; uses a file representative index as the primary index | Reasonable throughput; scalable to multiple nodes; alleviates the disk bottleneck problem
Yang et al. [70] | ChunkFarm, post-process deduplication | Cluster of backup servers; index lookup in-batch (ILB) and index update in-batch (IUB) hash algorithms for fingerprint lookup exploit the memory cache | High write throughput and scalability
Min et al. [71] | Context-aware chunking | LRU-based index partitioning for efficient fingerprint lookup; Incremental Modulo-K (INC-K) for efficient chunking | Efficient lookup of fingerprints; reduces the computational overhead of signature generation
Guo and Efstathopoulos [72] | Single-node deduplication system | Progressive sampled indexing for fine-grained indexing; a grouped mark-and-sweep mechanism deals with the chunk garbage collection issue; minimizes disk seeks | Improves single-node scalability; optimizes throughput
Barreto et al. [73] | Hash challenges | Redundant chunks are identified by exchanging substantially less metadata; no additional complexity | Reduces communication overheads
Christen [74] | Indexing techniques for record linkage and deduplication | Survey of 12 variations of six indexing techniques using real and synthetic datasets; estimates good candidate record pairs; heuristic approaches split the database records into blocks | Scalability and performance were examined on various datasets
Table 9 Taxonomy of text deduplication based on security

Authors | Technique | Description | Findings
Harnik et al. [76] | Cross-user deduplication | Hybrid approach that sometimes turns off cross-user deduplication | Reduced risk of data leakage; preserves privacy of user data
Li et al. [77] | Dekey, ramp secret sharing scheme (RSSS) | Dekey uses RSSS to distribute CE keys to multiple servers for efficient CE key management | Incurs limited encoding/decoding overhead; preserves security and confidentiality of data
Liu et al. [78] | Proxy re-encryption | Enables different trust relations among cloud storage components; different users decrypt the shared deduplicated chunks and access the same data | Improves protection of user privacy
Storer et al. [79] | Convergent encryption | Encryption keys are generated from chunk data, so identical chunks encrypt to the same ciphertext. Each file is encrypted using a unique key. Asymmetric key pairs are used to manage the keys for security purposes | Space-efficient secure deduplication for single-server distributed storage systems
Li et al. [80] | Secret sharing scheme | Secret splitting technique protects data confidentiality; supports file- and block-level deduplication in distributed storage systems | Efficient deduplication with high reliability, data confidentiality and integrity, limited overhead
Vishalakshi et al. [81] | ClouDedup using convergent encryption | Block-level deduplication on encrypted files; a metadata manager (MM) takes care of the actual deduplication and key management operations | Secure and efficient cloud storage service
Bibawe and Baviscar [82] | Hybrid cloud architecture | Supports authorized duplicate check using a private cloud server; deduplication is done on encrypted data at the CSP | Low storage cost; flexibly supports access control on encrypted data; the proposed model is secure from insider and outsider attacks
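Several of the schemes in Table 9 [77,79,81] build on convergent encryption, where the encryption key is derived from the chunk contents itself, so identical plaintext chunks always yield identical ciphertexts and remain deduplicable. The following minimal Python sketch illustrates the idea; it is not the construction of any particular cited system, and the deterministic nonce derivation is an assumption made purely to keep the example self-contained (the `cryptography` package is assumed to be installed).

```python
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(chunk: bytes) -> tuple:
    """Encrypt a chunk under a key derived from its own contents."""
    key = hashlib.sha256(chunk).digest()        # convergent key = H(chunk)
    nonce = hashlib.sha256(key).digest()[:12]   # deterministic nonce (sketch only)
    ciphertext = AESGCM(key).encrypt(nonce, chunk, None)
    return key, ciphertext

# Identical chunks produce identical ciphertexts, so the storage server can
# deduplicate by hashing ciphertexts without ever learning the plaintext.
k1, c1 = convergent_encrypt(b"same chunk")
k2, c2 = convergent_encrypt(b"same chunk")
assert c1 == c2
```

Only a party that holds the plaintext can recompute the key and decrypt; schemes such as Dekey [77] then focus on managing these per-chunk convergent keys reliably.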
Table 10 Taxonomy of text deduplication based on cloud

Authors | Technique | Description | Findings
Zhao et al. [41] | Liquid | Scalable deduplication distributed file system for virtual machine images; fast VM deployment with peer-to-peer (P2P) data transfer | Avoids additional disk operations; good IO performance; low storage consumption
Lai et al. [51] | Near-exact defragmentation (NED) scheme | Identifies and rewrites fragments in cloud backup; defragmentation based on segment reference analysis | Improves restore performance on cloud backup
Mao et al. [52] | SAR, an SSD (solid-state drive)-assisted read scheme | Exploits SSDs by storing unique data chunks with high reference count; absorbs random reads to hard disks | Accelerates read performance significantly
Fu et al. [75] | Application-aware local-global (ALG) source deduplication | Intelligent chunking method to minimize computational overhead; hash functions based on application awareness; alleviates the disk lookup bottleneck | Optimizes lookup performance and storage cost; high deduplication efficiency and throughput
Wu et al. [83] | Deduplication-assisted primary storage systems in cloud-of-clouds (DAC) | Data reduction and data distribution approach to store data blocks across multiple cloud storage providers; a replication scheme is used to store highly referenced data blocks and an erasure scheme to store other data blocks | Improves storage efficiency and network bandwidth
Wang et al. [84] | I-sieve, based on iSCSI protocol | Designs novel index tables at block level; multi-level cache using solid-state drive | Reduces RAM consumption and optimizes lookup performance for small storage systems
Leesakul et al. [85] | Dynamic data deduplication | Whole-file hashing technique; maintains storage efficiency and quality of service for cloud storage systems | Handles the scalability issue; improves performance
Sun et al. [86] | Data deduplication over engineering-oriented cloud systems (DeDu) | Runs on commodity hardware; HDFS is used for the mass storage system and HBase for the fast indexing system | Fast indexing; efficient for data-intensive engineering applications
Neelaveni et al. [87] | File classifier-based linear indexing deduplication (FC-LID) | FC-LID uses linear hashing with representative group (LHRG) to develop an index system that overcomes the disk bottleneck problem | Less computational overhead and more efficient
Li et al. [88] | Convergent encryption in secure deduplication system | Secure deduplication in a cloud environment using encrypted keyword search; convergent keys are checked for integrity purposes | Privacy and security maintained in the cloud environment
Shin et al. [89] | Secure data deduplication for cloud storage | Survey of existing secure deduplication techniques based on cryptographic and security protocol solutions; the key design decisions for secure deduplication are data granularity, deduplication location, duplicate check boundary and system architecture | Identifies various security threats with regard to data confidentiality, integrity and availability in cloud storage for better efficiency
Pokale et al. [90] | DelayDedupe | Load balancing technique for file server and cloud storage server; chunks that are not frequently accessed are delayed for the deduplication process, and hot chunks are deduplicated first to reduce the response time | Effectively reduces the response time and balances the load of storage nodes, thus achieving better availability of data
Fig. 9 Evolution of text-based deduplication techniques
Fig. 10 Research contribution of five broad categories
Table 10 presents the taxonomy of text deduplication techniques based on cloud, along with their description and important findings.
Table 11 Parameters of deduplication techniques

Technique | Fixed-level chunking | Variable-level chunking | Spatial locality | Temporal locality | Full indexing | Partial indexing | Sparse indexing | Inline method | Offline method
Extreme binning [69] | Yes | No | No | No | No | No | Yes | Yes | No
DDFS [36] | No | Yes | Yes | No | Yes | No | No | Yes | No
ChunkStash [91] | No | Yes | Yes | No | Yes | No | No | Yes | No
Sparse indexing [35] | No | Yes | Yes | No | No | No | Yes | Yes | No
Guo and Efstathopoulos [72] | Yes | No | Yes | No | No | Yes | No | Yes | No
SiLo [39] | No | Yes | Yes | No | Yes | No | No | Yes | No
Σ-Dedupe [67] | No | Yes | Yes | No | Yes | No | No | Yes | No
Dong et al. [92] | No | Yes | Yes | No | Yes | No | No | Yes | No
ChunkFarm [70] | No | Yes | No | No | Yes | No | No | No | Yes
iDedup [32] | Yes | No | Yes | Yes | Yes | No | No | Yes | No
CBR [61] | Yes | No | Yes | Yes | Yes | No | No | Yes | No
Apart from the classification and discussion of text-based deduplication techniques, this paper presents the evolution of text-based deduplication techniques and helps researchers to find techniques in their sub-area. Figure 9 presents the evolution of major text-based deduplication techniques classified into five broad categories.
The articles on text-based deduplication are classified into five broad categories: granularity, locality, indexing, security and cloud. Figure 10 represents the percentage of research articles contributed in these five categories. Granularity has been a prime research focus of these articles as it has implicit performance enhancement capabilities. The growth of data in cloud has been the second major focus of researchers, followed by security, indexing and locality.
Exhaustive research has been done in defining the parameters of deduplication techniques according to the various taxonomies defined in Sect. 4, as listed in Table 11. Table 11 presents the parameters of deduplication techniques. The parameters of the selected techniques are fixed-level chunking, variable-level chunking, spatial locality, temporal locality, full indexing, partial indexing, sparse indexing, and the inline and offline methods based on timing. This table will help researchers to instantly compare deduplication techniques based on these parameters. The sketch below illustrates the variable-level (content-defined) chunking that most of these systems rely on.
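Variable-level chunking places chunk boundaries where a rolling hash of a small sliding window matches a predefined pattern, so an insertion early in a file only shifts nearby boundaries instead of all of them, which is the weakness of fixed-level chunking. The following Python sketch is a minimal illustration of this idea under assumed parameters (a 48-byte window and an average chunk size of about 8 KB); production systems typically use Rabin fingerprinting [68] rather than the simple polynomial rolling hash shown here.

```python
def cdc_chunks(data: bytes, window: int = 48, mask: int = (1 << 13) - 1,
               min_size: int = 2048, max_size: int = 65536) -> list:
    """Split data into content-defined chunks using a rolling hash."""
    B, M = 257, (1 << 61) - 1        # hash base and a Mersenne-prime modulus
    Bw = pow(B, window, M)           # factor for the byte leaving the window
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        h = (h * B + data[i]) % M    # slide the new byte into the hash
        if i - start >= window:      # drop the byte that left the window
            h = (h - data[i - window] * Bw) % M
        size = i - start + 1
        # cut when the hash matches the mask, respecting min/max chunk sizes
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # final partial chunk
    return chunks
```

Each chunk is then fingerprinted (e.g., with SHA-1 or SHA-256) and looked up in the fingerprint index; only chunks with unseen fingerprints are written to storage.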
4.6 Multimedia-based deduplication techniques
Recent developments and advancements in image and video retrieval systems have produced many methods to identify or extract information (features) from images or videos. Also, with the advent of the Internet, smartphones and social networking sites, large amounts of images and videos are being shared by users across the world.
Fig. 11 Image deduplication techniques
Table 12 Multimedia-based deduplication techniques

Authors | Technique | Description | Findings
Chen et al. [56] | Haar wavelet, B+ tree | Gray block features from images are used to construct a B+ tree index. Further, the finer-granular Haar wavelet algorithm is used to extract the edge information of the images for accuracy optimization. Centroid selection is performed and, finally, duplicate images are detected | Higher deduplication rate, but scalability is a major challenge
Ramaiah et al. [94] | Content-based image retrieval (CBIR), histogram refinement | Extracts features using histogram refinement to eliminate duplicate ration cards based on family photographs at district level; a K-means-based clustering algorithm is used to speed up the deduplication process | Deduplication process is dependent on human intervention
Zargar et al. [95] | Content-based image retrieval (CBIR), block truncation coding (BTC) | Photo-based deduplication process to find duplicate electricity bills in a large-scale database. Photographs are divided into different blocks using BTC; images of the same size are put in the same cluster | Better space utilization
Hua et al. [96] | Principal component analysis (PCA)–scale-invariant feature transform (SIFT), difference-of-Gaussians (DoG) | SmartEye, in-network coarse-grained deduplication. DoG and PCA-SIFT are used for feature extraction of images, and these features are hashed into a space-efficient Bloom filter. Locality-sensitive hashing (LSH) identifies similar images based on correlated features | Obtains energy savings and improves bandwidth efficiency
Li et al. [97] | Secure perceptual similarity deduplication scheme (SPSD) | To detect similarity between duplicate images, a perceptual hash algorithm is used to generate signatures of images | Achieves high deduplication and storage and bandwidth savings
Deshmukh et al. [98] | MapReduce technique | Fast duplicate image identification systems using MapReduce and Pearson correlation techniques | Reduces the time required to detect duplicate images; improves efficiency and reliability of the system
Fatema et al. [99] | Set partitioning in hierarchical trees (SPIHT) | The SPIHT algorithm compresses the image and uses partial encryption to protect an image from the cloud service provider (CSP). Unique hashes are generated based on SPIHT wavelet coefficients to perform secure image deduplication | No extra computational overhead for image encryption
Zheng et al. [100] | Scalable video coding (SVC) | Implemented an encrypted cloud media center that supports secure video deduplication and host-encrypted video coding techniques | Scalability of SVC is an issue
Yang et al. [101] | Local-difference-pattern (LDP), LSH (video) | Images or video frames are represented by a local-feature-based framework; LSH provides the indexing structure to detect near-exact images and videos | Less processing time and low storage overhead
Shen et al. [48] | Near-duplicate video clip (NDVC) detection | UQLIPS is a fast and robust NDVC detection system based on visual content using the bounded coordinate system (BCS) and frame symbolization (FRAS); a K-nearest neighbor algorithm is applied for similarity search | High accuracy and fast enough to support real-time search
Naturel et al. [102] | Fast shot-based method using discrete cosine transform (DCT) | Detects duplicate sequences from video shots in television broadcasts. Video is segmented into shots (frames) and a frame signature is computed for exact retrieval | Detects duplicate shots in very little time
Katiyar et al. [46] | ViDedup (2011) | An application-aware deduplication system for compressing videos that detects visual redundancy in content rather than at byte level | Scalability is a major challenge to handle
Li et al. [103] | Video deduplication with privacy preserving | Video frames are divided into fixed-size blocks, encrypted and uploaded to the cloud; identical blocks on the server side are deduplicated | Saves a lot of storage space
Flickr has around 6 billion images [93], and about 0.25 billion images are uploaded to Facebook daily. Most of the images uploaded are either modified, forwarded or copied [93], which leads to a large amount of duplicate or near-duplicate images on the web. These duplicate images waste costly space in the storage system [54]. An efficient technique is therefore needed to remove these duplicates from the storage system. Image deduplication is one such technique, which helps in the removal of duplicate images from the storage system. Different image deduplication techniques have been discussed based on their characteristics, such as image feature extraction, image hashing algorithms for indexing and distance measures to detect similarity between images or videos. Further, we explain the different feature extraction techniques used to extract image features and analyze exact or near-exact image duplicates. Figure 11 defines the technique applied to detect exact or near-exact image duplicates. The technique remains the same for exact and near-exact image deduplication; the only difference lies in the storage of the image transformation for near-exact images.
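To make the pipeline of Fig. 11 concrete, the following Python sketch computes a simple perceptual fingerprint (a difference hash) and compares two images by Hamming distance. It is an illustrative stand-in for the perceptual hashing used, e.g., in SPSD [97], not the algorithm of any cited system; the Pillow package and the file names are assumptions.

```python
from PIL import Image  # assumes the Pillow package is installed

def dhash(path: str, hash_size: int = 8) -> int:
    """Difference hash: a 64-bit perceptual fingerprint of an image."""
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    px = list(img.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = px[row * (hash_size + 1) + col]
            right = px[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | int(left > right)  # 1 if brightness drops
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bit positions between two fingerprints."""
    return bin(a ^ b).count("1")

# Resized, recompressed or slightly edited copies keep a similar fingerprint,
# so a small Hamming distance flags near-duplicates (the threshold is tunable).
near_dup = hamming(dhash("a.jpg"), dhash("b.jpg")) <= 5
```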
4.6.1 Multimedia-based deduplication techniques
Table 12 discusses the findings of existing image and video deduplication techniques and their key features, covering exact image deduplication, exact video deduplication, near-duplicate video detection and security-based image deduplication. Deduplication is applied on images to detect similar images and on videos to detect similar frames, using parameters such as feature extraction algorithms (e.g., SIFT, SURF, PCA-SIFT and BRISK), hashing algorithms to generate hashes of the features extracted from images, and distance measures to check the similarity between two images against some threshold; a sketch of such feature-based matching follows below.
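As an illustration of the feature-extraction and distance-measure pipeline just described, the sketch below matches two images with BRISK [120] binary descriptors and a Hamming-distance matcher using OpenCV. The ratio-test value and the decision threshold are assumptions for the example, not values from any cited paper, and the `opencv-python` package is assumed.

```python
import cv2  # assumes the opencv-python package

def brisk_similarity(path_a: str, path_b: str) -> float:
    """Fraction of BRISK descriptor matches that pass the ratio test."""
    brisk = cv2.BRISK_create()
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    _, des_a = brisk.detectAndCompute(img_a, None)
    _, des_b = brisk.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0.0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)  # Hamming for binary descriptors
    pairs = matcher.knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) / max(len(des_a), 1)

# Images whose similarity exceeds a tuned threshold (e.g., 0.25) would be
# treated as near-duplicates and deduplicated.
```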
Multimedia-based deduplication techniques face challenges in scalability, accuracy and performance. Feature extraction techniques like SIFT, SURF, PCA-SIFT and BRISK need further enhancements to achieve scalability, accuracy and performance. So there is a need for an autonomic, scalable and efficient multimedia deduplication technique.
The articles are classified into three broad categories: image-based deduplication, video-based deduplication and secure image deduplication techniques. Figure 12
Fig. 12 Research contribution of image (50%), video (42%) and secure image (8%) deduplication techniques
Table 13 Techniques to detect exact or near-exact images

Authors | Technique | Description | Findings
Velmurugan et al. [104] | Speeded up robust features (SURF) and K-dimensional (Kd) tree | Features are extracted using the SURF algorithm; a Kd-tree with the Best Bin First (BBF) algorithm is used for indexing and finding similarity between features | Improves average precision in detecting matched images
Li et al. [105] | SURF and dense descriptor applied to wide-baseline stereo (DAISY) | SURF features and the DAISY descriptor are combined for image matching | Improves matching accuracy on rotation invariance, but not good for large-scale image variation
Lei et al. [106] | Kd-trees | Proposed a novel indexing structure, i.e., a cluster of uniform randomized trees, for fast near-duplicate image detection | Improves detection efficiency and search space
Dong et al. [107] | Scale-invariant feature transform (SIFT) | Proposed an entropy-based filtering method for extracting SIFT features and a query expansion method to detect near duplicates | Scalable using Hadoop commodity servers
Ke et al. [108] | DoG, PCA-SIFT and LSH | Proposed a system using DoG and PCA-SIFT to detect near duplicates; LSH is used for efficient similarity search | High computational overhead when the database grows
Thomee et al. [109] | Content-based image detection | A comparative study on detecting duplicates using content-based duplicate image detection; to assess feasibility, large-scale images are taken from the web | Comparative analysis based on descriptor size, description time and matching time
Foo et al. [110] | SICO (Similar Image Collator), PCA-SIFT | SICO, a novel system using PCA-SIFT to extract features, LSH for indexing these features, and hash-based probabilistic counting to detect near-duplicate images | Effective and efficient method
Chum et al. [111] | SIFT and k-means clustering | Proposed a min-hash algorithm using a similarity measure to detect near-duplicate images | No extra computational cost and improves search efficiency
Li et al. [112] | SIFT, k-means clustering and LSH | SIFT features are extracted and clustered into several clusters using the k-means algorithm; LSH is applied for indexing, and histogram distance is used to detect near-duplicate images | Effective method to detect near duplicates
Seo et al. [113] | Radon transform, Hamming distance | Proposed an image fingerprinting method using the Radon transform and perceptual hashing for multimedia content; Hamming distance is used for fingerprint matching | Highly robust against affine transformation
Yu et al. [114] | SIFT technique | Proposed a SIFT-based algorithm and geometry isomorphic relationship to detect homology between images | Robust to image embedding
Gavrielides et al. [115] | Histogram technique, different quantization methods | Image fingerprinting method to extract robust and unique image descriptors by using color-based descriptors | Better performance of the system using color descriptors
Nikolaidis et al. [116] | R-trees, linear discriminant analysis (LDA) | Proposed a color-based descriptor method of image and video fingerprinting using R-trees and LDA; frame-based voting is also applied for video fingerprinting | Efficient for digital rights management of images and video
Nian [117] | Local-based binary representation (LBR), binary pattern, histogram | Proposed LBR, a compact image representation method using a binary vector and histogram for online near-duplicate image detection | Robust to image variations; low computational speed and better performance
Srinivasan et al. [118] | Fourier–Mellin transform (FMT) | Presented an image fingerprinting technique using the Fourier–Mellin transform to detect near-duplicate images | Fast, accurate and highly scalable technique
Yao et al. [119] | Contextual descriptors | Proposed a new contextual descriptor that encodes the spatial relations of the context and measures the contextual similarity of images | Better performance; effective and efficient method for large-scale image retrieval
Leutenegger et al. [120] | BRISK | Proposed a fast novel method to detect keypoints in a continuous scale space, with a binary-string descriptor to match similar images using Hamming distance | Quality matches in less time; applicable to hard real-time constrained tasks
Chen et al. [121] | SIFT | Proposed a fast image retrieval method that uses binarized SIFT features and hashing; Hamming distance is calculated for matching similar images | Low complexity and fast retrieval time
Huang et al. [122] | Image relational graph (IRG) | Proposed a novel method to detect near-duplicate images using local and global features; a PageRank graph model-based link analysis algorithm is used to analyze the contextual relationship between images | Effective method to detect near-duplicate images
Wang et al. [123] | MapReduce | Combines both local and global image descriptors for large-scale duplicate discovery; global features discover seed image clusters and local features merge the clusters to identify near duplicates | Effective method for large-scale image data in terms of high accuracy and recall
Zhao et al. [124] | Harris and SIFT | Proposed an improved Harris corner detector and SIFT using the nearest neighbor algorithm for feature matching | High matching accuracy and increased matching speed
Lu et al. [125] | Discrete cosine transform (DCT), Harris corner detector | Proposed mesh-based robust image hashing using a DCT-based hash extraction method; a hash database is constructed for error-resilient and fast searching | Improves image hashing resistance to geometric distortions
Lei et al. [126] | Radon transform, discrete Fourier transform (DFT) | Proposed a hashing method for image authentication based on the Radon transform and DFT | Robust to content-preserving operations
Hua et al. [127] | Features from accelerated segment test (FAST), DoG, PCA-SIFT, LSH, Bloom filter | A novel near-real-time method based on DoG and PCA-SIFT to detect image features; a Bloom filter is used for compact representation of these features, and LSH is used to detect similar images using the correlation property | Reduces the processing latency of parallel queries
Fig. 13 Evolution of image deduplication techniques
presents the percentage contribution of the research articles in these three categories. Exact image deduplication has been the prime focus of these articles, followed by video duplicate detection.
4.6.2 Techniques to detect exact or near-exact images
Image fingerprinting is a prerequisite for matching similar images. Table 13 surveys the image feature extraction and hashing techniques used in image processing to detect exact or near-exact images. These techniques detect duplicate images and their slight variations. This study brings out findings that will help researchers in developing an efficient deduplication technique for near-exact images and video. Table 13 depicts the techniques to detect exact and near-exact images, their description and their key findings.
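Many of the systems in Table 13 rely on LSH to avoid comparing every fingerprint against every other. The sketch below shows the classic banding idea for Hamming-space fingerprints (such as 64-bit perceptual hashes): fingerprints are bucketed by each of several bit bands, and only images sharing a bucket become candidate pairs. The band count is an assumed tuning parameter, and this illustrates the general LSH principle rather than the indexing of any specific cited system.

```python
from collections import defaultdict

def lsh_candidate_pairs(fingerprints: dict, bands: int = 4, bits: int = 64) -> set:
    """Bucket fingerprints by bit bands; pairs sharing any bucket are candidates."""
    band_bits = bits // bands
    buckets = defaultdict(list)
    for name, fp in fingerprints.items():
        for b in range(bands):
            # extract band b of the fingerprint as the bucket key
            key = (b, (fp >> (b * band_bits)) & ((1 << band_bits) - 1))
            buckets[key].append(name)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs  # verify candidates with a full Hamming-distance check

# With 4 bands, two 64-bit fingerprints that differ in fewer than 4 bits are
# guaranteed (by pigeonhole) to collide in at least one band.
```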
Apart from the classification of multimedia-based deduplication techniques and their discussion, this paper presents the evolution of multimedia-based deduplication techniques and helps researchers to find techniques in their sub-area. Figure 13 represents the evolution of major multimedia-based deduplication techniques classified into image
Table 14 Deduplication tools and technologies
Tool Description
StorReduce of Amazon Deduplication tool for big data from Amazon
StorSimple of Microsoft Azure Block-level deduplication for cloud storage system from Microsoft
Avamar of EMC Variable-length deduplication from EMC
Quantum Inline dedupe for backup
SSET of NetApp Space Savings Estimation Tool (SSET) from NetApp
Data Domain of EMC Inline deduplication tool from EMC
WinPure Standalone software to clean up databases
Sepaton Sepaton DeltaStor deduplication software
Netrics Netrics helps to clean up projects and duplicate records
Revinetix Revinetix works at the file level on backup
Exagrid Deduplication tool for backup
CommVault Deduplication tool for backup for heterogeneous storage infrastructure
deduplication, exact or near-exact image detection and video deduplication techniques. Video deduplication techniques are based on exact video deduplication and near-exact video duplicate detection. Figure 13 shows the major contributions made in different years by various researchers and the techniques they followed.
Deduplication in video and images came into existence in 2011 [46,94]. It has been identified that video deduplication techniques are applied on video frames: the frames are treated as images, and image deduplication techniques are applied to the video data. The article [48] published in 2007 detects near-duplicate video frames; it is simply a near-duplicate detection technique. Near-duplicate video detection research articles and video deduplication research articles are cited in Fig. 13. A frame-level sketch of this reuse of image techniques on video follows below.
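Since video deduplication commonly reduces to fingerprinting sampled frames, the following Python sketch samples every step-th frame of a video and computes the same difference-hash fingerprint used above for still images. The sampling step is an assumed parameter and the sketch does not reproduce any specific cited system (the `opencv-python` package is assumed).

```python
import cv2  # assumes the opencv-python package

def frame_signatures(video_path: str, step: int = 30) -> list:
    """Fingerprint every `step`-th frame so image-level deduplication
    techniques can be reused on video data."""
    sigs, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            small = cv2.resize(gray, (9, 8))  # 9x8 grid for a 64-bit dHash
            bits = 0
            for r in range(8):
                for c in range(8):
                    bits = (bits << 1) | int(small[r, c] > small[r, c + 1])
            sigs.append(bits)
        idx += 1
    cap.release()
    return sigs

# Two clips sharing many frame signatures (under a small Hamming distance)
# are candidate duplicates, mirroring the shot-signature idea of [102].
```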
This research article discussed various taxonomies of deduplication techniques. Apart from this, the paper has identified deduplication tools provided by various storage enterprises. Table 14 lists these deduplication tools with a brief description of each.
5 Discussion
This survey paper presents recent research work related to deduplication techniques and is supplementary to the previous surveys. Deduplication techniques have been explored in detail and presented based on various taxonomies. An effort has been made to address the research issues that still remain unresolved in deduplication techniques for large distributed storage systems.
In the recent past, many deduplication techniques for storage systems have gained attention and new techniques have evolved. Work on optimizing chunk size or granularity, efficient detection of duplicate data in a cloud storage system, privacy, security, and performance enhancement through indexing in deduplication techniques is expected to remain the focus area in coming years. Work on the scalability of distributed deduplication, fragmentation, disk bottleneck, I/O latency and performance enhancement
in duplicate detection of multimedia data constitutes an emerging field of study. A scalable, robust, efficient and distributed deduplication technique is required for a cloud storage system.
This survey presents a review of 128 research articles in a systematic and categorized manner. It also presents the latest research work on text-based and multimedia-based deduplication techniques. The challenges and future directions for next-generation efficient deduplication techniques for big data in cloud have also been discussed for researchers in industry and academia.
5.1 Open challenges and future research directions
Based on the various issues discussed in the existing literature, some of the challenges in deduplication techniques have been identified and are discussed below.
(i) Exact or near-exact image deduplication for large-scale distributed storage systems Most social media images are either slightly modified or exact copies of an original image. The large number of duplicate or near-duplicate images requires huge storage and impacts the performance and cost of a storage system. Exact or near-exact image duplicate detection in a large distributed storage system is an open research challenge, as such detection requires additional CPU, memory and bandwidth. Therefore, an efficient, real-time deduplication technique for exact and near-exact images is a major challenge in distributed image storage systems.
(ii) Storing the transformation of near-exact images in a large-scale storage system Near-exact images are modified versions of an original image; therefore, it is not advisable to store the near-exact images themselves. Only the transformation of a near-exact image needs to be stored, and the image is reconstructed on online application calls. At present, storing the transformations of near-exact images in a large distributed storage system is a huge challenge in itself.
(iii) Performance issue for distributed storage deduplication In distributed deduplication techniques, deduplication is done on a distributed storage system using either inline or offline approaches. The lookup for duplicate chunks is a resource-intensive task in a large distributed storage system; these lookups increase the write latency of chunks [32]. Applying the deduplication technique at write time reduces the write performance in terms of chunks per second (the first sketch after this list illustrates this write path on a single node). Offline deduplication has been widely applied, running as a background service, but it requires extra temporary storage and increases the I/O bandwidth [5]. Inline distributed deduplication and the optimal use of resources in offline deduplication remain open challenges, and the problem is even more complex in a large scalable storage system.
(iv) Optimization technique for chunk size A file is divided into small chunks whose size varies from 4 KB to 256 MB. These chunk details are indexed and cached for better performance. Smaller chunks save space, yet they generate large hash tables. On the other hand, choosing a large chunk size decreases the number of hash entries but increases the wastage of storage and takes a long time to compare. So, it
makes the process of deduplication even more resource intensive. Variable-size chunking can also degrade performance, as it generates large index structures. The choice of chunk size or granularity is an open problem; an efficient method is needed to calculate the optimum chunk size, which would improve the overall performance of the system.
(v) Disk bottleneck problem The data deduplication technique is mostly applied on disk-based [2] secondary storage systems. Although storage systems are expanding, disk I/O operations [2] still have performance issues [67]. To increase the data streaming rate, a file is divided into chunks that are distributed across multiple nodes in a distributed environment; storing chunks on distributed nodes helps to overcome the disk bottleneck problem. Novel data distribution techniques are evolving that will trigger the requirement of new deduplication techniques.
(vi) Throughput and latency A file is broken into smaller chunks, and the metadata of each chunk are indexed and kept in memory for the best possible performance. Every new incoming chunk is checked against a large list of chunk indices, so the number of disk I/O operations is large. Thus, fingerprint indexing has become a bottleneck in efficient deduplication systems, which has a negative impact on throughput [128] and increases the latency of write operations (the second sketch after this list shows the Bloom-filter summary commonly used to cut these index lookups). To meet the requirements of increasing dataset sizes and deduplication scalability, a system should use parallel and multiple streams on distributed storage nodes, which will increase the system throughput. So, work can be done to optimize deduplication throughput and latency in distributed storage systems.
(vii) Fragmentation issue Data deduplication causes fragmentation on disk, which reduces the performance of read operations [32,61]. It increases the lookup time for sequential reads of the same data, and extra disk I/O is required to access on-disk metadata. Deduplication thus results in data fragmentation that needs to be addressed carefully. Fragmentation is also a major issue for long retention periods and can reduce the locality of reference. Better handling of the fragmentation inherent in disk-based data deduplication is required.
(viii) Scalability and performance of deduplication The main challenges in deduplication techniques are scalability and performance. Each chunk is compared with every other chunk in a large-scale storage system, and if a match is found, the duplicate is deleted as per the replication policy. However, complete matching of chunks becomes difficult as the system grows. A centralized index has its own issues and is a bottleneck for throughput, scalability and availability (the third sketch after this list shows a stateless rule for partitioning the index across nodes). As storage requirements grow rapidly, applying efficient distributed deduplication techniques on large-scale distributed storage systems poses a great challenge.
(ix) Privacy and security In large distributed storage systems, both data and metadata are distributed to achieve scalability and availability. In such a distributed system, a security framework is required before distributed deduplication techniques can be employed, to guard against theft and attacks and to adhere to the regulatory compliances for privacy and security.
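For challenge (iii), the following toy Python sketch shows the single-node core of an inline write path: each chunk is fingerprinted before it is written, and duplicates are replaced by logical pointers. It is a deliberately simplified illustration (an in-memory index, no persistence or concurrency), not the design of any cited system.

```python
import hashlib

class InlineDedupStore:
    """Toy inline deduplication: fingerprint on the write path, store once."""

    def __init__(self):
        self.index = {}    # fingerprint -> chunk id (logical pointer target)
        self.chunks = []   # simulated physical chunk store

    def write(self, chunk: bytes) -> int:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp in self.index:
            return self.index[fp]      # duplicate: only a pointer is kept
        self.chunks.append(chunk)      # unique: pay the write cost inline
        self.index[fp] = len(self.chunks) - 1
        return self.index[fp]

store = InlineDedupStore()
assert store.write(b"block A") == store.write(b"block A")  # second write is free
```

The index lookup on every write is exactly where the write latency discussed in [32] comes from; offline schemes defer it at the price of temporary storage and extra I/O bandwidth.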
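For challenge (vi), a compact in-memory summary can answer "definitely new" without touching the on-disk index, in the spirit of the summary vector of DDFS [36]. The sketch below is a generic Bloom filter with assumed sizing parameters, not the exact structure used in [36].

```python
import hashlib

class BloomFilter:
    """Probabilistic set: 'no' answers are exact, 'yes' answers may be false."""

    def __init__(self, size_bits: int = 1 << 20, hashes: int = 4):
        self.size, self.k = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):  # derive k bit positions per item
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# On the write path: if might_contain(fingerprint) is False, the chunk is
# certainly new and the expensive on-disk index lookup is skipped entirely.
```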
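For challenge (viii), one way to avoid a centralized index is a stateless routing rule that every client computes identically, in the spirit of the stateless routing used by Extreme Binning [69]. The rendezvous (highest-random-weight) hashing shown below is one such rule, chosen here for illustration rather than taken from any cited system; the node names and fingerprint are hypothetical.

```python
import hashlib

def route_fingerprint(fingerprint: str, nodes: list) -> str:
    """Map a fingerprint to one node; each node owns a disjoint index slice."""
    def weight(node: str) -> bytes:
        return hashlib.sha256(f"{node}:{fingerprint}".encode()).digest()
    return max(nodes, key=weight)  # the node with the highest hash weight wins

nodes = ["node-1", "node-2", "node-3"]           # hypothetical backup nodes
owner = route_fingerprint("ab12cd34", nodes)     # hypothetical fingerprint
# Adding or removing a node remaps only about 1/n of the fingerprints, so the
# distributed index can grow without a central coordinator.
```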
6 Conclusion
Data deduplication and cloud computing have emerged as the hottest trends in today's scenario. The growing demand in cloud computing for efficient storage of digital data in large storage systems has led to a rise in the demand for data deduplication. The huge data load on storage systems is sharpening the focus on the development of novel techniques to remove duplicate data. With the evolution of data deduplication and its various techniques, cloud computing has the potential to remove unnecessary duplicates. Deduplication is an important technique for reducing storage cost, bandwidth and energy consumption.
This research article presented a methodical survey on data deduplication techniques that comprehensively analyzes and reviews them. We have also studied a few recent surveys encompassing similar topics. Existing surveys in this field have focused on deduplication techniques based on storage types, whereas the primary focus of this survey is to explore text- and multimedia-based deduplication techniques. Based on the analysis, deduplication poses many challenges that are still to be addressed in text- and multimedia-based deduplication.
For improving the storage in cloud storage systems, the following research challenges need to be solved:
1. There is a need to develop an efficient inline data deduplication technique with optimum use of resources in a cloud storage system.
2. To make data deduplication cost effective and energy efficient in terms of space, there is a need to develop an efficient deduplication technique with optimal use of CPU, memory and network resources.
3. To resolve the issue of fingerprint indexing in memory, the system should parallelize backup streams to multiple nodes for efficient deduplication.
4. Disk I/O bottleneck is one of the important issues in storage systems that affect performance. To resolve this issue, distributed multi-node deduplication techniques need to be evolved.
5. Security and privacy are also very necessary in the cloud environment to meet compliance requirements.
Based on the literature survey of deduplication techniques, it has been observed that distributed deduplication on cloud storage systems is a promising field of research. The gaps analyzed above necessitate devising techniques that reduce storage space, bandwidth, the number of disks used, energy consumption costs and heat emissions. In future, work on a scalable, robust, efficient and distributed deduplication technique for cloud storage systems will remain in focus.
Acknowledgements This research was supported by the Department of Science and Technology, Government of India under the WOS (Women Scientists Scheme) sponsored research project entitled “Distributed Data Deduplication Technique for efficient Cloud Based Storage System” under File No: SR/WOS-A/ET-119/2016.
References
1. Gu M, Li X, Cao Y (2014) Optical storage arrays: a perspective for future big data storage. Light Sci Appl 3(5):e177. https://doi.org/10.1038/lsa.2014.58
2. Tian Y, Khan SM, Jiménez DA, Loh GH (2014) Last-level cache deduplication. In: Proceedings of the 28th ACM International Conference on Supercomputing, pp 53–62. https://doi.org/10.1145/2597652.2597655
3. Hovhannisyan H, Qi W, Lu K, Yang R, Wang J (2016) Whispers in the cloud storage: a novel cross-user deduplication-based covert channel design. Peer-to-Peer Networking and Applications, pp 1–10. https://doi.org/10.1007/s12083-016-0483-y
4. Mandagere N, Zhou P, Smith MA, Uttamchandani S (2008) Demystifying data deduplication. In: Proceedings of the ACM/IFIP/USENIX Middleware'08 Conference Companion, pp 12–17. https://doi.org/10.1145/1462735.1462739
5. Paulo J, Pereira J (2014) A survey and classification of storage deduplication systems. ACM Comput Surv (CSUR) 47(1):1–30. https://doi.org/10.1145/2611778
6. Mao B, Jiang H, Wu S, Fu Y, Tian L (2014) Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans Storage (TOS) 10(2). https://doi.org/10.1145/2512348
7. Di Pietro R, Sorniotti A (2016) Proof of ownership for deduplication systems: a secure, scalable, and efficient solution. Comput Commun 82:71–82. https://doi.org/10.1016/j.comcom.2016.01.011
8. Wang J, Chen X (2016) Efficient and secure storage for outsourced data: a survey. Data Sci Eng 1(3):178–188. https://doi.org/10.1007/s41019-016-0018-9
9. Chen CP, Zhang CY (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347. https://doi.org/10.1016/j.ins.2014.01.015
10. Venish A, Sankar KS (2015) Framework of data deduplication: a survey. Indian J Sci Technol. https://doi.org/10.17485/ijst/2015/v8i26/80754
11. Xia W, Jiang H, Feng D, Douglis F, Shilane P, Hua Y, Fu M, Zhang Y, Zhou Y (2016) A comprehensive study of the past, present and future of data deduplication. Proc IEEE 104(9):1681–1710. https://doi.org/10.1109/JPROC.2016.2571298
12. Maan AJ (2013) Analysis and comparison of algorithms for lossless data compression. Int J Inf Comput Technol 3(3):139–46
13. Xia W, Jiang H, Feng D, Tian L, Fu M, Zhou Y (2014) Ddelta: a deduplication-inspired fast delta compression approach. Perform Eval 79:258–272. https://doi.org/10.1016/j.peva.2014.07.016
14. Shanmugasundaram S, Lourdusamy R (2011) A comparative study of text compression algorithms. Int J Wisdom Based Comput 1(3):68–76
15. Bhadade US, Trivedi AI (2011) Lossless text compression using dictionaries. Int J Comput Appl 13(8):27–34
16. Witten IH, Neal RM, Cleary JG (1987) Arithmetic coding for data compression. Commun ACM 30(6):520–40. https://doi.org/10.1145/214762.214771
17. Brereton P, Kitchenham BA, Budgen D, Turner M, Khalil M (2007) Lessons from applying the systematic literature review process within the software engineering domain. J Syst Softw 80(4):571–83. https://doi.org/10.1016/j.jss.2006.07.009
18. Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S (2009) Systematic literature reviews in software engineering—a systematic literature review. Inf Softw Technol 51(1):7–15. https://doi.org/10.1016/j.infsof.2008.09.009
19. Gantz J, Reinsel D (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. In: IDC iView: IDC Analyze the Future, pp 1–6. http://www.emc.com/collateral/analyst-reports/idc-digital-universe-united-states.pdf
20. Reed DA, Dongarra J (2015) Exascale computing and big data. Commun ACM 58(7):56–68. https://doi.org/10.1145/2699414
21. Barreto J, Ferreira P (2009) Efficient locally trackable deduplication in replicated systems. In: Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware. Springer-Verlag New York, Inc., USA, p 6
22. Meyer DT, Bolosky WJ (2012) A study of practical deduplication. ACM Trans Storage (TOS). https://doi.org/10.1145/2078861.2078864
23. Borges EN, de Carvalho MG, Galante R, Gonçalves MA, Laender AH (2011) An unsupervised heuristic-based approach for bibliographic metadata deduplication. Inf Process Manag 47(5):706–718. https://doi.org/10.1016/j.ipm.2011.01.009
24. Alvarez C (2011) NetApp deduplication for FAS and V-Series deployment and implementation guide. Technical Report TR-3505
25. Xu J, Zhang W, Zhang Z, Wang T, Huang T (2016) Clustering-based acceleration for virtual machine image deduplication in the cloud environment. J Syst Softw 121:144–156. https://doi.org/10.1016/j.jss.2016.02.021
26. Paulo J, Pereira J (2014) Distributed exact deduplication for primary storage infrastructures. In: Magoutis K, Pietzuch P (eds) Distributed applications and interoperable systems, DAIS 2014. LNCS, vol 8460. Springer, Heidelberg. https://doi.org/10.1007/978-3-662-43352-2_5
27. Banu AF, Chandrasekar C (2012) A survey on deduplication methods. Int J Comput Trends Technol 3(3):364–368
28. He Q, Li Z, Zhang X (2010) Data deduplication techniques. IEEE Int Conf Future Inf Technol Manag Eng (FITME) 1:430–433. https://doi.org/10.1109/FITME.2010.5656539
29. Zhou R, Liu M, Li T (2013) Characterizing the efficiency of data deduplication for big data storage management. In: IEEE International Symposium on Workload Characterization (IISWC), pp 98–108. https://doi.org/10.1109/IISWC.2013.6704674
30. Ahmad RW, Gani A, Ab. Hamid SH et al (2015) Virtual machine migration in cloud data centers: a review, taxonomy, and open research issues. J Supercomput 71(7):2473–2515. https://doi.org/10.1007/s11227-015-1400-5
31. Hu Y, Li C, Liu L, Li T (2016) Hope: enabling efficient service orchestration in software-defined data centers. In: Proceedings of the 2016 International Conference on Supercomputing, p 10. ACM. https://doi.org/10.1145/2925426.2926257
32. Srinivasan K, Bisson T, Goodson GR, Voruganti K (2012) iDedup: latency-aware, inline data deduplication for primary storage. In: Proceedings of the USENIX Conference on File and Storage Technologies, vol 12, pp 24–24
33. Mao B, Jiang H, Wu S, Tian L (2016) Leveraging data deduplication to improve the performance of primary storage systems in the cloud. IEEE Trans Comput 65(6):1775–1788. https://doi.org/10.1109/TC.2015.2455979
34. Kim C, Park KW, Park KH (2012) GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system. In: Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores. ACM, pp 17–26. https://doi.org/10.1145/2141702.2141705
35. Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezis G, Camble P (2009) Sparse indexing: large scale, inline deduplication using sampling and locality. In: Proceedings of the 7th USENIX Conference on File and Storage Technologies, vol 9, pp 111–123
36. Zhu B, Li K, Patterson RH (2008) Avoiding the disk bottleneck in the data domain deduplication file system. Proc USENIX Conf File Storage Technol 8:1–14
37. Dubnicki C, Gryz L, Heldt L, Kaczmarczyk M, Kilian W, Strzelczak P, Szczepkowski J, Ungureanu C, Welnicki M (2009) HYDRAstor: a scalable secondary storage. In: 7th USENIX Conference on File and Storage Technologies (FAST), vol 9, pp 197–210
38. Li YK, Xu M, Ng CH, Lee PP (2015) Efficient hybrid inline and out-of-line deduplication for backup storage. ACM Trans Storage (TOS) 11(1):1–21. https://doi.org/10.1145/2641572
39. Xia W, Jiang H, Feng D, Hua Y (2015) Similarity and locality based indexing for high performance data deduplication. IEEE Trans Comput 64(4):1162–1176. https://doi.org/10.1109/TC.2014.2308181
40. Ng CH, Ma M, Wong TY, Lee PP, Lui J (2011) Live deduplication storage of virtual machine images in an open-source cloud. In: Proceedings of the 12th International Middleware Conference. International Federation for Information Processing, pp 80–99
41. Zhao X, Zhang Y, Wu Y, Chen K, Jiang J, Li K (2013) Liquid: a scalable deduplication file system for virtual machine images. IEEE Trans Parallel Distrib Syst 25(5):1257–1266. https://doi.org/10.1109/TPDS.2013.173
42. Waldspurger CA (2002) Memory resource management in VMware ESX server. In: ACM Proceedings of the 5th Symposium on Operating Systems Design and Implementation, SIGOPS, vol 36(SI), pp 181–194. https://doi.org/10.1145/844128.844146
43. Clements AT, Ahmad I, Vilayannur M, Li J (2009) Decentralized deduplication in SAN cluster file systems. In: USENIX Annual Technical Conference, pp 101–114
44. Anand A, Sekar V, Akella A (2009) SmartRE: an architecture for coordinated network-wide redundancy elimination. ACM SIGCOMM Comput Commun Rev 39(4):87–98. https://doi.org/10.1145/1594977.1592580
45. Agarwal B, Akella A, Anand A, Balachandran A, Chitnis P, Muthukrishnan C, Ramjee R, Varghese G (2010) EndRE: an end-system redundancy elimination service for enterprises. In: NSDI, pp 419–432
46. Katiyar A, Weissman JB (2011) ViDeDup: an application-aware framework for video de-duplication. In: Proceedings of the 3rd USENIX Conference on Hot Topics in Storage and File Systems (HotStorage), pp 1–5
47. Li C, Shilane P, Douglis F, Shim H, Smaldone S, Wallace G (2014) Nitro: a capacity-optimized SSD cache for primary storage. In: USENIX Annual Technical Conference, pp 501–512
48. Shen HT, Zhou X, Huang Z, Shao J, Zhou X (2007) UQLIPS: a real-time near-duplicate video clip detection system. In: Proceedings of the 33rd International Conference on Very Large Data Bases. VLDB Endowment, pp 1374–1377
49. Chen F, Luo T, Zhang X (2011) CAFTL: a content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In: Proceedings of 9th USENIX Conference on File Storage Technology (FAST), vol 11, pp 77–90
50. Vrable M, Savage S, Voelker GM (2009) Cumulus: filesystem backup to the cloud. ACM Trans Storage (TOS) 5(4):1–14. https://doi.org/10.1145/1629080.1629084
51. Lai R, Hua Y, Feng D, Xia W, Fu M, Yang Y (2014) A near-exact defragmentation scheme to improve restore performance for cloud backup systems. In: Sun X et al (eds) Algorithms and architectures for parallel processing. LNCS, vol 8630. Springer, Cham, pp 457–471. https://doi.org/10.1007/978-3-319-11197-1_35
52. Mao B, Jiang H, Wu S, Fu Y, Tian L (2014) Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans Storage. https://doi.org/10.1145/2512348
53. Tan Y, Jiang H, Feng D, Tian L, Yan Z (2011) CABdedupe: a causality-based deduplication performance booster for cloud backup services. In: Parallel and Distributed Processing Symposium (IPDPS), IEEE International, pp 1266–1277
54. Yusof NBT, Ismail A, Majid NAA (2016) Deduplication image middleware detection comparison in standalone cloud database. Int J Adv Comput Sci Technol (IJACST) 5(3):12–18
55. Nie Z, Hua Y, Feng D, Li Q, Sun Y (2014) Efficient storage support for real-time near-duplicate video retrieval. In: Sun X et al (eds) Algorithms and architectures for parallel processing, ICA3PP. LNCS, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_24
56. Chen M, Wang S, Tian L (2013) A high-precision duplicate image deduplication approach. J Comput 8(11):2768–2775. https://doi.org/10.4304/jcp.8.11.2768-2775
57. Wang G, Chen S, Lin M, Liu X (2014) SBBS: a sliding blocking algorithm with backtracking sub-blocks for duplicate data detection. Expert Syst Appl 41(5):2415–2423. https://doi.org/10.1016/j.eswa.2013.09.040
58. Bobbarjung DR, Jagannathan S, Dubnicki C (2006) Improving duplicate elimination in storage systems. ACM Trans Storage (TOS) 2(4):424–48. https://doi.org/10.1145/1210596.1210599
59. Kruus E, Ungureanu C, Dubnicki C (2010) Bimodal content defined chunking for backup streams. In: Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pp 239–252
60. Lim SH (2011) DeFFS: duplication-eliminated flash file system. Comput Electr Eng 37(6):1122–1136. https://doi.org/10.1016/j.compeleceng.2011.06.007
61. Kaczmarczyk M, Barczynski M, Kilian W, Dubnicki C (2012) Reducing impact of data fragmentation caused by in-line deduplication. In: Proceedings of the 5th Annual International Systems and Storage Conference. ACM, pp 1–12. https://doi.org/10.1145/2367589.2367600
62. Wildani A, Miller EL, Rodeh O (2013) HANDS: a heuristically arranged non-backup in-line deduplication system. In: IEEE 29th International Conference on Data Engineering (ICDE), pp 446–457. https://doi.org/10.1109/ICDE.2013.6544846
63. Nam YJ, Park D, Du DH (2012) Assuring demanded read performance of data deduplication storage with backup datasets. In: IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp 201–208. https://doi.org/10.1109/MASCOTS.2012.32
64. Park D, Fan Z, Nam YJ, Du DH (2017) A lookahead read cache: improving read performance for deduplication backup storage. J Comput Sci Technol 32(1):26–40. https://doi.org/10.1007/s11390-017-1680-8
65. Xia W, Jiang H, Feng D, Tian L (2016) DARE: a deduplication-aware resemblance detection and elimination scheme for data reduction with low overheads. IEEE Trans Comput 65(6):1692–1705. https://doi.org/10.1109/TC.2015.2456015
66. Fu M, Feng D, Hua Y, He X, Chen Z, Liu J, Xia W, Huang F, Liu Q (2016) Reducing fragmentation for in-line deduplication backup storage via exploiting backup history and cache knowledge. IEEE Trans Parallel Distrib Syst 27(3):855–868. https://doi.org/10.1109/TPDS.2015.2410781
67. Fu Y, Jiang H, Xiao N (2012) A scalable inline cluster deduplication framework for big data protection. In: Narasimhan P, Triantafillou P (eds) Middleware. IFIP International Federation for Information Processing, LNCS, vol 7662. Springer, Berlin, pp 354–373
68. Rabin MO (1981) Fingerprinting by random polynomials. Harvard Aiken Computational Laboratory TR-15-81. http://cr.yp.to/bib/entries.html
69. Bhagwat D, Eshghi K, Long DD, Lillibridge M (2009) Extreme binning: scalable, parallel deduplication for chunk-based file backup. In: Proceedings of IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Computer Society, Washington, DC, vol 9, pp 1–9. https://doi.org/10.1109/MASCOT.2009.5366623
70. Yang TM, Feng D, Niu ZY, Wan YP (2010) Scalable high performance de-duplication backup via hash join. J Zhejiang Univ Sci C 11(5):315–327. https://doi.org/10.1631/jzus.C0910445
71. Min J, Yoon D, Won Y (2011) Efficient deduplication techniques for modern backup operation. IEEE Trans Comput 60(6):824–840. https://doi.org/10.1109/TC.2010.263
72. Guo F, Efstathopoulos P (2011) Building a high-performance deduplication system. In: Proceedings of USENIX Annual Technical Conference
73. Barreto J, Veiga L, Ferreira P (2012) Hash challenges: stretching the limits of compare-by-hash in distributed data deduplication. Inf Process Lett 112(10):380–385. https://doi.org/10.1016/j.ipl.2012.01.012
74. Christen P (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555. https://doi.org/10.1109/TKDE.2011.127
75. Fu Y, Jiang H, Xiao N, Tian L, Liu F, Xu L (2014) Application-aware local-global source deduplication for cloud backup services of personal storage. IEEE Trans Parallel Distrib Syst 25(5):1155–1165. https://doi.org/10.1109/TPDS.2013.167
76. Harnik D, Pinkas B, Shulman-Peleg A (2010) Side channels in cloud services: deduplication in cloud storage. IEEE Secur Priv 8(6):40–47. https://doi.org/10.1109/MSP.2010.187
77. Li J, Chen X, Li M, Li J, Lee PP, Lou W (2014) Secure deduplication with efficient and reliable convergent key management. IEEE Trans Parallel Distrib Syst 25(6):1615–1625. https://doi.org/10.1109/TPDS.2013.284
78. Liu C, Liu X, Wan L (2013) Policy-based de-duplication in secure cloud storage. In: Yuan Y, Wu X, Lu Y (eds) Trustworthy computing and services. ISCTCS Communications in Computer and Information Science, vol 320. Springer, Berlin, pp 250–262. https://doi.org/10.1007/978-3-642-35795-4_32
79. Storer MW, Greenan K, Long DD, Miller EL (2008) Secure data deduplication. In: Proceedings of the 4th ACM International Workshop on Storage Security and Survivability, pp 1–10. https://doi.org/10.1145/1456469.14
80. Li J, Chen X, Huang X, Tang S, Xiang Y, Hassan MM, Alelaiwi A (2015) Secure distributed deduplication systems with improved reliability. IEEE Trans Comput 64(12):3569–3579. https://doi.org/10.1109/TC.2015.2401017
81. Vishalakshi NS, Sridevi S (2017) Survey on secure de-duplication with encrypted data for cloud storage. Int J Adv Res Sci Eng Technol 4(1):3111–3117
82. Bibawe CB, Baviscar V (2017) Secure authorized deduplication for data reduction with low overheads in hybrid cloud. Int J Innov Res Comput Commun Eng 5(2):1797–1804. https://doi.org/10.15680/IJIRCCE.2017.0502105
83. Wu S, Li KC, Mao B, Liao M (2016) DAC: improving storage availability with deduplication-assisted cloud-of-clouds. Future Gener Comput Syst 74:190–198. https://doi.org/10.1016/j.future.2016.02.001
84. Wang J, Zhao Z, Xu Z, Zhang H, Li L, Guo Y (2015) I-sieve: an inline high performance deduplication system used in cloud storage. Tsinghua Sci Technol 20(1):17–27. https://doi.org/10.1109/TST.2015.7040510
85. Leesakul W, Townend P, Xu J (2014) Dynamic data deduplication in cloud storage. In: IEEE 8th International Symposium on Service Oriented System Engineering, pp 320–325. https://doi.org/10.1109/SOSE.2014.46
86. Sun Z, Shen J, Yong J (2013) A novel approach to data deduplication over the engineering-oriented cloud systems. Integr Comput Aided Eng 20(1):45–57. https://doi.org/10.3233/ICA-120418
87. Neelaveni P, Vijayalakshmi M (2016) FC-LID: file classifier based linear indexing for deduplication in cloud backup services. In: Bjørner N, Prasad S, Parida L (eds) Distributed computing and internet technology. LNCS, vol 9581. Springer, Cham, pp 213–222. https://doi.org/10.1007/978-3-319-28034-9_28
88. Li J, Chen X, Xhafa F, Barolli L (2015) Secure deduplication storage systems supporting keyword search. J Comput Syst Sci 81(8):1532–1541. https://doi.org/10.1016/j.jcss.2014.12.026
89. Shin Y, Koo D, Hur J (2017) A survey of secure data deduplication schemes for cloud storage systems. ACM Comput Surv (CSUR) 49(4):1–38. https://doi.org/10.1145/3017428
90. Pokale MS, Dhok S, Kasbe V, Joshi G, Shinde N (2017) Data deduplication and load balancing techniques on cloud systems. Int J Adv Res Comput Commun Eng 6(3):878–883. https://doi.org/10.17148/IJARCCE.2017.63205
91. Debnath BK, Sengupta S, Li J (2010) ChunkStash: speeding up inline storage deduplication using flash memory. In: Proceedings of USENIX Annual Technical Conference (ATC), pp 1–16
92. Dong W, Douglis F, Li K, Patterson RH, Reddy S, Shilane P (2011) Tradeoffs in scalable data routing for deduplication clusters. In: Proceedings of USENIX Conference on File and Storage Technologies (FAST), vol 11, pp 15–29
93. Li J, Qian X, Li Q, Zhao Y, Wang L, Tang YY (2015) Mining near duplicate image groups. Multimed Tools Appl 74(2):655–669
94. Ramaiah NP, Mohan CK (2011) De-duplication of photograph images using histogram refinement. In: Recent Advances in Intelligent Computational Systems (RAICS), IEEE, pp 391–395. https://doi.org/10.1109/RAICS.2011.6069341
95. Zargar AJ, Singh N, Rathee G, Singh AK (2015) Image data-deduplication using the block truncation coding technique. In: Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), International Conference on IEEE, pp 154–158. https://doi.org/10.1109/ABLAZE.2015.7154986
96. Hua Y, He W, Liu X, Feng D (2015) SmartEye: real-time and efficient cloud image sharing for disaster environments. In: IEEE Conference on Computer Communications (INFOCOM), pp 1616–1624. https://doi.org/10.1109/INFOCOM.2015.7218541
97. Li X, Li J, Huang F (2016) A secure cloud storage system supporting privacy-preserving fuzzy deduplication. Soft Comput 20(4):1437–1448. https://doi.org/10.1007/s00500-015-1596-6
98. Deshmukh AS, Lambhate PD (2016) A methodological survey on mapreduce for identification of duplicate images. Int J Sci Res (IJSR) 5(1):206–210
99. Rashid F, Miri A, Woungang I (2016) Secure image deduplication through image compression. J Inf Secur Appl 27:54–64. https://doi.org/10.1016/j.jisa.2015.11.003
100. Zheng Y, Yuan X, Wang X, Jiang J, Wang C, Gui X (2015) Enabling encrypted cloud media center with secure deduplication. In: Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, pp 63–72. https://doi.org/10.1145/2714576.271462
101. Yang X, Zhu Q, Cheng KT (2009) Near-duplicate detection for images and videos. In: Proceedings of the First ACM Workshop on Large-Scale Multimedia Retrieval and Mining, pp 73–80. https://doi.org/10.1145/1631058.1631073
102. Naturel X, Gros P (2005) A fast shot matching strategy for detecting duplicate sequences in a television stream. In: ACM Proceedings of the 2nd International Workshop on Computer Vision Meets Databases, pp 21–27. https://doi.org/10.1145/1160939.1160947
103. Li X, Lin J, Li J, Jin B (2016) A video deduplication scheme with privacy preservation in IoT. In: International Symposium on Computational Intelligence and Intelligent Systems. Communications in Computer and Information Science, vol 575. Springer, Singapore, pp 409–417. https://doi.org/10.1007/978-981-10-0356-1_43
104. Velmurugan K, Baboo LD (2011) Content-based image retrieval using SURF and colour moments. Global J Comput Sci Technol 11(10)
105. Li L (2014) Image matching algorithm based on feature-point and DAISY descriptor. J Multim 9(6):829–834. https://doi.org/10.4304/jmm.9.6.829-834
106. Lei Y, Qiu G, Zheng L, Huang J (2014) Fast near-duplicate image detection using uniform randomized trees. ACM Trans Multim Comput Commun Appl (TOMM) 10(4):1–15. https://doi.org/10.1145/2602186
123