03/11/09
Deduplication Storage System
Kai Li
Charles Fitzmorris Professor, Princeton University
& Chief Scientist and Co-Founder, Data Domain, Inc.
The World Is Becoming Data-Centric
Business & personal life becoming digital
  Make better decisions
Science ⇒ e-science, causing data explosion (e.g. CERN Tier 0)
  Accelerate science discovery
How Much Information (IDC Estimates in 2007)?
Data size is on the way to zettabytes (10^9 TB)
  161 exabytes (~10^8 TB) of information was created or replicated worldwide in 2006
  IDC estimates 6X growth by 2010, to 988 exabytes (about a zettabyte) per year
  New technical information doubles every 2 years
Examples
  Wal-Mart (US): 500 TB, 10^7 transactions / day in 2004
  Google's BigTable (US): 1-2 petabytes
  AT&T: 11 exabytes (~10^7 TB) of wireline, wireless, and Internet data
Challenges in a Digital World
Find information
  Search engines do well for text document data
  Audio, images, video, and sensor data are much more difficult
  Open problems in analysis, visualization, summarization
Access data
  Cloud computing and P2P address location and management issues
  But $/bandwidth varies and improves slowly
Store and protect data
  Protect all versions of data frequently
  Recover any version of data quickly
A Traditional Data Center
[Diagram: Clients → Server → Primary storage ($$$$) → Mirrored storage ($$$)]

A Data Center w/ Networked Disk Storage?
Costs: Purchase $$$$, Bandwidth $$$$$, Power $$$
[Diagram: Clients → Server → Primary storage, mirrored onsite over dedicated Fibre and to a remote site over the WAN; backup storage on each side is 20 × primary (3-month retention)]
Trends for Bytes/$
Storage increases exponentially
  Gap between adjacent classes is about 3-5X
Bandwidth becomes flat
  Low-end WAN: slow improvements
  High-end WAN: almost no improvement
[Chart, 1998-2008: Mbytes/$ over time for FC storage, ATA storage, tape storage, and flash storage vs. WAN bandwidth]
Replication with WAN Is Expensive
A T3 line example: T3 is 45 Mbits/sec, or ~486 GB/day; T3 cost is about $72k/year
The problem
  Bandwidth costs 4x-20x data center primary storage
  WAN Mbytes/$ increases slowly (<< data growth rate)
  The situation will get worse

Policies                                  Dataset size   Cost
Daily full backups                        ~500 GB        $300/GB / 2 years
Daily incremental & weekly full backups   ~3 TB          $48/GB / 2 years
A Data Center with Deduplication Storage
Promises
  Purchase: ~tape libraries
  Space: >10X reduction
  WAN BW: >10X reduction
  Power: >10X reduction
[Diagram: Clients → Server → Primary storage → onsite deduplication storage, replicated over the WAN to a remote mirror]
High-Level Idea of Deduplication
Traditional local compression
  ~2× compression
  Encode in a sliding window of bytes (e.g. 100K)
Deduplication
  ~10-50× compression
  Large window ⇒ more redundant data

Backup Data Example
View from backup software (tar or similar format):
  First Full Backup | Incr 1 | Incr 2 | Second Full Backup
Data stream of segments: A B C D A E F G A B H A E I B J C D E F G H
Deduplicated storage (redundancies pooled, unique segments compressed): A B C D E F G H I J
[Figure legend: unique variable segments; redundant data segments; compressed unique segments]
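A toy sketch (not the Data Domain implementation) of how the backup-data example deduplicates: each unique segment is stored once, and later duplicates become references.

```python
# The 22-segment stream from the example, one letter per segment.
stream = list("ABCDAEFGABHAEIBJCDEFGH")

seen = set()
stored = []          # unique segments, in first-arrival order
for seg in stream:
    if seg not in seen:     # duplicate segments are skipped, not stored
        seen.add(seg)
        stored.append(seg)

print("".join(stored))      # ABCDEFGHIJ: only 10 of the 22 segments are stored
```

The stored pool matches the slide's deduplicated result, A through J.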
Two Deduplication Approaches
Deltas
  Compute a sketch for each segment [Broder97]
  Find a similar segment by comparing sketches
  Compute deltas against the most similar segment
  But: need to read segments from disks
Fingerprinting
  Compute a fingerprint as the ID for each segment
  Use an index to look up whether a segment is a duplicate
  Efficiency depends on index lookups
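The fingerprinting approach can be sketched as follows. SHA-1 is one common fingerprint choice (the slides do not name the hash function), and an in-memory dict stands in for the on-disk index:

```python
import hashlib

index = {}        # fingerprint -> position of the stored segment
store = []        # unique segment data

def write_segment(segment: bytes) -> int:
    # The fingerprint is a collision-resistant hash of the segment contents.
    fp = hashlib.sha1(segment).digest()
    if fp in index:               # duplicate: no data is written
        return index[fp]
    store.append(segment)         # new: store the data and index it
    index[fp] = len(store) - 1
    return index[fp]

a = write_segment(b"segment-1")
b = write_segment(b"segment-2")
assert write_segment(b"segment-1") == a   # duplicate detected by fingerprint
assert len(store) == 2                    # only unique segments stored
```

Every write costs an index lookup, which is why the rest of the talk focuses on making those lookups cheap.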
Segmentation
Two approaches
  Fixed-size: simple, but cannot handle shifts well
  Content-based variable-size: independent of shifts
Segment size
  Smaller ⇒ higher compression
  Smaller ⇒ more memory, less throughput
  Example: 100 GB of index for 20 TB of data at 4 KB segment size
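The 100 GB index figure works out as follows, assuming roughly 20 bytes of index entry per fingerprint (the entry size is an assumption, not stated on the slide):

```python
total_data   = 20 * 2**40     # 20 TB of stored data
segment_size = 4  * 2**10     # 4 KB segments
entry_bytes  = 20             # bytes per index entry (assumed)

num_segments = total_data // segment_size    # ~5.4 billion segments
index_size   = num_segments * entry_bytes
print(index_size // 2**30, "GiB")            # 100 GiB of index for 20 TB of data
```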
[Example: the stream "A X C D A Y C D A B C D A B C D …" is segmented by rolling fingerprinting; a boundary is declared wherever fp = xxxx0000]
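A sketch of content-based variable-size segmentation. A real system uses a Rabin fingerprint over a sliding window; the toy hash below (which depends on roughly the last 32 bytes) only approximates that, and the minimum size and mask are assumed parameters. The idea is the same: a boundary is declared wherever the fingerprint's low bits match a fixed pattern, so boundaries realign after data is inserted or shifted.

```python
MIN_SEG = 16        # minimum segment size in bytes (assumed)
MASK    = 0x3F      # match on the low 6 bits -> ~64-byte average segments (assumed)

def segments(data: bytes):
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF     # toy rolling hash, not a true Rabin fingerprint
        if i - start + 1 >= MIN_SEG and (h & MASK) == MASK:
            yield data[start:i + 1]         # content-defined boundary
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]                  # trailing partial segment

data = bytes(range(256)) * 4
assert b"".join(segments(data)) == data     # segmentation is lossless
```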
Main Components
  Interfaces (NFS, CIFS, VTL, …)
  Object-Oriented File System (data layout, replication)
  Deduplication
  GC & Verification
  RAID (e.g. RAID-6)
  Disk Shelves
Design Challenges
Very reliable and self-healing
  Backup data is the last stop
  Multi-dimensional verification and self-healing
High throughput + high compression at low cost
  Why high throughput: a day has only 24 hours
  Why high compression: make disks cost like tapes
  Controller cost should be small
Typical Alternatives
Cache the index?
  File buffer cache has a low hit rate (<80%): fingerprints are random
Parallelize the index?
  7200 RPM Seagate 1 TB drive: <120 seeks/sec
  120 lookups/s @ 8 KB segments: 0.96 MB/sec/disk
  Need ~200 disks to achieve 200 MB/s! [Venti02]
Buffer the data?
  Need a very large disk buffer
  Long delay to move data offsite
  A day still has 24 hours
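The disk-parallelism arithmetic above checks out (decimal megabytes assumed):

```python
seeks_per_sec = 120        # random index lookups a 7200 RPM disk sustains
segment_bytes = 8_000      # 8 KB segments, decimal

mb_per_disk = seeks_per_sec * segment_bytes / 1e6
print(mb_per_disk)                 # 0.96 MB/sec per disk
print(round(200 / mb_per_disk))    # ~208 disks, i.e. on the order of 200, for 200 MB/s
```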
High Throughput, High Compression at Low Cost
Main ideas
  Lay out data on disk with "duplicate locality"
  Use a sophisticated cache for the fingerprint index
    Use a summary data structure for new data
    Use "locality-preserved caching" for old data
See the FAST paper for details:
  Benjamin Zhu, Kai Li and Hugo Patterson. Avoiding the Disk Bottleneck in a Deduplication Storage System. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST '08), February 2008.
Summary Vector
Goal: use minimal memory to test for new data
⇒ Summarize which segments have been stored with a Bloom filter [Bloom70] in RAM
⇒ If the Summary Vector says no, the segment is new (the test is approximate: a "yes" may be a false positive)
[Diagram: inserting fp(s_i) sets the bits at positions h1(fp), h2(fp), h3(fp) of the bit vector; a lookup reads the same bits and answers "maybe stored" only if all are 1]
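A minimal Summary Vector sketch: a Bloom filter over stored segment fingerprints. The filter size, hash count, and the keyed-SHA-256 hashing scheme are illustrative assumptions, not the production parameters.

```python
import hashlib

M = 1 << 20           # filter size in bits (assumed)
K = 4                 # number of hash functions (assumed)

bits = bytearray(M // 8)

def _positions(fp: bytes):
    # Derive K independent bit positions from the fingerprint.
    for i in range(K):
        h = hashlib.sha256(bytes([i]) + fp).digest()
        yield int.from_bytes(h[:8], "big") % M

def insert(fp: bytes) -> None:
    for p in _positions(fp):
        bits[p >> 3] |= 1 << (p & 7)       # set each of the K bits

def maybe_stored(fp: bytes) -> bool:
    # False -> definitely a new segment; True -> maybe stored (must check the index)
    return all(bits[p >> 3] & (1 << (p & 7)) for p in _positions(fp))

insert(b"fp-of-segment-A")
assert maybe_stored(b"fp-of-segment-A")    # a Bloom filter has no false negatives
```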
Known Analysis Results
Bloom filter with m bits and k independent hash functions.
After inserting n keys, the probability of a false positive is:

  p = (1 - (1 - 1/m)^{kn})^k ≈ (1 - e^{-kn/m})^k

Examples: m/n = 6, k = 4 gives p = 0.0561; m/n = 8, k = 6 gives p = 0.0215
Experimental data validate the analysis results.
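The false-positive formula above can be checked numerically; the values agree with the slide's examples:

```python
import math

def fp_rate(m_over_n: float, k: int) -> float:
    # Approximate false-positive probability: (1 - e^{-kn/m})^k
    return (1.0 - math.exp(-k / m_over_n)) ** k

print(fp_rate(6, 4))   # ~0.056
print(fp_rate(8, 6))   # ~0.022
```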
Stream-Informed Segment Layout (SISL)
Goal: capture "duplicate locality" on disk
  Segments from the same stream are stored in the same "containers"
  Metadata (index data) are also stored in the containers
[Diagram: streams 1, 2, and 3 each fill their own containers; each container has a metadata section and a data section]
Locality Preserved Caching (LPC)
Goal: maintain "duplicate locality" in the cache
  The Disk Index holds all <fingerprint, containerID> pairs
  The Index Cache holds a subset of such pairs
  On a miss, look up the Disk Index to find the containerID
  Load the metadata of that container into the Index Cache, replacing an old container if needed
[Diagram: Index Cache miss → Disk Index → containerID → load the container's metadata into the Index Cache, with replacement]
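A simplified sketch of LPC: on an index-cache miss that hits the disk index, the whole container's metadata (all of its fingerprints) is loaded, prefetching the neighbors that stream-informed layout stored nearby. All names and sizes here are illustrative.

```python
from collections import OrderedDict

disk_index = {}        # fingerprint -> containerID (the complete on-disk index)
container_meta = {}    # containerID -> fingerprints stored in that container

class IndexCache:
    def __init__(self, max_containers: int = 2):
        self.max_containers = max_containers
        self.cached = OrderedDict()          # containerID -> set of fps, LRU order
        self.disk_lookups = 0

    def lookup(self, fp: bytes):
        for cid, fps in self.cached.items():
            if fp in fps:
                self.cached.move_to_end(cid)   # refresh LRU position on a hit
                return cid
        self.disk_lookups += 1                 # miss: consult the disk index
        cid = disk_index.get(fp)
        if cid is None:
            return None                        # genuinely new segment
        if len(self.cached) >= self.max_containers:
            self.cached.popitem(last=False)    # evict the least recently used container
        self.cached[cid] = set(container_meta[cid])   # load the whole container's metadata
        return cid

# Container 7 holds three fingerprints written by the same backup stream.
container_meta[7] = [b"fp-a", b"fp-b", b"fp-c"]
for fp in container_meta[7]:
    disk_index[fp] = 7

cache = IndexCache()
cache.lookup(b"fp-a")             # miss: one disk lookup loads container 7
cache.lookup(b"fp-b")             # hit: served from the preloaded metadata
cache.lookup(b"fp-c")             # hit
assert cache.disk_lookups == 1    # duplicate locality -> one disk I/O for three lookups
```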
Putting Them Together
[Diagram: for each incoming fingerprint, first check the Index Cache (hit ⇒ duplicate); on a miss, consult the Summary Vector ("no" ⇒ new segment); on "maybe", look up the Disk Index and load the matching container's metadata into the Index Cache, with replacement]
Evaluation
What to evaluate
  Disk I/O reduction results
    Drive the evaluation with two real datasets
    Observe results at different parts of the system
  Write and read throughput
    A synthetic benchmark with multiple streams, mimicking multiple backups
  Deduplication results
    Report results from two data centers
Platform: 2 × quad-core 2.8 GHz Intel CPUs, 8 GB RAM, 10 GbE NIC, 1 GB NVRAM, 15 × 7,200 RPM ATA disks
Disk I/O Reduction Results

                          Exchange data (2.56 TB,    Engineering data (2.39 TB,
                          135 daily full backups)    100 days: daily inc, weekly full)
                          # disk I/Os   % of total   # disk I/Os   % of total
No summary, no SISL/LPC   328,613,503   100.00%      318,236,712   100.00%
Summary only              274,364,788   83.49%       259,135,171   81.43%
SISL/LPC only             57,725,844    17.57%       60,358,875    18.97%
Summary & SISL/LPC        3,477,129     1.06%        1,257,316     0.40%
Write Throughput
[Chart: write throughput (MB/s) across backup generations; platform: 2 × quad-core Xeon + 15 disks + 10 GbE]

Read Throughput
[Chart: read throughput (MB/s) across backup generations; same platform]
Real World Example at Datacenter A
[Chart]
Real World Compression at Datacenter A
[Chart]
Real World Example at Datacenter B
[Chart]
Real World Compression at Datacenter B
[Chart]
Related Work
"Local" compression
  Relatively small windows of bytes [Ziv&Lempel77, …]
  Larger windows [Bentley&McIlroy99]
File-level deduplication systems
  Use file hashes to detect duplicate files
  CAS systems, not addressing throughput issues
Fixed-size block deduplication storage prototype [Venti02]
  Fixed-size segments, not addressing throughput issues
Segment-level deduplication
  Content-based segmentation methods [Manber93, Brin94]
  Variable-size segment deduplication for network traffic [Spring00, LBFS01, TAPER05, …]
  Variable-size segment vs. delta deduplication [Kulkarni04]
Use of Bloom filters
  Summary data structure [Bloom70, Fan98, Broder02]
  Detect duplicates [TAPER05]
Summary
Deduplication storage replaces the tape library
  Purchase cost: < tape library solution
  Space reduction: 10-30x
  Bandwidth reduction: 10-100x
  Power reduction: >10x (~compression ratio)
Scalable deduplication NFS throughput
  ~110 MB/s on 2 × 2.6 GHz CPUs w/ 15 disks
  ~210 MB/s on 2 × 2-core 3 GHz CPUs w/ 15 disks
  ~360 MB/s on 2 × 4-core 2.8 GHz CPUs w/ 15 disks (~750 MB/s for the OST interface)
Impact
  Over 10,000 systems deployed to many data centers
  >70% replicate data over WAN for disaster recovery
  See white papers at http://www.datadomain.com
Beyond the Backup Use Case
Nearline storage (already widely in use)
  Besides throughput, optimize for file IOs/sec
  4-10X compression, depending on data
Archiving (beginning)
  Lock, shredding, encryption, …
  4-10X compression, depending on data
Future issues
  Flash + dedup storage
  New storage eco-system for data centers
  Storage infrastructure for Cloud Computing