03/11/09
Deduplication Storage System
Kai Li
Charles Fitzmorris Professor, Princeton University
& Chief Scientist and Co-Founder, Data Domain, Inc.
The World Is Becoming Data-Centric
Business & personal life becoming digital
  Make better decisions
Science ⇒ e-science, causing data explosion (e.g. CERN Tier 0)
  Accelerate science discovery
How Much Information (IDC Estimates in 2007)?
Data size is on the way to zettabytes (10^9 TB)
  161 exabytes (~10^8 TB) of information was created or replicated worldwide in 2006
  IDC estimates 6X growth by 2010, to 988 exabytes (about a zettabyte) per year
  New technical information doubles every 2 years
Examples
  Wal-Mart (US): 500 TB, 10^7 transactions / day in 2004
  Google's BigTable (US): 1-2 petabytes
  AT&T: 11 exabytes (~10^7 TB) of wireline, wireless, and Internet data
Challenges in a Digital World
Find information
  Search engines do well for text document data
  Audio, images, video, and sensor data are much more difficult
  Open problems in analysis, visualization, summarization
Access data
  Cloud computing and P2P address location and management issues
  But $/bandwidth varies and improves slowly
Store and protect data
  Protect all versions of data frequently
  Recover any version of data quickly
A Traditional Data Center
[Diagram: Clients → Server → Primary storage ($$$$) → Mirrored storage ($$$)]

A Data Center w/ Networked Disk Storage?
Costs: Purchase $$$$, Bandwidth $$$$$, Power $$$
[Diagram: Clients → Server → Primary storage, mirrored onsite over dedicated Fibre and to a remote site over the WAN; backup storage on each side is 20 × primary (3-month retention)]
Trends for Bytes/$
Storage increases exponentially
  Gap between adjacent classes is about 3-5X
Bandwidth becomes flat
  Low-end WAN: slow improvements
  High-end WAN: almost no improvement
[Chart, 1998-2008: Mbytes/$ over time for FC storage, ATA storage, tape storage, and flash storage vs. WAN bandwidth]
Replication with WAN Is Expensive
A T3 line example: T3 is 45 Mbits/sec, or ~486 GB/day; T3 cost is about $72k/year
The problem
  Bandwidth costs 4x-20x data center primary storage
  WAN Mbytes/$ increases slowly (<< data growth rate)
  The situation will get worse

Policies                                  Dataset size   Cost
Daily full backups                        ~500 GB        $300/GB / 2 years
Daily incremental & weekly full backups   ~3 TB          $48/GB / 2 years
A Data Center with Deduplication Storage
Promises
  Purchase: ~tape libraries
  Space: >10X reduction
  WAN BW: >10X reduction
  Power: >10X reduction
[Diagram: Clients → Server → Primary storage → onsite deduplication storage, replicated over the WAN to a remote mirror]
High-Level Idea of Deduplication
Traditional local compression
  ~2× compression
  Encode in a sliding window of bytes (e.g. 100K)
Deduplication
  ~10-50× compression
  Large window ⇒ more redundant data

Backup Data Example
View from backup software (tar or similar format):
  First Full Backup | Incr 1 | Incr 2 | Second Full Backup
Data stream of segments: A B C D A E F G A B H A E I B J C D E F G H
Deduplicated storage (redundancies pooled, unique segments compressed): A B C D E F G H I J
[Figure legend: unique variable segments; redundant data segments; compressed unique segments]
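A toy sketch (not the Data Domain implementation) of how the backup-data example deduplicates: each unique segment is stored once, and later duplicates become references.

```python
# The 22-segment stream from the example, one letter per segment.
stream = list("ABCDAEFGABHAEIBJCDEFGH")

seen = set()
stored = []          # unique segments, in first-arrival order
for seg in stream:
    if seg not in seen:     # duplicate segments are skipped, not stored
        seen.add(seg)
        stored.append(seg)

print("".join(stored))      # ABCDEFGHIJ: only 10 of the 22 segments are stored
```

The stored pool matches the slide's deduplicated result, A through J.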
Two Deduplication Approaches
Deltas
  Compute a sketch for each segment [Broder97]
  Find a similar segment by comparing sketches
  Compute deltas against the most similar segment
  But: need to read segments from disks
Fingerprinting
  Compute a fingerprint as the ID for each segment
  Use an index to look up whether a segment is a duplicate
  Efficiency depends on index lookups
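The fingerprinting approach can be sketched as follows. SHA-1 is one common fingerprint choice (the slides do not name the hash function), and an in-memory dict stands in for the on-disk index:

```python
import hashlib

index = {}        # fingerprint -> position of the stored segment
store = []        # unique segment data

def write_segment(segment: bytes) -> int:
    # The fingerprint is a collision-resistant hash of the segment contents.
    fp = hashlib.sha1(segment).digest()
    if fp in index:               # duplicate: no data is written
        return index[fp]
    store.append(segment)         # new: store the data and index it
    index[fp] = len(store) - 1
    return index[fp]

a = write_segment(b"segment-1")
b = write_segment(b"segment-2")
assert write_segment(b"segment-1") == a   # duplicate detected by fingerprint
assert len(store) == 2                    # only unique segments stored
```

Every write costs an index lookup, which is why the rest of the talk focuses on making those lookups cheap.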
Segmentation
Two approaches
  Fixed-size: simple, but cannot handle shifts well
  Content-based variable-size: independent of shifts
Segment size
  Smaller ⇒ higher compression
  Smaller ⇒ more memory, less throughput
  Example: 100 GB of index for 20 TB of data at 4 KB segment size
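The 100 GB index figure works out as follows, assuming roughly 20 bytes of index entry per fingerprint (the entry size is an assumption, not stated on the slide):

```python
total_data   = 20 * 2**40     # 20 TB of stored data
segment_size = 4  * 2**10     # 4 KB segments
entry_bytes  = 20             # bytes per index entry (assumed)

num_segments = total_data // segment_size    # ~5.4 billion segments
index_size   = num_segments * entry_bytes
print(index_size // 2**30, "GiB")            # 100 GiB of index for 20 TB of data
```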
[Example: the stream "A X C D A Y C D A B C D A B C D …" is segmented by rolling fingerprinting; a boundary is declared wherever fp = xxxx0000]
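A sketch of content-based variable-size segmentation. A real system uses a Rabin fingerprint over a sliding window; the toy hash below (which depends on roughly the last 32 bytes) only approximates that, and the minimum size and mask are assumed parameters. The idea is the same: a boundary is declared wherever the fingerprint's low bits match a fixed pattern, so boundaries realign after data is inserted or shifted.

```python
MIN_SEG = 16        # minimum segment size in bytes (assumed)
MASK    = 0x3F      # match on the low 6 bits -> ~64-byte average segments (assumed)

def segments(data: bytes):
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF     # toy rolling hash, not a true Rabin fingerprint
        if i - start + 1 >= MIN_SEG and (h & MASK) == MASK:
            yield data[start:i + 1]         # content-defined boundary
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]                  # trailing partial segment

data = bytes(range(256)) * 4
assert b"".join(segments(data)) == data     # segmentation is lossless
```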
Main Components
  Interfaces (NFS, CIFS, VTL, …)
  Object-Oriented File System (data layout, replication)
  Deduplication
  GC & Verification
  RAID (e.g. RAID-6)
  Disk Shelves
Design Challenges
Very reliable and self-healing
  Backup data is the last stop
  Multi-dimensional verification and self-healing
High throughput + high compression at low cost
  Why high throughput: a day has only 24 hours
  Why high compression: make disks cost like tapes
  Controller cost should be small
Typical Alternatives
Cache the index?
  File buffer cache has a low hit rate (<80%): fingerprints are random
Parallelize the index?
  7200 RPM Seagate 1 TB drive: <120 seeks/sec
  120 lookups/s @ 8 KB segments: 0.96 MB/sec/disk
  Need ~200 disks to achieve 200 MB/s! [Venti02]
Buffer the data?
  Need a very large disk buffer
  Long delay to move data offsite
  A day still has 24 hours
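The disk-parallelism arithmetic above checks out (decimal megabytes assumed):

```python
seeks_per_sec = 120        # random index lookups a 7200 RPM disk sustains
segment_bytes = 8_000      # 8 KB segments, decimal

mb_per_disk = seeks_per_sec * segment_bytes / 1e6
print(mb_per_disk)                 # 0.96 MB/sec per disk
print(round(200 / mb_per_disk))    # ~208 disks, i.e. on the order of 200, for 200 MB/s
```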
High Throughput, High Compression at Low Cost
Main ideas
  Lay out data on disk with "duplicate locality"
  Use a sophisticated cache for the fingerprint index
    Use a summary data structure for new data
    Use "locality-preserved caching" for old data
See the FAST paper for details:
  Benjamin Zhu, Kai Li and Hugo Patterson. Avoiding the Disk Bottleneck in a Deduplication Storage System. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST '08), February 2008.
Summary Vector
Goal: use minimal memory to test for new data
⇒ Summarize which segments have been stored with a Bloom filter [Bloom70] in RAM
⇒ If the Summary Vector says no, the segment is new (the test is approximate: a "yes" may be a false positive)
[Diagram: inserting fp(s_i) sets the bits at positions h1(fp), h2(fp), h3(fp) of the bit vector; a lookup reads the same bits and answers "maybe stored" only if all are 1]
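A minimal Summary Vector sketch: a Bloom filter over stored segment fingerprints. The filter size, hash count, and the keyed-SHA-256 hashing scheme are illustrative assumptions, not the production parameters.

```python
import hashlib

M = 1 << 20           # filter size in bits (assumed)
K = 4                 # number of hash functions (assumed)

bits = bytearray(M // 8)

def _positions(fp: bytes):
    # Derive K independent bit positions from the fingerprint.
    for i in range(K):
        h = hashlib.sha256(bytes([i]) + fp).digest()
        yield int.from_bytes(h[:8], "big") % M

def insert(fp: bytes) -> None:
    for p in _positions(fp):
        bits[p >> 3] |= 1 << (p & 7)       # set each of the K bits

def maybe_stored(fp: bytes) -> bool:
    # False -> definitely a new segment; True -> maybe stored (must check the index)
    return all(bits[p >> 3] & (1 << (p & 7)) for p in _positions(fp))

insert(b"fp-of-segment-A")
assert maybe_stored(b"fp-of-segment-A")    # a Bloom filter has no false negatives
```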
Known Analysis Results
Bloom filter with m bits and k independent hash functions.
After inserting n keys, the probability of a false positive is:

  p = (1 - (1 - 1/m)^{kn})^k ≈ (1 - e^{-kn/m})^k

Examples: m/n = 6, k = 4 gives p = 0.0561; m/n = 8, k = 6 gives p = 0.0215
Experimental data validate the analysis results.
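The false-positive formula above can be checked numerically; the values agree with the slide's examples:

```python
import math

def fp_rate(m_over_n: float, k: int) -> float:
    # Approximate false-positive probability: (1 - e^{-kn/m})^k
    return (1.0 - math.exp(-k / m_over_n)) ** k

print(fp_rate(6, 4))   # ~0.056
print(fp_rate(8, 6))   # ~0.022
```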
Stream-Informed Segment Layout (SISL)
Goal: capture "duplicate locality" on disk
  Segments from the same stream are stored in the same "containers"
  Metadata (index data) are also stored in the containers
[Diagram: streams 1, 2, and 3 each fill their own containers; each container has a metadata section and a data section]
Locality Preserved Caching (LPC)
Goal: maintain "duplicate locality" in the cache
  The Disk Index holds all <fingerprint, containerID> pairs
  The Index Cache holds a subset of such pairs
  On a miss, look up the Disk Index to find the containerID
  Load the metadata of that container into the Index Cache, replacing an old container if needed
[Diagram: Index Cache miss → Disk Index → containerID → load the container's metadata into the Index Cache, with replacement]
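A simplified sketch of LPC: on an index-cache miss that hits the disk index, the whole container's metadata (all of its fingerprints) is loaded, prefetching the neighbors that stream-informed layout stored nearby. All names and sizes here are illustrative.

```python
from collections import OrderedDict

disk_index = {}        # fingerprint -> containerID (the complete on-disk index)
container_meta = {}    # containerID -> fingerprints stored in that container

class IndexCache:
    def __init__(self, max_containers: int = 2):
        self.max_containers = max_containers
        self.cached = OrderedDict()          # containerID -> set of fps, LRU order
        self.disk_lookups = 0

    def lookup(self, fp: bytes):
        for cid, fps in self.cached.items():
            if fp in fps:
                self.cached.move_to_end(cid)   # refresh LRU position on a hit
                return cid
        self.disk_lookups += 1                 # miss: consult the disk index
        cid = disk_index.get(fp)
        if cid is None:
            return None                        # genuinely new segment
        if len(self.cached) >= self.max_containers:
            self.cached.popitem(last=False)    # evict the least recently used container
        self.cached[cid] = set(container_meta[cid])   # load the whole container's metadata
        return cid

# Container 7 holds three fingerprints written by the same backup stream.
container_meta[7] = [b"fp-a", b"fp-b", b"fp-c"]
for fp in container_meta[7]:
    disk_index[fp] = 7

cache = IndexCache()
cache.lookup(b"fp-a")             # miss: one disk lookup loads container 7
cache.lookup(b"fp-b")             # hit: served from the preloaded metadata
cache.lookup(b"fp-c")             # hit
assert cache.disk_lookups == 1    # duplicate locality -> one disk I/O for three lookups
```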
Putting Them Together
[Diagram: for each incoming fingerprint, first check the Index Cache (hit ⇒ duplicate); on a miss, consult the Summary Vector ("no" ⇒ new segment); on "maybe", look up the Disk Index and load the matching container's metadata into the Index Cache, with replacement]
Evaluation
What to evaluate
  Disk I/O reduction results
    Drive the evaluation with two real datasets
    Observe results at different parts of the system
  Write and read throughput
    A synthetic benchmark with multiple streams, mimicking multiple backups
  Deduplication results
    Report results from two data centers
Platform: 2 × quad-core 2.8 GHz Intel CPUs, 8 GB RAM, 10 GbE NIC, 1 GB NVRAM, 15 × 7,200 RPM ATA disks
Disk I/O Reduction Results

                          Exchange data (2.56 TB,    Engineering data (2.39 TB,
                          135 daily full backups)    100 days: daily inc, weekly full)
                          # disk I/Os   % of total   # disk I/Os   % of total
No summary, no SISL/LPC   328,613,503   100.00%      318,236,712   100.00%
Summary only              274,364,788   83.49%       259,135,171   81.43%
SISL/LPC only             57,725,844    17.57%       60,358,875    18.97%
Summary & SISL/LPC        3,477,129     1.06%        1,257,316     0.40%
Write Throughput
[Chart: write throughput (MB/s) across backup generations; platform: 2 × quad-core Xeon + 15 disks + 10 GbE]

Read Throughput
[Chart: read throughput (MB/s) across backup generations; same platform]
Real World Example at Datacenter A
[Chart]
Real World Compression at Datacenter A
[Chart]
Real World Example at Datacenter B
[Chart]
Real World Compression at Datacenter B
[Chart]
Related Work
"Local" compression
  Relatively small windows of bytes [Ziv&Lempel77, …]
  Larger windows [Bentley&McIlroy99]
File-level deduplication systems
  Use file hashes to detect duplicate files
  CAS systems, not addressing throughput issues
Fixed-size block deduplication storage prototype [Venti02]
  Fixed-size segments, not addressing throughput issues
Segment-level deduplication
  Content-based segmentation methods [Manber93, Brin94]
  Variable-size segment deduplication for network traffic [Spring00, LBFS01, TAPER05, …]
  Variable-size segment vs. delta deduplication [Kulkarni04]
Use of Bloom filters
  Summary data structure [Bloom70, Fan98, Broder02]
  Detect duplicates [TAPER05]
Summary
Deduplication storage replaces the tape library
  Purchase cost: < tape library solution
  Space reduction: 10-30x
  Bandwidth reduction: 10-100x
  Power reduction: >10x (~compression ratio)
Scalable deduplication NFS throughput
  ~110 MB/s on 2 × 2.6 GHz CPUs w/ 15 disks
  ~210 MB/s on 2 × 2-core 3 GHz CPUs w/ 15 disks
  ~360 MB/s on 2 × 4-core 2.8 GHz CPUs w/ 15 disks (~750 MB/s for the OST interface)
Impact
  Over 10,000 systems deployed to many data centers
  >70% replicate data over WAN for disaster recovery
  See white papers at http://www.datadomain.com
Beyond the Backup Use Case
Nearline storage (already widely in use)
  Besides throughput, optimize for file IOs/sec
  4-10X compression, depending on data
Archiving (beginning)
  Lock, shredding, encryption, …
  4-10X compression, depending on data
Future issues
  Flash + dedup storage
  New storage eco-system for data centers
  Storage infrastructure for Cloud Computing