Analysis of Disk Access Patterns on File Systems for Content Addressable Storage Kuniyasu Suzaki, Kengo Iijima, Toshiki Yagi, Cyrille Artho Research Center for Information Security Linux Symposium 2011 at Ottawa

Page 1: Linux Symposium 2011 "Analysis of Disk Access Patterns on File Systems for Content Addressable Storage"

Analysis of Disk Access Patterns on File Systems for Content Addressable Storage

Kuniyasu Suzaki, Kengo Iijima, Toshiki Yagi, Cyrille Artho

Research Center for Information Security

Linux Symposium 2011 at Ottawa

Page 2

What I want to talk about
• I show evidence of the affinity between file systems and CAS (fixed-size deduplication storage).
• The evidence indicates:
  – We should NOT use ext3 on deduplication storage (IaaS cloud).
  – Candidate good FS for deduplication storage (each has a caveat):
    • NTFS was good in the deduplication test, but Linux cannot boot from it.
    • ext4 was stable in both the deduplication test and the real case, but it was not the best.
    • ReiserFS showed good results in the real case, but has weak points.
    • JFS showed a high same-chunk ratio, but its other results were not good.
    • btrfs was good in the deduplication test, but has not yet been tested in the real case.

# Please discuss or comment.

Page 3

Contents
• What is CAS? What is deduplication?
• Block allocation strategies of file systems
• Preliminary evaluation of the affinity between file systems and CAS
  – Propose a file deduplication test and evaluate 9 file systems (ext3, ext4, XFS, JFS, ReiserFS, NILFS, btrfs, FAT32, and NTFS).
• Real-case evaluation
  – Ubuntu installed on ext3/ext4/JFS/ReiserFS/XFS on CAS
• Conclusion

Page 4

CAS: Content Addressable Storage

[Figure: a virtual disk is mapped through an index to the CAS storage archive. Chunks with the same content share one stored entry (deduplication); writing new content creates a new block named by its new SHA-1.]

Address            SHA-1
0000000-0003FFF    4ad36ffe8…
0004000-0007FFF    974daf34a…
0008000-000BFFF    2d34ff3e1…
000C000-000FFFF    974daf34a…
…                  …

• A CAS is a virtual block device. Data is not addressed by its physical location but by a unique name derived from its content (usually a secure hash).
• Identical contents are stored once (same hash); the other occurrences refer to the original through an indirect link (deduplication storage).
  – Plan 9 has Venti [USENIX FAST02]
  – Data Domain (EMC) Deduplication [USENIX FAST08]
  – LBCAS (Loopback Content Addressable Storage) [LinuxSymp09]
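The address-to-SHA-1 mapping on this slide can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions, not how LBCAS is implemented: the class `ToyCAS`, its method names, and the 16 KB chunk size (read off the 0x4000-wide address ranges on the slide) are hypothetical.

```python
import hashlib

CHUNK_SIZE = 16 * 1024  # assumption: 0x4000 bytes, as the slide's address ranges suggest


class ToyCAS:
    """Hypothetical in-memory CAS: the virtual disk is an index of SHA-1 names."""

    def __init__(self, num_chunks):
        self.store = {}  # SHA-1 hex digest -> chunk bytes (the "archive")
        zero_name = self._put(bytes(CHUNK_SIZE))
        self.index = [zero_name] * num_chunks  # chunk number -> SHA-1 name

    def _put(self, data):
        # Content is named by its hash; identical content is stored only once.
        name = hashlib.sha1(data).hexdigest()
        self.store.setdefault(name, data)
        return name

    def write_chunk(self, chunk_no, data):
        assert len(data) == CHUNK_SIZE
        self.index[chunk_no] = self._put(data)

    def read_chunk(self, chunk_no):
        return self.store[self.index[chunk_no]]


cas = ToyCAS(num_chunks=4)
a = b"A" * CHUNK_SIZE
b = b"B" * CHUNK_SIZE
cas.write_chunk(0, b)
cas.write_chunk(1, a)
cas.write_chunk(3, a)  # same content as chunk 1: shares the stored entry
unique_chunks = len(cas.store)  # zero chunk, a, and b
```

Chunks 1 and 3 end up with the same name in the index, mirroring the two rows of the table that carry the same SHA-1.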

Page 5

Fixed Size vs. Variable Length
• Content for deduplication is managed in units called “chunks”.
• According to how chunks are sized, CAS is divided into 2 categories.
  – Fixed size: efficient, but cannot find identical content that does not match the chunk alignment.
    • A chunk is usually bigger than 4KB (the FS block size) for performance.
  – Variable length: finds identical content of any length, but is not efficient.
• In this talk, we assume a CAS with fixed-size chunks.
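The alignment limitation of fixed-size chunking can be seen in a few lines. The sketch below is an illustration only: the helper `chunk_hashes`, the 32 KB chunk size, and the 4 KB shift are choices made here, not taken from LBCAS.

```python
import hashlib
import os

CHUNK = 32 * 1024  # one of the two chunk sizes evaluated later in the talk


def chunk_hashes(data, chunk_size=CHUNK):
    """SHA-1 of each fixed-size chunk; a final partial chunk is zero-padded."""
    hashes = []
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size].ljust(chunk_size, b"\0")
        hashes.append(hashlib.sha1(chunk).hexdigest())
    return hashes


payload = os.urandom(8 * CHUNK)          # 256 KB of random content
payload_names = set(chunk_hashes(payload))

aligned_disk = bytes(CHUNK) + payload    # content starts on a chunk boundary
shifted_disk = bytes(4096) + payload     # same content, moved by one 4 KB FS block

common_aligned = payload_names & set(chunk_hashes(aligned_disk))
common_shifted = payload_names & set(chunk_hashes(shifted_disk))
```

With aligned placement all 8 chunks of the payload are found again; after a 4 KB shift none of them match, although the stored bytes are the same.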

Page 6

Open Source CAS (deduplication storage)
• LBCAS: Loopback Content Addressable Storage
  – http://openlab.jp/oscircular/
• SDFS: A user space deduplication file system
  – http://www.opendedup.org/
• lessfs: Open source data deduplication for less
  – http://www.lessfs.com/

# In this talk, I use LBCAS.

Page 7

Where is it used?
• The current main target is backup servers.
  – Many commercial products exist (EMC, Symantec, NetApp, etc.).
• IaaS hosts many virtual machines and keeps many virtual disks for them. Fortunately, most people use popular OSes, so the disks hold much of the same content.
• Deduplication is applied to reduce the storage consumed by these many virtual disks.
• Even if the same content is saved on the virtual disks, the effect of fixed-size deduplication depends on how the file system lays the data out on each virtual disk.

Page 8

File Systems
• Linux has many file systems for many purposes.
• A file system works as a filter that allocates data on a disk.
  – Each filter changes the location of data according to its own strategy.
  – The effect of deduplication changes depending on the location.

Page 9

File Systems

File System      Feature for block allocation
ext3 *           ext2 with journaling; block groups are derived from FFS.
ext4 *           Successor of ext3; extent allocation, delayed allocation.
JFS *            Dynamic i-node allocation, extent allocation.
XFS *            Variable block size, extent allocation.
ReiserFS (v3) *  Block sub-allocation (tail packing).
Nilfs            Stackable (log-structured) FS.
Btrfs            Copy-on-write, extent allocation.
FAT32            FS for Windows; file allocation table. No journaling.
NTFS             FS for Windows NT; extent allocation. Linux uses the NTFS-3G driver.

“*” indicates a bootable FS. All file systems except FAT32 have a journaling function.

Page 10

Allocation Techniques
• Extent allocation
  – Keeps contiguous physical blocks for a file and reduces fragmentation.
• Block sub-allocation (tail packing)
  – Allocates the last partial blocks (less than 4KB) of multiple files into a single block.
• Stackable (log-structured) FS
  – Allocates data in succession from the head to the tail of a disk.

Page 11

To increase deduplication
• The FS (which is a filter that allocates data on a disk) should have the following features.
  – Alignment matching
    • If the FS allocates each file to fit the chunk alignment, the file is easily deduplicated.
  – Contiguous allocation of data blocks
    • If 4KB data blocks are not allocated contiguously, deduplication will be reduced, especially for large files. Extents solve this problem.
  – Non-contaminated chunks
    • If a chunk is shared by several files, deduplication will be reduced.
    • If a 4KB data block is shared with another file (block sub-allocation), deduplication will be reduced. (ReiserFS will not fit.)
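The contamination effect can be sketched with two disks that both hold the same small file next to a different neighbour. This is a deliberately simplified model: `tail_packed` is a hypothetical stand-in that just concatenates file tails, not ReiserFS's actual algorithm, and the comparison is done on 4 KB blocks rather than on larger chunks.

```python
import hashlib

BLOCK = 4096  # FS block size in bytes


def block_per_file(tails):
    """No sub-allocation: each small file gets its own zero-padded 4 KB block."""
    return [t.ljust(BLOCK, b"\0") for t in tails]


def tail_packed(tails):
    """Hypothetical tail packing: small files are concatenated into shared blocks."""
    packed = b"".join(tails)
    return [packed[i:i + BLOCK].ljust(BLOCK, b"\0")
            for i in range(0, len(packed), BLOCK)]


def names(blocks):
    return {hashlib.sha1(b).hexdigest() for b in blocks}


common = b"c" * 1000             # the same small file on both disks
disk1 = [b"x" * 700, common]     # neighbour differs on each disk
disk2 = [b"y" * 900, common]

shared_plain = names(block_per_file(disk1)) & names(block_per_file(disk2))
shared_packed = names(tail_packed(disk1)) & names(tail_packed(disk2))
space_plain = len(block_per_file(disk1))    # blocks used without packing
space_packed = len(tail_packed(disk1))      # blocks used with packing
```

Packing halves the space on each disk here, but the block holding the common file no longer matches across the two disks, which is the trade-off this slide warns about.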

Page 12

File Deduplication TEST
• When 1,000 files with the same 1MB content are stored on a disk through a normal file system, they use 1,000MB of storage.
• However, if the deduplication of CAS works perfectly, the files are saved in only 1MB.

Page 13

File Deduplication TEST

[Figure: the same-content files are saved through two file systems, each acting as a filter that allocates files on the disk (addresses 00000000 to FFFFFFFFF) by its own strategy. File System A allocates files with alignment, contiguity, and non-contamination; as a result, the chunks in the CAS system are identified and deduplicated. With File System B, only a few chunks are deduplicated. The resulting volumes are compared.]

Page 14

File Deduplication TEST
• We saved copies of the same file until they filled 1GB of a 4GB LBCAS (we evaluated 2 chunk sizes: 32KB and 256KB).
  – All files contain the same random data.
• 5 cases
  – 100KB file * 10,000
  – 1,000KB (1MB) file * 1,000
  – 10,000KB (10MB) file * 100
  – 256KB file * 3,906
    • Checks whether data is allocated on a power-of-2 alignment.
  – 252KB file * 3,968
    • Used for comparison with the 256KB file. If one 4KB block is used for metadata or something similar, the file will fit a power-of-2 alignment.
• We assumed that a stackable FS would fit the 256KB or 252KB file cases.
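The file counts of the five cases follow from dividing a fill target by the file size, and the alignment question can be framed as "how many different start offsets, modulo the chunk size, do back-to-back files get?". The sketch below assumes that the 1GB target means 1,000,000 KB (this reproduces the counts on the slide) and that files are laid out contiguously from address 0 with nothing in between, which no real file system guarantees.

```python
from math import gcd

TARGET_KB = 1_000_000                      # assumption: "1GB" taken as 1,000,000 KB
FILE_SIZES_KB = [100, 1_000, 10_000, 256, 252]
CHUNK_SIZES_KB = [32, 256]


def file_count(size_kb):
    """Number of whole files of this size that fit into the fill target."""
    return TARGET_KB // size_kb


def distinct_phases(size_kb, chunk_kb):
    """Distinct start offsets (mod chunk size) of files stored back to back.

    File k starts at k * size_kb, so the offsets cycle through
    chunk_kb / gcd(size_kb, chunk_kb) different values.
    """
    return chunk_kb // gcd(size_kb, chunk_kb)


counts = {s: file_count(s) for s in FILE_SIZES_KB}
phases = {(s, c): distinct_phases(s, c)
          for s in FILE_SIZES_KB for c in CHUNK_SIZES_KB}
```

Under this idealised layout a 256KB file always starts on a chunk boundary (one phase), while a 100KB file cycles through 8 phases with 32KB chunks, so fewer of its chunks can coincide.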

Page 15

Result overview
• Nilfs and ext3 are bad.
• Most FS do not treat the 10MB file well.
  – Contiguous allocation is not kept.
• The 252KB and 256KB files do not show any special features.

[Figure: two result charts, one for the 32KB chunk and one for the 256KB chunk.]

The smaller chunk has more chances to be deduplicated, but the overhead becomes heavy.

Page 16

Result detail
• The ideal deduplication line shows the smallest possible CAS size. The closer a bar is to the line, the better.
• NTFS is good with both 32KB and 256KB chunks.
• ext4 and btrfs are good with 32KB chunks.

[Figure: two bar charts, one for the 32KB chunk and one for the 256KB chunk.]

Page 17

Result: Comparison between 32KB and 256KB chunks
• The metric is (CAS size with 256KB chunks) / (CAS size with 32KB chunks).
• It shows how much worse the result becomes with the larger chunk size (from 32KB to 256KB, x8).
• FAT32 shows durability for the larger chunk size.
  – The ratio is almost 4 for any file size, but the deduplication itself is not good.
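Why this ratio lies between 1 and 8 can be shown with a small model. The sketch assumes that every occupied 32KB chunk has unique content (so nothing is deduplicated) and only asks how many 256KB chunks are needed to cover the same data; the function names and layouts are made up for illustration.

```python
def cas_size_kb(occupied_32k, chunk_kb):
    """CAS size when the given 32KB slots hold data and no two chunks are equal."""
    slots_per_chunk = chunk_kb // 32
    return len({i // slots_per_chunk for i in occupied_32k}) * chunk_kb


def growth_ratio(occupied_32k):
    """(CAS size with 256KB chunks) / (CAS size with 32KB chunks)."""
    return cas_size_kb(occupied_32k, 256) / cas_size_kb(occupied_32k, 32)


packed = range(0, 64)              # data fills 256KB chunks completely
every_fourth = range(0, 256, 4)    # two 32KB slots used per 256KB chunk
scattered = range(0, 512, 8)       # one 32KB slot used per 256KB chunk

ratios = [growth_ratio(layout) for layout in (packed, every_fourth, scattered)]
```

Densely packed data gives a ratio of 1, fully scattered data gives the worst case of 8, and a ratio near 4 corresponds to larger chunks that are on average a quarter full of distinct data.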

Page 18

Summary of File Deduplication TEST
• ext3 and nilfs are not good for fixed-size deduplication (LBCAS).
• NTFS is good with both chunk sizes (32KB and 256KB) and every file size (100KB, 1MB, 10MB, 252KB, and 256KB).
• ext4 and Btrfs are good with the 32KB chunk size.

Page 19

Real Case Evaluation
• We evaluate installing and booting Ubuntu (11.04 desktop) on CAS.
• Ubuntu is installed on different file systems.
  – The contents on each CAS are almost the same, so we evaluate the features of the file systems.
• The target file systems are bootable FS that GRUB recognizes.
  – ext3, ext4, XFS, JFS, and ReiserFS
• We evaluate the dynamic behavior during installing and booting, and the static CAS images.

Page 20

Evaluation condition
• Ubuntu 11.04 desktop is installed on a 4GB virtual disk (LBCAS) with a KVM virtual machine.
• The KVM guest has 768MB of memory and runs on a ThinkPad T400 (Intel Core2 Duo, 2GB memory).
• We compared the effects of 32KB and 256KB chunks of LBCAS.

Page 21

Statistics for each file size in Ubuntu
• The contents installed by Ubuntu total almost 2GB on any FS.
  # Files smaller than 4KB are rounded up to 4KB, because the normal block size is 4KB.
• 77.9% of the files are smaller than 4KB, but together they occupy 20.1% of the disk space.
• The file system works as a filter and allocates them with its own strategy.

[Figure: file-size statistics, shown as number of files (total 132,205) and occupied space (total 2GB).]
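The rounding rule on this slide (files below 4KB count as one 4KB block) can be written down directly. The list `sizes` is made-up sample data, and the final two lines only redo the slide's arithmetic from its own figures (77.9% of 132,205 files at 4KB each).

```python
BLOCK = 4096  # normal FS block size in bytes, as stated on the slide


def allocated(size):
    """Bytes occupied on disk when a file is rounded up to whole 4KB blocks."""
    blocks = max(1, (size + BLOCK - 1) // BLOCK)
    return blocks * BLOCK


def small_file_shares(sizes):
    """Return (share of files below 4KB, share of allocated space they occupy)."""
    small = [s for s in sizes if s < BLOCK]
    total_space = sum(allocated(s) for s in sizes)
    small_space = sum(allocated(s) for s in small)
    return len(small) / len(sizes), small_space / total_space


sizes = [100, 2000, 3000, 20000]                      # hypothetical file sizes in bytes
file_share, space_share = small_file_shares(sizes)   # 0.75 and 0.375

# Cross-check with the slide's figures: 77.9% of 132,205 files, 4KB each.
small_files = round(0.779 * 132_205)
small_mb = small_files * 4 / 1024
```

The cross-check gives roughly 400MB for the small files, about a fifth of a 2GB installation, which is in line with the 20.1% stated above.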

Page 22

Access Trace on each FS

[Figure: disk access traces for ext3, ext4, JFS, XFS, and ReiserFS over the 4GB disk address space, during installing (2,000 sec) and booting (120 sec). Red is read, green is write.]

Page 23

Installing
• The amount of write requests was more than 3GB, and LBCAS reduced it by more than 1GB.
  – This means the installer issues redundant write requests.
• XFS requires the most write requests, even though almost the same image is installed. JFS requires the least.

[Figure: amount (MB) of read and write requests issued by the installer, and of accessed chunks. Remember that the amount of files is 2GB.]

Page 24

Overhead for creating FS
• Creating an FS (mkfs) has large losses from the viewpoint of LBCAS, except on JFS.
  – This means that creating an FS issues redundant write requests. However, these losses do not account for the loss at installation (more than 1GB).

[Figure: amount (MB) of write requests issued by mkfs, and of created chunks. ext3 had more than 100MB of loss. JFS has almost no loss, which means its chunks are full of data.]

Page 25

Static Disk Image (Coverage of created chunks)
• ReiserFS made the smallest CAS image. This comes from tail packing.

[Figure: coverage (MB) of created chunks for each FS; the left bar is the 32KB chunk, the right bar is the 256KB chunk. There is only one zero-filled chunk, but it covers half of the disk. 10% is reduced by tail packing. Remember that the amount of files is 2GB.]

Page 26

Deduplication on each Single Disk Image
• ext3 and ext4 have many identical chunks, which are deduplicated. However, the total is small (less than 80MB) compared to the 2GB image, so the impact on a single disk image is low.
• We should evaluate the ratio of same chunks across other CAS images (discussed later).

[Figure: effect of deduplication (MB reduced) on each disk image; the left bar is the 32KB chunk, the right bar is the 256KB chunk.]

Page 27

Booting
• The amount of chunks read at boot time is larger than the amount requested by the OS.
  – Redundant data is read from CAS.
  – The file system should be optimized to pack data into chunks.
    • See our paper presented at the ASPLOS 2011 workshop “Resolve”.

[Figure: amount (MB) of requests issued by the booting OS, and of chunks read for those requests.]
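The gap between requested data and chunks read follows from chunk granularity: any request, however small, pulls in every chunk it touches. The sketch below assumes that each touched chunk is fetched whole and exactly once; the request list is made up for illustration.

```python
def chunks_for_requests(requests, chunk_size):
    """Set of chunk numbers touched by (offset, length) requests, in bytes."""
    touched = set()
    for offset, length in requests:
        if length <= 0:
            continue
        first = offset // chunk_size
        last = (offset + length - 1) // chunk_size
        touched.update(range(first, last + 1))
    return touched


def amplification(requests, chunk_size):
    """Bytes fetched from CAS divided by bytes requested by the OS."""
    requested = sum(length for _, length in requests)
    fetched = len(chunks_for_requests(requests, chunk_size)) * chunk_size
    return fetched / requested


KB = 1024
requests = [(0, 4 * KB), (64 * KB, 4 * KB), (300 * KB, 8 * KB)]  # hypothetical reads

amp_32k = amplification(requests, 32 * KB)    # 3 chunks fetched for 16KB requested
amp_256k = amplification(requests, 256 * KB)  # 2 chunks fetched for 16KB requested
```

Scattered small reads are amplified 6 times with 32KB chunks and 32 times with 256KB chunks in this example, which is why packing the data needed at boot into few chunks matters.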

Page 28

Relation between CAS Images
• Compare the ratio of same chunks between different FS.
• Compare the ratio of same chunks between different installations with the same FS.
• The results indicate the affinity of CAS images on multi-tenant IaaS.
  – A high ratio is desired.

[Figure: the CAS images of ext3, ext4, ReiserFS, JFS, and XFS are compared between different file systems, and each CAS image is compared with another CAS image from the same installation on the same file system.]
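One way to compute a same-chunk ratio between two images is to compare their sets of chunk hashes. The slides do not define the ratio precisely, so the definition used here (share of image A's distinct chunks that also occur in image B) is an assumption, and the tiny images are synthetic.

```python
import hashlib


def chunk_names(image, chunk_size):
    """SHA-1 names of the fixed-size chunks of a disk image given as bytes."""
    return [hashlib.sha1(image[off:off + chunk_size]).hexdigest()
            for off in range(0, len(image), chunk_size)]


def same_chunk_ratio(image_a, image_b, chunk_size):
    """Assumed definition: fraction of A's distinct chunks that also occur in B."""
    a = set(chunk_names(image_a, chunk_size))
    b = set(chunk_names(image_b, chunk_size))
    return len(a & b) / len(a)


def blk(ch):
    return ch * 4096  # one 4KB block filled with a single byte value


image_a = blk(b"A") + blk(b"B") + blk(b"C") + blk(b"D")
image_b = blk(b"A") + blk(b"X") + blk(b"C") + blk(b"Y")

ratio_4k = same_chunk_ratio(image_a, image_b, 4096)
ratio_8k = same_chunk_ratio(image_a, image_b, 8192)
```

In this toy case half of the 4KB chunks match, but none of the 8KB chunks do, because each larger chunk also contains a block that differs between the images.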

Page 29

Relation between CAS Images
• From the upper graph (between different file systems)
  – There is no strong relation between different FS.
  – The 4KB chunk has high similarity, because most file systems use 4KB blocks.
• From the lower graph (between different installations with the same file system)
  – JFS and ReiserFS show a high same-chunk ratio for any chunk size. We guess that their block allocation is repeatable.
    • ReiserFS has block sub-allocation (tail packing), and its total CAS size is reduced by 10%. Nevertheless, many chunks are the same across different installations, which means that identical combinations of sub-allocations occur.

Page 30

Block Allocation Repeatability
• Next challenge
  – Why is there block allocation repeatability on JFS and ReiserFS? Why not on ext3, ext4, and XFS?
  – Is it caused by the installer?
• This is important for fixed-size deduplication storage.

Page 31

Conclusions
• I showed evidence of the affinity between file systems and CAS (fixed-size deduplication storage).
• The results indicate:
  – We should NOT use ext3 on deduplication storage (IaaS cloud).
  – Candidate good FS for deduplication storage (each has a caveat):
    • NTFS was good in the deduplication test, but Linux cannot boot from it.
    • ext4 was stable in both the deduplication test and the real case, but it was not the best.
    • ReiserFS showed good results in the real case, but has weak points.
    • JFS showed a high same-chunk ratio, but its other results were not good.
    • btrfs was good in the deduplication test, but has not yet been tested in the real case.

# Please discuss or comment.

Page 32

Reference
• EuroSys 2011 Tutorial “Data Deduplication” by Andre Brinkmann (University of Paderborn)
  – PDF: http://bit.ly/khrs1a