Used at Linux Symposium 2011 "Analysis of Disk Access Patterns on File Systems for Content Addressable Storage"
Analysis of Disk Access Patterns on File Systems
for Content Addressable Storage
Kuniyasu Suzaki, Kengo Iijima, Toshiki Yagi, Cyrille Artho
Research Center for Information Security
Linux Symposium 2011 at Ottawa
What I want to talk about!
• I show evidence of the affinity between file systems and CAS (fixed-size deduplication storage).
• The evidence indicates:
  – We should NOT use ext3 on deduplication storage (IaaS cloud).
  – Candidates for a good FS on deduplication storage:
    • NTFS was good on the deduplication test, but it cannot boot Linux.
    • ext4 was stable on the deduplication test and in the real case, but it was not the best.
    • ReiserFS showed good results in the real case, but has weak points.
    • JFS showed a high same-chunk ratio, but its other results were not good.
    • btrfs was good on the deduplication test, but has not been tested in the real case yet.
# Please discuss or comment.
Contents
• What is CAS? What is deduplication?
• Block allocation strategy of file systems
• Preliminary evaluation of the affinity between file systems and CAS
  – Propose a file deduplication test, and evaluate 9 file systems (ext3, ext4, XFS, JFS, ReiserFS, NILFS, btrfs, FAT32 and NTFS).
• Real case evaluation
  – Ubuntu installed on ext3/ext4/JFS/ReiserFS/XFS on CAS
• Conclusion
CAS: Content addressable Storage
Address            SHA-1
0000000-0003FFF    4ad36ffe8…
0004000-0007FFF    974daf34a…
0008000-000BFFF    2d34ff3e1…
000C000-000FFFF    974daf34a…
…                  …
[Figure: a virtual disk is indexed into a CAS storage archive. Chunks with the same SHA-1 are shared (deduplication); a new block is created with a new SHA-1.]
• Virtual block device. Data is not addressed by its physical location; it is addressed by a unique name derived from the content (usually a secure hash).
• Identical contents are represented by one original (same hash), and the other copies are addressed by an indirect link (deduplication storage).
  – Plan 9 has Venti [USENIX FAST02]
  – Data Domain (EMC) Deduplication [USENIX FAST08]
  – LBCAS (Loopback Content Addressable Storage) [LinuxSymp09]
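The idea of addressing data by content can be sketched in a few lines of Python. This is only a toy model, not the LBCAS implementation: the class name ToyCAS and its methods are hypothetical, and the chunk size of 32 KB is simply one of the sizes evaluated later in the talk.

```python
import hashlib

CHUNK_SIZE = 32 * 1024  # 32 KB, one of the chunk sizes evaluated in this talk


class ToyCAS:
    """Toy fixed-size CAS (hypothetical): each distinct chunk is stored once, keyed by SHA-1."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.store = {}   # SHA-1 hex digest -> chunk bytes (stored only once)
        self.index = []   # virtual-disk chunk number -> SHA-1 hex digest

    def write_image(self, image: bytes) -> None:
        """Split a disk image into fixed-size chunks and index them by hash."""
        self.index = []
        for off in range(0, len(image), self.chunk_size):
            chunk = image[off:off + self.chunk_size]
            digest = hashlib.sha1(chunk).hexdigest()
            self.store.setdefault(digest, chunk)  # same content -> same entry
            self.index.append(digest)

    def read_chunk(self, n: int) -> bytes:
        """Resolve a virtual-disk chunk number through the index."""
        return self.store[self.index[n]]

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.store.values())


a = bytes([1]) * CHUNK_SIZE
b = bytes([2]) * CHUNK_SIZE
cas = ToyCAS()
cas.write_image(a + b + a + a)  # four chunks on the virtual disk, two distinct contents
print(len(cas.index), len(cas.store), cas.stored_bytes())
```

The virtual disk still has four addressable chunks, but only two are kept in the store; the repeated chunks are reached through the index, which plays the role of the indirect link.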
Fixed Size vs. Variable Length
• Contents for deduplication are managed in units called “chunks”.
• According to the chunk size, CAS is divided into 2 categories.
  – Fixed size: efficient, but cannot find identical contents that do not match the chunk alignment.
    • A chunk is usually bigger than 4KB (the FS block size) for performance.
  – Variable length: finds identical contents of any length, but is not efficient.
• In this talk, we assume CAS with fixed-size chunks.
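The alignment limitation of fixed-size chunks can be illustrated with a small sketch. The helper unique_chunks is hypothetical; the data is random, and the 4 KB shift stands in for one file-system block placed in front of the second copy.

```python
import hashlib
import os


def unique_chunks(data: bytes, chunk_size: int) -> int:
    """Count the distinct fixed-size chunks in a byte string."""
    return len({hashlib.sha1(data[i:i + chunk_size]).digest()
                for i in range(0, len(data), chunk_size)})


CHUNK = 32 * 1024
BLOCK = 4 * 1024
content = os.urandom(8 * CHUNK)  # 256 KB of random data

# Second copy starts exactly on a chunk boundary.
aligned = content + content
# Second copy starts 4 KB (one FS block) past a chunk boundary.
shifted = content + bytes(BLOCK) + content

print(unique_chunks(aligned, CHUNK))  # the second copy adds no new chunks
print(unique_chunks(shifted, CHUNK))  # every chunk of the second copy is new
```

Both layouts hold the same two copies of the content, but in the shifted layout no chunk of the second copy matches a chunk of the first, so fixed-size deduplication finds nothing to share.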
Open Source CAS (deduplication storage)
• LBCAS: Loopback Content Addressable Storage
  – http://openlab.jp/oscircular/
• SDFS: A user space deduplication file system
  – http://www.opendedup.org/
• lessfs: Open source data deduplication for less
  – http://www.lessfs.com/
# In this talk, I use LBCAS.
Where is it used?
• The current main target is backup servers.
  – Many commercial products exist (EMC, Symantec, NetApp, etc.).
• IaaS hosts many virtual machines and keeps many virtual disks for them. Fortunately, most people use popular OSes, so the disks hold the same contents.
• Deduplication is applied to reduce the storage consumed by the many virtual disks.
• Even if the same contents are saved in virtual disks, the effect of fixed-size deduplication depends on how the file system stores the data on the virtual disk.
File Systems
• Linux has many file systems for many purposes.
• A file system works as a filter that allocates data on a disk.
  – Each filter changes the location of data by its own strategy.
  – Depending on the location, the effect of deduplication changes.
File Systems
File System      Feature for block allocation
ext3 *           ext2 with journaling; Block Group is imported from FFS.
ext4 *           Successor of ext3; extent allocation, delayed allocation.
JFS *            Dynamic i-node allocation, extent allocation.
XFS *            Variable block size, extent allocation.
ReiserFS (v3) *  Block sub-allocation (tail packing).
Nilfs            Stackable (log-structured) FS.
Btrfs            Copy-on-write, extent allocation.
FAT32            FS for Windows; file allocation table. No journaling.
NTFS             FS for Windows NT; extent allocation. Linux uses the NTFS-3G driver.
“*” indicates a bootable FS.
All file systems except FAT32 have a journaling function.
Allocation Techniques
• Extent allocation
  – Keeps contiguous physical blocks for a file and reduces fragmentation.
• Block sub-allocation (tail packing)
  – Allocates the last partial blocks (less than 4KB) of multiple files into a single block.
• Stackable (log-structured) FS
  – Allocates data in succession from the top to the tail of a disk.
To increase deduplication
• The FS (which is a filter to allocate data on a disk) should keep some features:
  – Alignment matching
    • If the FS allocates each file to fit the chunk alignment, the file is easily deduplicated.
  – Contiguous allocation of data blocks
    • If 4KB data blocks are not allocated contiguously, deduplication will be reduced, especially for a large file. Extents solve this problem.
  – Non-contaminated chunks
    • If a chunk is shared by several files, deduplication will be reduced.
    • If a 4KB data block is shared with another file (block sub-allocation), deduplication will be reduced. (ReiserFS will not fit.)
File Deduplication TEST
• When 1,000 files with the same 1MB content are stored on a disk through a normal file system, they use 1,000 MB of storage.
• However, if CAS deduplication works perfectly, the files are saved in only 1MB.
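The ideal case can be checked with a short sketch. It assumes that every file starts on a chunk boundary and that 1 MB is taken as 1 MiB so that a file is a whole number of 32 KB chunks; both are simplifications for illustration, not the conditions of the actual test.

```python
import hashlib
import os

CHUNK = 32 * 1024
FILE_SIZE = 1024 * 1024   # assumption: 1 MB taken as 1 MiB (32 whole chunks)
N_FILES = 1000

content = os.urandom(FILE_SIZE)
file_chunks = [content[i:i + CHUNK] for i in range(0, FILE_SIZE, CHUNK)]

store = {}     # SHA-1 digest -> chunk length (a chunk is counted only once)
logical = 0    # bytes written through the file system
for _ in range(N_FILES):          # every file is aligned to a chunk boundary
    for chunk in file_chunks:
        store.setdefault(hashlib.sha1(chunk).digest(), len(chunk))
        logical += len(chunk)

stored = sum(store.values())
print(logical // FILE_SIZE, "MB written,", stored // FILE_SIZE, "MB stored")
```

Under these assumptions the 1,000 MB written collapse to the 1 MB of one file. A real file system departs from this whenever it breaks alignment, contiguity, or non-contamination.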
[Figure: File Deduplication TEST. Same-content files are saved to a disk through a file system, which works as a filter and allocates the files by its own strategy. File System A allocates files with alignment, contiguity, and non-contamination; as a result, chunks are identified and deduplicated by the CAS system. With File System B, only a few chunks are deduplicated. The resulting CAS volumes are compared.]
File Deduplication TEST
• We saved identical files to fill 1 GB on a 4GB LBCAS (we evaluate 2 chunk sizes: 32KB and 256KB).
  – The files have the same random data.
• 5 cases
  – 100 KB file * 10,000
  – 1,000 KB (1 MB) file * 1,000
  – 10,000 KB (10 MB) file * 100
  – 256KB file * 3,906
    • Checks whether data is allocated on a power-of-2 alignment.
  – 252KB file * 3,968
    • Used for comparison with the 256KB file. If one 4KB block is used for metadata or something similar, the file will fit the power-of-2 alignment.
• We assume a stackable FS fits the 256KB or 252KB file case.
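The file counts and the reasoning behind the 252KB case can be reproduced with simple arithmetic. The sketch below assumes that "1 GB" is counted as 1,000,000 KB and that files are packed one after another; the 4 KB block in front of each file (META_KB) is a hypothetical stand-in for the "metadata or something" mentioned above.

```python
TARGET_KB = 1_000_000  # assumption: "1 GB" counted as 1,000,000 KB


def file_count(size_kb: int) -> int:
    """Number of whole files of size_kb that fit into the target volume."""
    return TARGET_KB // size_kb


def distinct_offsets(stride_kb: int, chunk_kb: int, n_files: int) -> int:
    """Number of different start positions, relative to a chunk boundary,
    when successive files are placed stride_kb apart."""
    return len({(i * stride_kb) % chunk_kb for i in range(n_files)})


for size_kb in (100, 1_000, 10_000, 256, 252):
    print(f"{size_kb:>6} KB x {file_count(size_kb)}")

META_KB = 4  # hypothetical: one 4 KB block placed in front of every file
for size_kb in (256, 252):
    for chunk_kb in (32, 256):
        n = file_count(size_kb)
        back_to_back = distinct_offsets(size_kb, chunk_kb, n)
        with_meta = distinct_offsets(size_kb + META_KB, chunk_kb, n)
        print(f"file {size_kb} KB, chunk {chunk_kb} KB: "
              f"{back_to_back} offsets back to back, {with_meta} with one extra block")
```

A single offset means every file sits at the same position inside its chunks, which is what fixed-size deduplication needs. In this model the 256KB file keeps a single offset only when packed back to back, and the 252KB file only when one extra 4KB block accompanies each file.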
Result overview
• Nilfs and ext3 are bad.
• Most FSes do not treat the 10MB file well.
  – Contiguous allocation is not kept.
• The 252KB and 256KB files do not show special features.
[Figure: results for the 32KB chunk and the 256KB chunk.]
The smaller chunk has more chances to be deduplicated, but the overhead becomes heavy.
Result detail
• The ideal deduplication line shows the ideal smallest CAS. The closer a bar is to the line, the better.
• NTFS is good on both the 32KB and 256KB chunk.
• ext4 and btrfs are good on the 32KB chunk.
[Figure: results for the 32KB chunk and the 256KB chunk.]
Result: Comparison between 32KB and 256KB chunk
• (CAS size on 256KB chunk) / (CAS size on 32KB chunk)
• The ratio shows how much worse deduplication becomes on the larger chunk size (from 32KB to 256KB, x8).
• FAT32 shows durability for the larger chunk.
  – Almost 4 times on any file size, but the deduplication is not good.
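The ratio above can be computed for any disk image. The sketch below uses two synthetic layouts chosen only to show how far the ratio can move; the function names cas_size and growth are hypothetical, and the layouts are not taken from the measured file systems.

```python
import hashlib
import os

KB = 1024


def cas_size(image: bytes, chunk_size: int) -> int:
    """Bytes needed to store the distinct fixed-size chunks of an image."""
    distinct = {hashlib.sha1(image[i:i + chunk_size]).digest()
                for i in range(0, len(image), chunk_size)}
    return len(distinct) * chunk_size


def growth(image: bytes) -> float:
    """(CAS size on 256KB chunk) / (CAS size on 32KB chunk)."""
    return cas_size(image, 256 * KB) / cas_size(image, 32 * KB)


# Layout 1: 16 aligned copies of the same 256 KB of random data.
aligned = os.urandom(256 * KB) * 16

# Layout 2: 16 regions of 256 KB, each holding 32 KB of fresh random data
# followed by 224 KB of zeros.
scattered = b"".join(os.urandom(32 * KB) + bytes(224 * KB) for _ in range(16))

print(growth(aligned))
print(growth(scattered))
```

In the first layout the larger chunk costs nothing, so the ratio is 1. In the second layout each 256KB chunk is made unique by a small amount of distinct data, and the ratio approaches the x8 difference in chunk size.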
Summary of File Deduplication TEST
• ext3 and Nilfs are not good for fixed-size deduplication (LBCAS).
• NTFS is good on both chunk sizes (32KB and 256KB) and on any file size (100KB, 1MB, 10MB, 252KB and 256KB).
• ext4 and Btrfs are good on the 32KB chunk size.
Real Case Evaluation
• We evaluate installing and booting Ubuntu (11.04 desktop) on CAS.
• Ubuntu is installed on different file systems.
  – The contents on each CAS are almost the same, so we evaluate the features of the file system.
• Target file systems are the bootable FSes that GRUB recognizes.
  – ext3, ext4, XFS, JFS, and ReiserFS
• We evaluate the dynamic behavior at installing and booting, and the static CAS images.
Evaluation condition
• Ubuntu 11.04 desktop is installed on a 4GB virtual disk (LBCAS) using a KVM virtual machine.
• The KVM guest has 768 MB of memory, and runs on a ThinkPad T400 (Intel Core2 Duo, 2 GB memory).
• We compared the effect of 32 KB and 256 KB chunk of LBCAS.
Statistics for each file size in Ubuntu
• The contents installed by Ubuntu are almost 2GB on any FS.
  # Sizes less than 4KB are rounded up to 4KB, because a normal block is 4KB.
• 77.9% of the files are smaller than 4KB, but together they occupy only 20.1% of the disk space.
• File systems work as a filter and allocate the files by their own strategy.
[Figure: distribution of file sizes; total size 2GB, total number of files 132,205.]
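The rounding rule and the two percentages can be cross-checked with a rough calculation. The function allocated_size is hypothetical; the sketch assumes that every file occupies at least one 4KB block and that "almost 2GB" is exactly 2 GiB, so the result is only an approximation of the slide's figure.

```python
BLOCK = 4096  # normal FS block size in bytes


def allocated_size(file_size: int, block: int = BLOCK) -> int:
    """Round a file size up to whole blocks.
    Assumption: every file, even an empty one, occupies at least one block."""
    blocks = max(1, -(-file_size // block))  # ceiling division
    return blocks * block


TOTAL_FILES = 132_205
SMALL_SHARE = 0.779            # fraction of files smaller than 4KB
TOTAL_BYTES = 2 * 1024 ** 3    # assumption: "almost 2GB" taken as 2 GiB

small_files = round(TOTAL_FILES * SMALL_SHARE)
small_bytes = small_files * allocated_size(1)   # each small file costs one block
share = small_bytes / TOTAL_BYTES
print(f"{small_files} small files occupy about {share:.1%} of the disk")
```

The estimate lands near the reported 20%, which suggests the figure is dominated by the rounding of small files to one block each.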
Access Trace on each FS
[Figure: access traces over the 4GB disk for ext3, ext4, JFS, XFS, and ReiserFS; installing (2,000 sec) and booting (120 sec). Red is read, green is write.]
Installing
• The amount of write requests was more than 3GB, and it was reduced on LBCAS (by more than 1GB).
  – It means the installer issues redundant write requests.
• XFS requires the most write requests, even though almost the same image is installed. JFS requires the least.
[Figure: amount of read and write requests issued from the installer, and accessed chunks (MB). Remember the amount of files is 2GB. Reduced by more than 1GB.]
Overhead for creating FS
• Creating an FS (mkfs) has many losses from the viewpoint of LBCAS, except for JFS.
  – It means creating the FS issues redundant write requests. However, the loss at installation (more than 1GB) is not accounted for by creating the FS.
[Figure: amount of write requests issued from mkfs, and created chunks (MB). ext3 had more than 100MB loss. JFS has almost no loss, which means the chunks are full of data.]
Static Disk Image (Coverage of created chunks)
• ReiserFS made the smallest CAS image, thanks to tail packing.
[Figure: coverage of created chunks (MB); left is 32KB chunk, right is 256KB chunk. 10% is reduced by tail packing. There is only one zero-filled chunk, but it covers half of the disk. Remember the amount of files is 2GB.]
Deduplication on each Single Disk Image
• ext3 and ext4 have many identical chunks, which are deduplicated. However, the total is too small (less than 80MB) compared to the 2GB image, so the impact is low within a single disk image.
• We should evaluate the ratio of same chunks across other CAS images (discussed later).
[Figure: effect of deduplication on each disk image (MB), showing the amount reduced by deduplication; left is 32KB chunk, right is 256KB chunk.]
Booting
• The amount of chunks read at boot time is more than the amount requested by the OS.
  – Redundant data is read from CAS.
  – The file system should be optimized to pack data into chunks.
• See our paper presented at the ASPLOS 2011 workshop “Resolve”.
[Figure: requests issued by OS booting, and chunks read for the requests (MB).]
Relation between CAS Images
• Compare the ratio of same chunks between different FSes.
• Compare the ratio of same chunks between different installations with the same FS.
• The results indicate the affinity of CAS images on multi-tenant IaaS.
  – A high ratio is desired.
[Figure: CAS images of ext3, ext4, ReiserFS, JFS, and XFS are compared between different file systems, and each CAS image is compared with another CAS image from a separate installation on the same file system.]
Relation between CAS Images
• From the upper graph
  – There is no strong relation between different FSes.
  – The 4KB chunk has high similarity, because most file systems use a 4KB block.
• From the lower graph
  – JFS and ReiserFS show a high same-chunk ratio on any chunk size. We guess there is block allocation repeatability.
• ReiserFS has block sub-allocation (tail packing), and the total CAS size is reduced by 10%. Nevertheless, there are many same chunks across different installations. It means that there are identical combinations of sub-allocations.
[Figure: upper graph, between different file systems; lower graph, between different installations with the same file system.]
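One way to compute a same-chunk ratio between two CAS images is sketched below. The exact definition used for the graphs is not given on the slides, so the definition here (distinct chunks of image A that also appear in image B, divided by the distinct chunks of A) is an assumption, and the function names are hypothetical.

```python
import hashlib
import os

BLOCK = 4 * 1024


def chunk_set(image: bytes, chunk_size: int) -> set:
    """Set of SHA-1 digests of the fixed-size chunks of an image."""
    return {hashlib.sha1(image[i:i + chunk_size]).digest()
            for i in range(0, len(image), chunk_size)}


def same_chunk_ratio(image_a: bytes, image_b: bytes, chunk_size: int) -> float:
    """Assumed definition: share of A's distinct chunks that also occur in B."""
    a = chunk_set(image_a, chunk_size)
    b = chunk_set(image_b, chunk_size)
    return len(a & b) / len(a)


# Two images holding the same sixteen 4 KB blocks, allocated in opposite order.
blocks = [os.urandom(BLOCK) for _ in range(16)]
image_a = b"".join(blocks)
image_b = b"".join(reversed(blocks))

print(same_chunk_ratio(image_a, image_b, 4 * 1024))
print(same_chunk_ratio(image_a, image_b, 32 * 1024))
```

With 4KB chunks every block is found in the other image regardless of where it was allocated, while with 32KB chunks nothing matches. This is consistent with the observation that the 4KB chunk shows high similarity, and it shows why larger chunks only match when block allocation is repeated.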
Block Allocation Repeatability
• Next challenge
  – Why is there block allocation repeatability on JFS and ReiserFS? Why not on ext3, ext4 and XFS?
  – Is it caused by the installer?
• Important for fixed-size deduplication storage
Conclusions
• I show evidence of the affinity between file systems and CAS (fixed-size deduplication storage).
• The results indicate:
  – We should NOT use ext3 on deduplication storage (IaaS cloud).
  – Candidates for a good FS on deduplication storage:
    • NTFS was good on the deduplication test, but it cannot boot Linux.
    • ext4 was stable on the deduplication test and in the real case, but it was not the best.
    • ReiserFS showed good results in the real case, but has weak points.
    • JFS showed a high same-chunk ratio, but its other results were not good.
    • btrfs was good on the deduplication test, but has not been tested in the real case yet.
# Please discuss or comment.
Reference
• EuroSys 2011 Tutorial “Data Deduplication” by Andre Brinkmann (University of Paderborn)
  – PDF http://bit.ly/khrs1a