
Flexible Storage Allocation

A. L. Narasimha Reddy
Department of Electrical and Computer Engineering
Texas A & M University

Students: Sukwoo Kang (now at IBM Almaden), John Garrison

2 Texas A&M University Narasimha Reddy 5/1/2008

Outline

Big Picture

Part I: Flexible Storage Allocation

– Introduction and Motivation

– Design of Virtual Allocation

– Evaluation

Part II: Data Distribution in Networked Storage Systems

– Introduction and Motivation

– Design of User-Optimal Data Migration

– Evaluation

Part III: Storage Management across diverse devices

Conclusion


Storage Allocation

Allocate entire storage space at the time of the file system creation

Storage space owned by one operating system cannot be used by another

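[Figure: static allocations compared with actual allocations for Windows NT (NTFS), Linux (ext2), and AIX (JFS). Size labels: 30 GB, 50 GB, 70 GB, 50 GB, 98 GB. One file system is marked "Running out of space!"]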


Big Picture

Memory systems employ virtual memory for several reasons

Current storage systems lack such flexibility

Current file systems allocate storage statically at the time of their creation

– Storage allocation: Space on the disk is not allocated well across multiple file systems


File Systems with Virtual Allocation

When a file system is created with X GB, virtual allocation:

– Allows the file system to be created with only Y GB of physical storage, where Y << X

– Remaining space used as one common available pool

– As the file system grows, the storage space can be allocated on demand
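The common-pool idea above can be sketched in a few lines. This is a minimal illustration, not the prototype's code: the class name `StoragePool`, its methods, and all sizes are hypothetical.

```python
class StoragePool:
    """Common pool of physical storage shared by several file systems (sizes in GB)."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.allocated = {}  # file system name -> GB physically allocated so far

    def free_gb(self):
        return self.capacity_gb - sum(self.allocated.values())

    def create_fs(self, name, initial_gb):
        """Create a file system backed by only initial_gb (the Y in the text)."""
        self.allocated[name] = 0
        self.grow(name, initial_gb)

    def grow(self, name, extra_gb):
        """Allocate more physical space on demand as the file system grows."""
        if extra_gb > self.free_gb():
            raise RuntimeError("common pool exhausted")
        self.allocated[name] += extra_gb


pool = StoragePool(capacity_gb=200)   # hypothetical pool size
pool.create_fs("ntfs", initial_gb=10)
pool.create_fs("ext2", initial_gb=10)
pool.grow("ext2", 30)                 # ext2 needs more space; it comes from the shared pool
print(pool.free_gb())
```

Because no file system owns space it has not yet written, whichever file system grows first can draw on the remaining pool.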

[Figure: Windows NT (NTFS), Linux (ext2), and AIX (JFS) file systems drawing on a common storage pool. Labels: 30 GB, 50 GB, 98 GB, 10 GB, 10 GB (actual allocations); 60 GB, 40 GB, 100 GB (common storage pool).]


Our Approach to Design

[Figure: physical disk, indexed by physical block address]

Employ Allocate-on-write policy

– Storage space is allocated when the data is written

– Writes all data to disk sequentially based on the time at which data is written to the device

– Once data is written, data can be accessed from the same location, i.e., data is updated in-place


Allocate-on-write Policy

[Figure: physical disk; a write at t = t’ allocates an extent]

Storage space is allocated in units of extents when the data is written

An extent is a group of file system blocks

– Fixed size

– Retain more spatial locality

– Reduce information that must be maintained
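With fixed-size extents, finding the extent that holds a block is simple arithmetic. The sketch below assumes a hypothetical extent size of 128 blocks; the slides do not state the value used.

```python
BLOCKS_PER_EXTENT = 128   # assumed fixed extent size, in file system blocks


def locate(logical_block):
    """Return (logical extent number, block offset inside that extent)."""
    return divmod(logical_block, BLOCKS_PER_EXTENT)


print(locate(300))   # block 300 falls in extent 2, at offset 44
```

One mapping entry per extent, rather than per block, is what reduces the information that must be maintained.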


Allocate-on-write Policy

[Figure: physical disk with Extent0 allocated by a write at t = t’ and Extent1 allocated by a write at t = t’’ (where t’’ > t’)]

Data is written to disk sequentially based on write-time

– Further writes to the same data are applied in place

– VA (Virtual Allocation) requires an additional data structure


Block Map

[Figure: physical disk with Extent0 (write at t = t’), Extent1 (write at t = t’’, where t’’ > t’), and Extent2, indexed through the block map]

The block map keeps a mapping between logical storage locations and real (physical) storage locations
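A minimal sketch of allocate-on-write with a block map follows. The class `VirtualDisk`, the dictionary-based map, and the tiny extent size are illustrative assumptions; the prototype is a Linux kernel module and is not reproduced here.

```python
class VirtualDisk:
    """Maps logical extents to physical extents, allocating on first write."""

    def __init__(self, blocks_per_extent=4):
        self.blocks_per_extent = blocks_per_extent
        self.block_map = {}      # logical extent number -> physical extent number
        self.next_physical = 0   # next unallocated physical extent
        self.disk = {}           # physical block address -> data (stand-in for the disk)

    def _physical_block(self, logical_block, allocate):
        extent, offset = divmod(logical_block, self.blocks_per_extent)
        if extent not in self.block_map:
            if not allocate:
                return None
            # Allocate the next physical extent, so placement follows write time.
            self.block_map[extent] = self.next_physical
            self.next_physical += 1
        return self.block_map[extent] * self.blocks_per_extent + offset

    def write(self, logical_block, data):
        self.disk[self._physical_block(logical_block, allocate=True)] = data

    def read(self, logical_block):
        pba = self._physical_block(logical_block, allocate=False)
        return self.disk.get(pba) if pba is not None else None


vd = VirtualDisk()
vd.write(40, b"first")    # logical extent 10 is written first -> physical extent 0
vd.write(3, b"second")    # logical extent 0 is written later  -> physical extent 1
vd.write(40, b"update")   # rewrite goes to the same physical location (in place)
print(vd.block_map)
```

Reads and writes both go through the same lookup; only a write to an unmapped extent changes the map.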


VA Metadata

[Figure: physical disk holding Extent0, Extent1, Extent2, and VA metadata; the in-memory block map is hardened to the on-disk VA metadata]

This block map is maintained in memory and regularly written to disk for hardening against system failures

VA Metadata represents the on-disk block map
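Hardening can be pictured as periodically flushing the in-memory map to stable storage. The sketch below is an assumption-laden stand-in: it uses a JSON file instead of an on-disk VA metadata region, and the write-then-rename step is a common user-space technique, not something the slides describe. The function names `harden` and `recover` are hypothetical.

```python
import json
import os
import tempfile


def harden(block_map, path):
    """Write the in-memory block map to disk and force it to stable storage."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({str(k): v for k, v in block_map.items()}, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the data reaches the device
    os.replace(tmp, path)         # switch to the new copy in one step


def recover(path):
    """Reload the hardened block map after a restart."""
    with open(path) as f:
        return {int(k): v for k, v in json.load(f).items()}


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "va_metadata.json")
    harden({10: 0, 0: 1}, path)
    restored = recover(path)
```

How often to harden is a trade-off: more frequent flushes lose fewer mappings on a crash but add write traffic.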


On-disk Layout & Storage Expansion

[Figure: on-disk layout and storage expansion. Labels: Physical Disk (FS Metadata, VA Metadata, Extent0 to Extent2), Virtual Disk (Extent3 to Extent7), Storage Expansion Threshold, Storage Expansion.]

When the capacity is exhausted or reaches storage expansion threshold, a physical disk can be expanded to other available storage resources

– File system unaware of the actual space allocation and expansion
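The expansion trigger can be sketched as a check made on each extent allocation. The class `ExpandableDisk`, the 80% threshold, and the list of spare resources are hypothetical values chosen for illustration.

```python
class ExpandableDisk:
    def __init__(self, extents, threshold=0.8):
        self.total_extents = extents
        self.used_extents = 0
        self.threshold = threshold   # assumed fraction of capacity that triggers expansion

    def allocate_extent(self, spare_pool):
        """Allocate one extent, expanding from spare_pool once usage crosses the threshold."""
        if self.used_extents + 1 > self.threshold * self.total_extents and spare_pool:
            self.total_extents += spare_pool.pop(0)   # absorb another storage resource
        if self.used_extents >= self.total_extents:
            raise RuntimeError("no physical space left")
        self.used_extents += 1


disk = ExpandableDisk(extents=10)
spare = [10]                      # one spare resource of 10 extents
for _ in range(9):
    disk.allocate_extent(spare)   # the 9th allocation crosses 80% and expands the disk
print(disk.total_extents, disk.used_extents)
```

The caller above the block layer sees only successful allocations, which mirrors the point that the file system is unaware of the expansion.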


Write Operation

[Figure: write path. Application Write Request → File System → Buffer/Page Cache Layer (Page, Acknowledgement) → Block I/O Layer (VA): search VA block map; allocate new extent and update mapping information → Disk (FS Metadata, VA Metadata, Extent0 to Extent3); Hardening.]


Read Operation

[Figure: read path. Application Read Request → File System → Buffer/Page Cache Layer → Block I/O Layer (VA): search VA block map → Disk (FS Metadata, VA Metadata, Extent0 to Extent3).]


Allocate-on-write vs. Other Work

Key difference from log-structured file systems (LFS)

– Only allocation is done at the end of log

– Updates are done in-place after allocation

LVM still ties up storage at the time of file system creation


Design Issues

Extent-based Policy Example (with Ext2)

– I (inode), B (data block), V (VA block map)

– A B (B is allocated to A)

File system-based Policy Example (with Ext3 ordered mode)

VA Metadata Hardening (File System Integrity)

– Must keep certain update ordering of VA metadata and FS (meta)data


Design Issues (cont.)

Extent Size

– Larger extent size: Reduce block map size, retain more spatial locality, cause data fragmentation
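The effect of extent size on block map size can be estimated with simple arithmetic. The 8-byte entry size and the 100 GiB device below are assumptions for illustration, not figures from the prototype.

```python
GIB = 1024 ** 3
KIB = 1024


def block_map_bytes(device_bytes, extent_bytes, entry_bytes=8):
    """Size of a block map with one entry per extent (entry size is assumed)."""
    extents = -(-device_bytes // extent_bytes)   # ceiling division
    return extents * entry_bytes


for extent_kib in (64, 256, 1024):
    size = block_map_bytes(100 * GIB, extent_kib * KIB)
    print(f"{extent_kib:5d} KiB extents -> {size / KIB:8.0f} KiB of map")
```

Under these assumptions, each 4x increase in extent size shrinks the map by 4x, while a write to a single block still claims a whole extent.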

Reclaiming allocated storage space of deleted files

– Needed to continue to provide the benefits of virtual allocation

– Without reclamation, possible to turn virtual allocation into static allocation

Interaction with RAID

– RAID remaps blocks to physical devices to provide device characteristics

– VA remaps blocks for flexibility

– Need to resolve performance impact of VA’s extent size and RAID’s chunk size


Spatial Locality Observations & Issues

Metadata and data separation

Data clustering: Reduce seek distance

Multiple file systems

Data placement policy

– Allocate hot data in a high data rate region of the disk

– Allocate hot data in the middle of the partition


Implementation & Experimental Setup

Virtual allocation prototype

– Kernel module for Linux 2.4.22

– Employ a hash table in memory for speeding up VA lookups

Setup

– A 3GHz Pentium 4 processor, 1GB main memory

– Red Hat Linux 9 with a 2.4.22 kernel

– Ext2 file system and Ext3 file system

Workloads

– Bonnie++ (Large-file workload)

– Postmark (Small-file workload)

– TPC-C (Database workload)


VA Metadata Hardening

[Figure: chart data labels: -7.3, -3.3, -1.2, +4.9, +8.4, +9.5]

Compare EXT2 and VA-EXT2-EX

Compare EXT3 and VA-EXT3-EX, VA-EXT3-FS


Reclaiming Allocated Storage Space

Reclaim operation for deleted large files

How to keep track of deleted files?

– Employed stackable file system: Maintain duplicated block bitmap

– Alternatively, could employ “Life or Death at Block-Level” (OSDI’04) work
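One way to picture reclamation is to compare the file system's block bitmap against the VA block map and release extents that no longer hold live blocks. This is a sketch under assumptions: the function `reclaim`, the set-based bitmap, and the whole-extent granularity are illustrative choices.

```python
def reclaim(block_map, in_use_blocks, blocks_per_extent):
    """Drop mappings for logical extents with no live blocks.

    block_map:      logical extent number -> physical extent number
    in_use_blocks:  logical block numbers the file system still uses
                    (stand-in for the duplicated block bitmap)
    Returns the physical extents that were freed.
    """
    live_extents = {b // blocks_per_extent for b in in_use_blocks}
    freed = []
    for logical in sorted(block_map):          # sorted() copies the keys, so pop is safe
        if logical not in live_extents:
            freed.append(block_map.pop(logical))
    return freed


block_map = {0: 5, 1: 6, 2: 7}
freed = reclaim(block_map, in_use_blocks={0, 1, 9}, blocks_per_extent=4)
print(freed, block_map)
```

An extent with even one live block stays mapped, so reclamation pays off mostly for deleted large files, as the slide notes.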


VA with RAID-5

[Figure panels: Large-file workload; Small-file workload; Large-file workload with NVRAM]

Used Ext2 with software RAID-5 + VA

NVRAM-X%: X% of total VA metadata size

[Legend: VA-RAID-5 NO-HARDEN, VA-RAID-5 NVRAM-17%, VA-RAID-5 NVRAM-4%, VA-RAID-5 NVRAM-1%]


Data Placement Policy (Postmark)

VA NORMAL partition: Same data rate across a partition

VA ZCAV partition: Hot data is placed in the high data rate region of a partition


VA-NORMAL: start allocation from the outer cylinders

VA-MIDDLE: start allocation from the middle of a partition


Multiple File Systems

VA-7GB: 2 x 3.5GB partition, 30% utilization

VA-32GB: 2 x 16GB partition, 80% utilization

Used Postmark

VA-HALF: The 2nd file system is created after 40% of the 1st file system is written

VA-FULL: The 2nd file system is created after 80% of the 1st file system is written


Real-World Deployment of Virtual Allocation

Prototype built


VA in Networked Storage Environment

Flexible allocation provided by VA leads to

– Balancing locality vs. load balance issues


Part II: Data Distribution

Locality-based approach

– Use data migration (e.g. HP AutoRAID)

– Employ “hot” data migration from slower device (remote disk) to faster device (local disk)

Load balancing-based approach (Striping)

– Exploit multiple devices to support the required data rates (e.g. Slice-OSDI’00)

[Figure labels: Hot data, Cold data]


User-Optimal Data Migration


Locality is exploited first

– Data is migrated from Disk B to Disk A

Load balancing is also considered

– If the load on Disk A is too high, data is migrated from Disk A to Disk B


Migration Decision Issues


Where to migrate: Use I/O request response time

When to migrate: Migration threshold

– Initiate migration from Disk A to Disk B only when

How to migrate: Limit number of concurrent migrations (Migration token)

What data to migrate: Active data
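The four decisions above can be combined into a small sketch. The exact migration condition is not reproduced in this text, so the ratio test below (`threshold`) is a hypothetical stand-in; the class `Migrator` and its method names are also assumptions.

```python
import threading


class Migrator:
    def __init__(self, threshold=1.5, tokens=1):
        # Hypothetical condition: migrate only if the source is this many
        # times slower than the destination. Not the slide's formula.
        self.threshold = threshold
        self.tokens = threading.Semaphore(tokens)   # migration tokens

    def should_migrate(self, rt_src, rt_dst):
        """Where/when: compare measured I/O response times of the two disks."""
        return rt_src > self.threshold * rt_dst

    def try_migrate(self, block, rt_src, rt_dst, move):
        """How/what: move an actively accessed block if a token is available."""
        if not self.should_migrate(rt_src, rt_dst):
            return False
        if not self.tokens.acquire(blocking=False):
            return False            # too many migrations already in flight
        try:
            move(block)
        finally:
            self.tokens.release()
        return True


moved = []
m = Migrator()
first = m.try_migrate("x", rt_src=30.0, rt_dst=10.0, move=moved.append)
second = m.try_migrate("y", rt_src=12.0, rt_dst=10.0, move=moved.append)
print(first, second, moved)
```

Calling `try_migrate` from the read or write path is one way to ensure that only active data is considered for migration.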


Design Issues

Allocation policy

– Striping with user-optimal migration: will improve data access locality

– Sequential allocation with user-optimal migration: will improve load balancing

Multi-user environment

– Each user migrates data in a user-selfish manner

– Migrations will tend to improve the performance of all users over longer periods of time


Evaluation

Implemented as a kernel block device driver

Evaluated it using SPECsfs benchmark

[Figure: panels titled Configuration and SPECsfs Performance Curve; labels: Single-User, Multi-User]


Single-User Environment

Striping with user-optimal migration

Seq. allocation with user-optimal migration

Configuration: (Allocation Policy)-(Migration Policy)

– STR (Striping), SEQ (Seq. Alloc.), NOMIG (No migration), MIG (User-Optimal migration)


Single-User Environment (cont.)

Comparison between migration systems

– Migration based on locality: hot data (remote → local), cold data (local → remote)


Multi-User Environment - Striping

Server A: Load from 100 to 700

Server B: Load from 50 to 350


Multi-User Environment – Seq. Allocation

Server A: Load from 100 to 1100

Server B: Load from 30 to 480


Storage Management Across Diverse Devices

Flash storage becoming widely available

– More expensive than hard drives

– Faster random accesses

– Lower power consumption

In laptops now

In hybrid storage systems soon

Manage data across different devices

– Match application needs to device characteristics

– Optimize for performance, power consumption


Motivation

VFS Allows many file systems underneath

VFS maintains 1 to 1 mapping from namespace to storage

Can we provide different storage options for different files for a single user?

– /user1/file1 → storage system 1, /user2/file2 → storage system 2…


Normal File System Architecture

[Figure: applications (Calc, Impress, Writer, WinAmp) in user space access files through VFS in the kernel. Ext2 holds /user1/* (/user1/file1, /user1/file2) and FAT32 holds /user2/* (/user2/file3, /user2/file4). Devices: Magnetic Disk, Flash Drive.]


Umbrella File System

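[Figure: applications (Calc, Impress, Writer, WinAmp) in user space access /user1/file1 through /user1/file4 via VFS in the kernel. UmbrellaFS sits between VFS and the underlying file systems (Ext3, Ext2, FAT32) and stores the files as /FS1/user1/file3, /FS2/user1/file1, /FS2/user1/file2, and /FS3/user1/file4. Devices: Encrypted Magnetic Disk, Magnetic Disk, Flash Drive.]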


Example Data Organization

User View

– /usr

– /usr/dir1

– /usr/dir1/foo.avi, /usr/dir1/foo.txt, /usr/dir1/foo.jpg

Underlying data organization

– /media/usr, /text/usr, /images/usr

– /media/usr/dir1, /text/usr/dir1, /images/usr/dir1

– /media/usr/dir1/foo.avi, /text/usr/dir1/foo.txt, /images/usr/dir1/foo.jpg


Motivation: Policy-Based Storage

User or system administrator choice

– Allow different types of files on different devices

– Reliability, performance, power consumption

Layered Architecture

– Leverage benefits of underlying file systems

– Map applications to file systems and underlying storage

Policy decisions can depend on namespace and metadata

– Example: Files not touched in a week → slow storage system


Rules Structure

Provided at mount time

User specified

Based on inode values (metadata) and filenames (namespace)

Provides array of branches


Umbrella File System

Sits under VFS to enforce policy

Policy enforced at open and close times

Policy also enforced periodically (less often)

UmbrellaFS acts as a “router” for files

– Not only based on namespace, but also metadata


Inode Rules Structure

Rule  Inode/Filename  Field               Match  Value                       Branch
1     Inode           file permissions    =      Read Only                   /fs1, /fs2
2     Filename        n/a                 n/a    n/a                         n/a
3     Inode           file creation time  >=     8:00 am, August 3rd, 2007   /fs2
4     Inode           file length         <      20 KB                       /fs3


Inode Rules

Provided in order of precedence

First match

Compare inode value to rule

– At file creation, some inode values are indeterminate

– Pass over those rules
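First-match evaluation with skipped indeterminate values can be sketched as follows. The tuple layout, field names, default branch, and the reading of 20 KB as 20 * 1024 bytes are assumptions made for the example, not UmbrellaFS's actual rule format.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<": operator.lt}

# (field, match, value, branches), listed in order of precedence
INODE_RULES = [
    ("permissions", "=", "read-only", ["/fs1", "/fs2"]),
    ("length", "<", 20 * 1024, ["/fs3"]),
]


def select_branches(inode, rules, default=("/fs1",)):
    """Return the branches of the first rule that matches the inode values."""
    for field, match, value, branches in rules:
        current = inode.get(field)
        if current is None:
            continue            # indeterminate (e.g., at file creation): pass over the rule
        if OPS[match](current, value):
            return list(branches)
    return list(default)        # hypothetical fallback when no rule matches


small_file = {"permissions": "read-write", "length": 100}
print(select_branches(small_file, INODE_RULES))
```

Because evaluation stops at the first match, a read-only file shorter than 20 KB goes to the branches of the permissions rule, not to /fs3.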


Filename Rules Structure

Rule  Match String        Branch
1     /*.avi              /fs2, /fs1
2     /home/*.txt         /fs1
3     /home/jgarrison/*   /fs3


Filename Rules

Once the first filename rule is triggered, all filename rules are checked

Similar to longest prefix matching

Double index based on

– Path matching

– Filename matching

Example:

– Rules: /home/*/*.bar, /home/jgarrison/foo.bar

– File: /home/jgarrison/foo.bar

– The file matches the second rule more closely (path length 3 and 7 characters of file name vs. path length 3 and 4 characters of file name)
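The closest-match comparison in the example can be sketched with a two-part score. This is one plausible reading, not the UmbrellaFS implementation: the score counts path components and then literal (non-wildcard) characters in the file name pattern, and it assumes a rule must have as many components as the path.

```python
from fnmatch import fnmatchcase


def score(rule, path):
    """Return (path length, literal filename characters) if rule matches path, else None."""
    rule_parts = rule.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(rule_parts) != len(path_parts):
        return None                       # assumed: component counts must agree
    if not all(fnmatchcase(p, r) for r, p in zip(rule_parts, path_parts)):
        return None
    return (len(rule_parts), len(rule_parts[-1].replace("*", "")))


def best_rule(rules, path):
    """Check every rule and keep the one with the highest score."""
    scored = [(score(r, path), r) for r in rules]
    scored = [(s, r) for s, r in scored if s is not None]
    return max(scored)[1] if scored else None


rules = ["/home/*/*.bar", "/home/jgarrison/foo.bar"]
print(best_rule(rules, "/home/jgarrison/foo.bar"))
```

For the file above, the first rule scores (3, 4) and the second scores (3, 7), so the second rule is selected, in line with the slide's example.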


Evaluation

Overhead

– Throughput

– CPU Limited

– I/O Limited

Example Improvement


UmbrellaFS Overhead

[Figure: Bonnie Read Overhead. Throughput (MB/s, axis 0 to 40) versus number of rules (Ext2 baseline, 1, 2, 4, 8, 16, 32). Series: Ext2, Inode Rules, Filename Rules.]


CPU Limited Benchmarks


I/O Limited Benchmarks


Flash vs. RAID5 Read Performance


Flash vs. RAID5 Write Performance

[Figure: Write Performance. Throughput (MB/s, axis 0 to 70) versus File Size (kB, 1 to 10000). Series: RAID 5, Flash SSD.]


Flash and Disk Hybrid System


Disks with Encryption hardware

[Figure: Encryption Example. Time (s, axis 0 to 800) for Partial Encryption and Full Encryption.]


Conclusion

Virtual allocation provides flexibility

– Improve the flexibility of managing storage across multiple file systems/platforms

Enabled user-optimal migration

– Balance disk access locality and load balance automatically and transparently

– Adapt to changes of workloads and loads in each storage device

Policy-based storage: Umbrella File System

– Allows matching application characteristics to devices