Operating System Support for Space Allocation in Grid Storage Systems Douglas Thain University of...

Operating System Support for Space Allocation in Grid Storage Systems

Douglas Thain

University of Notre Dame

IEEE Grid Computing, Sep 2006

Bad News:

Many large distributed systems fall to pieces under heavy load!

Example: Grid3 (OSG)

Robert Gardner, et al. (102 authors), "The Grid3 Production Grid: Principles and Practice," IEEE HPDC 2004

The Grid2003 Project has deployed a multi-virtual organization, application-driven grid laboratory that has sustained for several months the production-level services required by…

ATLAS, CMS, SDSS, LIGO…

Grid2003: The Details

The good news:
– 27 sites with 2800 CPUs
– 40985 CPU-days provided over 6 months
– 10 applications with 1300 simultaneous jobs

The bad news on ATLAS jobs:
– 40-70 percent utilization
– 30 percent of jobs would fail.
– 90 percent of failures were site problems
– Most site failures were due to disk space!

A Thought Experiment

[Figure: several CPUs each run a task that writes its output ("out") to one shared disk; the workload is x 1,000,000 tasks.]

1 - Only a problem when load > capacity.

2 – Grids are employed by users with infinite needs!

Need Space Allocation

• Grid storage managers:
  – SRB – Storage Resource Broker at SDSC.
  – SRM – Storage Resource Manager at LBNL.
  – NeST – Networked Storage at UW-Madison.
  – IBP – Internet Backplane Protocol at UTK.
• But, they do not have any help from the OS.
  – A runaway logfile can invalidate the careful accounting of the grid storage mgr.

Outline

• Grids Need OS Support for Allocation

• A Model of Space Allocation

• Three Implementations
  – User-Level Library
  – Loopback Devices
  – AllocFS: Kernel Filesystem

• Application to a Cluster

A Model of Space Allocation

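[Figure: a directory tree containing jobs and home, with alice and betty beneath; the allocations shown are labeled "size: 1000 GB, used: 0 GB" and "size: 100 GB, used: 0 GB".]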

Three commands:

mkalloc (dir) (size)

lsalloc (dir)

rm -rf (dir)
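The three commands can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the class and method names are hypothetical, sizes are plain GB integers, and it assumes that creating a child allocation charges the child's full size to its parent right away, while ordinary writes are charged only to the allocation they occur in.

```python
class AllocationError(Exception):
    """Raised when a request does not fit in the enclosing allocation."""


class Allocation:
    """Toy in-memory model of one allocation; sizes are in GB."""

    def __init__(self, size):
        self.size = size      # space reserved for this subtree
        self.used = 0         # data written here plus child reservations
        self.children = {}

    def mkalloc(self, name, size):
        """Carve a child allocation out of this one, if it fits."""
        if name in self.children:
            raise AllocationError(f"{name}: already exists")
        if self.used + size > self.size:
            raise AllocationError(f"{name}: {size} GB does not fit")
        self.children[name] = Allocation(size)
        self.used += size     # assumption: whole child size charged up front
        return self.children[name]

    def write(self, amount):
        """Charge ordinary file data to this allocation only."""
        if self.used + amount > self.size:
            raise AllocationError("allocation is full")
        self.used += amount

    def lsalloc(self):
        """Report (used, size), loosely mirroring lsalloc."""
        return (self.used, self.size)

    def remove(self, name):
        """Model of rm -rf: hand the child's reservation back."""
        child = self.children.pop(name)
        self.used -= child.size


root = Allocation(1000)
jobs = root.mkalloc("jobs", 100)
j5 = jobs.mkalloc("j5", 10)
j5.write(5)
print(root.lsalloc(), jobs.lsalloc(), j5.lsalloc())
```

Because a child is charged in full at creation, a write inside j5 never has to touch jobs or the root, which is one plausible reason allocation checks can stay cheap.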

No Built-In Allocation Policy

• In order to make an allocation:
  – Must have permission to mkdir.
  – New allocation must fit in available space.
• Need something more complex?
  – Check remote database re global quota?
  – Delete allocation after a certain time?
  – Send email when allocation is full?
• Use a storage manager at a higher level.
  – SRB, SRM, NeST, IBP, etc...

No Built-In Allocation Policy

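[Figure: a request "need 10 GB" reaches the grid storage manager. The manager applies its own policy (check database, charge credit card, consult human...), then runs:

mkalloc /jobs/j5 10GB
setacl /jobs/j5 alice write

It replies "ok, use jobs/j5" (writeable by alice). The allocation starts at size: 10 GB, used: 0 GB and is then used through ordinary file access (shown at size: 10 GB, used: 5 GB). task1 and task2 each appear with size: 5 GB, used: 0 GB.]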
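The split between mechanism and policy can be sketched as a small manager that decides whether to grant a request and then issues the two commands. Everything here is hypothetical: the `StorageManager` class, the per-user credit table standing in for "check database", and the injected `mkalloc` and `setacl` callables, which in a real deployment would invoke the actual tools.

```python
class StorageManager:
    """Toy policy layer that sits above the allocation mechanism."""

    def __init__(self, mkalloc, setacl, credit_gb, first_job=1):
        self.mkalloc = mkalloc              # callable(path, size_gb)
        self.setacl = setacl                # callable(path, user, right)
        self.credit_gb = dict(credit_gb)    # hypothetical policy database
        self.next_job = first_job

    def request(self, user, size_gb):
        """Return the granted path, or None if policy refuses."""
        if self.credit_gb.get(user, 0) < size_gb:
            return None
        path = f"/jobs/j{self.next_job}"
        self.mkalloc(path, size_gb)         # the mechanism enforces the space
        self.setacl(path, user, "write")
        self.credit_gb[user] -= size_gb
        self.next_job += 1
        return path


# Record the commands instead of running them, so the sketch is harmless.
issued = []
manager = StorageManager(
    mkalloc=lambda path, size_gb: issued.append(f"mkalloc {path} {size_gb}GB"),
    setacl=lambda path, user, right: issued.append(f"setacl {path} {user} {right}"),
    credit_gb={"alice": 15},
    first_job=5,
)
granted = manager.request("alice", 10)
refused = manager.request("alice", 10)
print(granted, refused, issued)
```

The manager never tracks bytes written; it only decides who gets an allocation, and the layer below keeps the accounting honest.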


User Level Library

[Figure: two applications (Appl), each linked with LibAlloc, share one allocation (size: 10 GB, used: 2 GB). Each LibAlloc instance follows three steps: 1 - lock/read, 2 - stat/write, 3 - write/unlock.]

User Level Library

• Some details about locking: see paper.
• Applicability
  – Must modify apps or servers to employ.
  – Fails if non-enabled apps interfere.
  – But, can employ anywhere without privileges.
• Performance
  – Optimization: Cache locks until idle 2 sec.
  – At best, writes double in latency.
  – At worst, shared directories ping-pong locks.
• Recovery
  – fixalloc: traverses the directory structure and recomputes current allocations.
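The lock/read, stat/write, write/unlock sequence might look roughly like the following. This is a minimal sketch, not the library's actual code: the state file name `.allocstate`, its JSON layout, and the function `alloc_append` are assumptions, and it uses POSIX advisory locks through Python's `fcntl.flock`, so it only protects against other processes that follow the same protocol.

```python
import errno
import fcntl
import json
import os
import tempfile

STATE_FILE = ".allocstate"  # hypothetical name for the per-allocation record


def alloc_append(alloc_dir, name, data):
    """Append data to a file, charging the growth to the allocation."""
    state_path = os.path.join(alloc_dir, STATE_FILE)
    target = os.path.join(alloc_dir, name)
    with open(state_path, "r+") as state_file:
        fcntl.flock(state_file, fcntl.LOCK_EX)            # 1 - lock
        try:
            state = json.load(state_file)                 # 1 - read
            before = os.stat(target).st_size if os.path.exists(target) else 0  # 2 - stat
            if state["used"] + len(data) > state["size"]:
                raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
            with open(target, "ab") as out:               # 2 - write
                out.write(data)
            state["used"] += os.stat(target).st_size - before
            state_file.seek(0)                            # 3 - write state
            state_file.truncate()
            json.dump(state, state_file)
            state_file.flush()
        finally:
            fcntl.flock(state_file, fcntl.LOCK_UN)        # 3 - unlock
    return state["used"]


# Demo in a throwaway directory with a 10-byte allocation.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, STATE_FILE), "w") as f:
    json.dump({"size": 10, "used": 0}, f)

used_after_first = alloc_append(demo_dir, "log", b"12345678")
try:
    alloc_append(demo_dir, "log", b"123")
    overflowed = False
except OSError as err:
    overflowed = err.errno == errno.ENOSPC
print(used_after_first, overflowed)
```

Each write costs an extra lock, read, and state update, which is consistent with the slide's point that writes at best double in latency. A process that writes without calling the library bypasses the accounting entirely.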

Loopback Filesystems

[Figure: filesystems labeled size: 1000 GB, size: 100 GB, and size: 10 GB.]

dd if=/dev/zero of=/jobs.fs bs=1M count=102400    # write 100 GB of zeros

losetup /dev/loopN /jobs.fs

mke2fs /dev/loopN

mount /dev/loopN /jobs

Loopback Filesystems

• Applicability
  – Works with any standard application.
  – Must be root to deploy and manage allocations.
  – Limited to approx 10-100 allocations.
• Performance
  – Ordinary reads and writes: no overhead.
  – Allocations: Must touch every block to reserve!
  – Massively increases I/O traffic to disk.
• Recovery
  – Must scan hierarchy, fsck and mount every allocation.
  – Disastrous for large file systems!

AllocFS: Kernel-Level Filesystem

Inode Table

#   uid   size      used     parent
2   0     1000 GB   700 GB   2
3   0     100 GB    99 GB    2
4   34    10 GB     5 GB     3
5   34    -         -        4
6   56    -         -        3
7   56    -         -        7

[Figure: inodes without size/used fields are ordinary files.]

1 – To update allocation state, update fields in the in-core inode.

2 – To create/delete an allocation, update the parent’s allocation state, which is already cached for other reasons.
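A rough way to read the inode table: a write to a file is charged to the nearest inode that carries size and used fields. The sketch below uses a few rows from the table and assumes the parent column can be followed upward until an allocation is found; that interpretation, along with the `Inode` class and the function names, is an assumption rather than AllocFS's actual data structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inode:
    uid: int
    parent: int
    size: Optional[int] = None   # GB; None means "not an allocation"
    used: Optional[int] = None


table = {
    2: Inode(uid=0, parent=2, size=1000, used=700),
    3: Inode(uid=0, parent=2, size=100, used=99),
    4: Inode(uid=34, parent=3, size=10, used=5),
    5: Inode(uid=34, parent=4),
    6: Inode(uid=56, parent=3),
}


def enclosing_allocation(table, ino):
    """Follow parent links until an inode with allocation fields is found."""
    while table[ino].size is None:
        ino = table[ino].parent
    return ino


def charge(table, ino, amount):
    """Charge a write to the enclosing allocation; False if it will not fit."""
    alloc = table[enclosing_allocation(table, ino)]
    if alloc.used + amount > alloc.size:
        return False
    alloc.used += amount
    return True


fits = charge(table, 5, 2)     # inode 5 lives in allocation 4 (5 of 10 GB used)
full = charge(table, 6, 2)     # inode 6 lives in allocation 3 (99 of 100 GB used)
print(fits, full, table[4].used, table[3].used)
```

Only one allocation record changes per write, which fits the slide's claim that updating allocation state means updating fields in an inode that is already in memory.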

AllocFS: Kernel-Level Filesystem

• Applicability
  – Works with any ordinary application.
  – Must load module and be root to install.
  – Binary compatible with existing EXT2 filesystem.
  – Once loaded, ordinary users may employ.
• Performance
  – No measurable overhead on I/O.
  – Creating an allocation: touch two inodes.
  – Deleting an allocation: same as deleting directory.
• Recovery
  – fixalloc: traverses the directory structure and recomputes current allocations.
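A recovery pass in the spirit of fixalloc can be sketched as a recursive directory walk. This is an illustration only: the real tool's on-disk format is not described here, so the sketch invents a hypothetical marker file `.alloc` holding an allocation's size in bytes, and it assumes a nested allocation is charged to its parent at its full size.

```python
import os
import tempfile

MARKER = ".alloc"  # hypothetical marker file: its content is the size in bytes


def recompute(path, report):
    """Recompute usage under path; return the bytes to charge to the parent."""
    used = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.name == MARKER:
                continue
            if entry.is_dir(follow_symlinks=False):
                used += recompute(entry.path, report)
            elif entry.is_file(follow_symlinks=False):
                used += entry.stat(follow_symlinks=False).st_size
    marker = os.path.join(path, MARKER)
    if os.path.exists(marker):
        with open(marker) as f:
            size = int(f.read())
        report[path] = (used, size)
        return size          # assumption: parent pays the full reservation
    return used              # plain directory: its files count toward the parent


def _put(path, content):
    with open(path, "wb") as f:
        f.write(content)


# Demo tree: a 1000-byte allocation holding a file, a plain directory,
# and a nested 100-byte allocation.
top = tempfile.mkdtemp()
sub = os.path.join(top, "sub")
plain = os.path.join(top, "plain")
os.makedirs(sub)
os.makedirs(plain)
_put(os.path.join(top, MARKER), b"1000")
_put(os.path.join(top, "a.txt"), b"x" * 10)
_put(os.path.join(sub, MARKER), b"100")
_put(os.path.join(sub, "data"), b"y" * 30)
_put(os.path.join(plain, "b.txt"), b"z" * 5)

report = {}
recompute(top, report)
print(report[top], report[sub])
```

The walk visits every file once, so its cost grows with the number of files rather than with the amount of data stored.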

Library Adds Latency

Allocation Performance

• Loopback Filesystem
  – 1 second per 25 MB of allocation. (40 sec/GB)
  – Must touch every single block.
  – Big increase in unnecessary I/O traffic!
• Allocation Library
  – 227 usec regardless of size.
  – Several synchronous disk ops.
• Kernel Level Filesystem
  – 32 usec regardless of size.
  – Touch one inode.
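To put the three figures side by side, the arithmetic for a single allocation can be worked out directly. The constants come from the slide; the function name and the 10 GB example size are illustrative choices.

```python
LOOPBACK_SEC_PER_GB = 40.0   # 1 second per 25 MB, as quoted on the slide
LIBRARY_SEC = 227e-6         # 227 usec regardless of size
KERNEL_SEC = 32e-6           # 32 usec regardless of size


def allocation_seconds(method, size_gb):
    """Estimated time to create one allocation of size_gb gigabytes."""
    if method == "loopback":
        return LOOPBACK_SEC_PER_GB * size_gb   # linear: every block is touched
    if method == "library":
        return LIBRARY_SEC                     # constant
    if method == "kernel":
        return KERNEL_SEC                      # constant
    raise ValueError(f"unknown method: {method}")


for method in ("loopback", "library", "kernel"):
    print(method, allocation_seconds(method, 10))
```

For a 10 GB allocation the loopback approach takes 400 seconds, while the other two stay at a fixed cost in microseconds no matter how large the allocation is.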

Comparison

           (privilege)            Guarantee?   Max #      Write        (alloc time)   Recovery
Library    any                    no           no limit   2x latency   usec           fixalloc
Loopback   root to install, use   yes          10-100     no change    secs to mins   fsck and mount each alloc
Kernel     root to install        yes          no limit   no change    usec           fixalloc


A Physical Experiment

[Figure: several CPUs each run a task that writes its output ("out") to one shared disk.]

Four configurations:
1 – No allocations.
2 – Backoff when failures detected.
3 – Heuristic: don’t start job unless space > threshold.
4 – Allocate space for each job.

Only space for 10.
Vary load: # of simultaneous jobs.
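A toy model hints at why a free-space heuristic can fail where explicit allocation does not. It is not the experiment itself and reproduces none of its measurements. It assumes every job needs exactly 1 unit of space, that all jobs arrive before any output is written, and it omits the backoff configuration; the function name and parameters are hypothetical.

```python
def simulate(policy, n_jobs, capacity=10, threshold=1):
    """Return (completed, failed, deferred) for n_jobs arriving together.

    Toy assumptions: each job writes exactly 1 unit, and every job is
    admitted or deferred before any job writes its output.
    """
    reserved = 0
    started = 0
    deferred = 0
    for _ in range(n_jobs):
        if policy == "none":
            admit = True
        elif policy == "heuristic":
            # No output exists yet, so the disk still looks empty to every job.
            admit = capacity > threshold
        elif policy == "allocate":
            admit = reserved + 1 <= capacity
            if admit:
                reserved += 1          # space is claimed at admission time
        else:
            raise ValueError(f"unknown policy: {policy}")
        if admit:
            started += 1
        else:
            deferred += 1
    completed = min(started, capacity)  # only this many outputs fit on disk
    failed = started - completed
    return completed, failed, deferred


for policy in ("none", "heuristic", "allocate"):
    print(policy, simulate(policy, n_jobs=25))
```

In this model the heuristic behaves like no control at all, because the space check happens before the space is consumed. With allocation, excess jobs wait instead of failing after they have already consumed CPU time.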

Allocations Improve Robustness

Summary

• Grids require space allocations in order to become robust under heavy loads.

• Explicit operating system support for allocations is needed in order to make them manageable and efficient.

• User level approximations are possible, but have overheads in perf and mgmt.

• AllocFS provides allocations compatible with EXT2 with no measurable overhead.

Library Implementation

• http://www.cctools.org/chirp
• Solaris, Linux, Mac, Windows
• Start server with -Q 100GB

Kernel Implementation

• http://www.cctools.org/allocfs

• Works with Linux 2.4.21.

• Install over existing EXT2 FS.
  – (And, uninstall without loss.)

% mkalloc /mnt/alloctest/adir 25M

mkalloc: /mnt/alloctest/adir allocated 25600 blocks.

% lsalloc -r /mnt/alloctest

USED TOTAL PCT PATH

25.01M 87.14M 28% /mnt/alloctest

10.00M 25.00M 39% /mnt/alloctest/adir

A Final Thought

[Some think] traditional OS issues are either solved problems or minor problems. We believe that building such vast distributed systems upon the fragile infrastructure provided by today’s operating systems is analogous to building castles on sand.

The Persistent Relevance of the Local Operating System to Global Applications

Jay Lepreau, Bryan Ford, and Mike Hibler

SIGOPS European Workshop, September 1996

For More Information:

• Cooperative Computing Lab:
  – http://www.cse.nd.edu/~ccl
• Douglas Thain
  – dthain@cse.nd.edu
• Related Talks:
  – “Grid Deployment of Bioinformatics Apps...”
    • Session 4A Friday
  – “Cacheable Decentralized Groups...”
    • Session 5B Friday

Extra Slides

Existing Tools Not Suitable for the Grid

• User and Group Quotas
  – Don’t always correspond to allocation needs!
    • User might want one alloc per job.
    • Or, many users may want to share an alloc.
• Disk Partitions
  – Very expensive to create, change, manage.
  – Not hierarchical: only root can manage.
• ZFS Allocations
  – Cheap to create, change, manage.
  – Not hierarchical: only root can manage.

Library Suffers on Small Writes

Recovery Linear wrt # of Files
