© 2012 IBM Corporation
IBM General Parallel File System (GPFS™) 3.5 and SoNAS: Scalable high-performance file system
Klaus Gottschalk, HPC Architect, IBM Germany
Anselm Hruschka, IT Specialist, IBM Germany
Agenda – What's New in GPFS 3.5
Active File Management
GPFS Native RAID
Independent Filesets
Improved Quotas
ILM Improvements
Evolution of the global namespace: GPFS Active File Management (AFM)
– 1993: GPFS introduced concurrent file system access from multiple nodes.
– 2005: Multi-cluster expands the global namespace by connecting multiple sites.
– 2011: AFM takes the global namespace truly global by automatically managing asynchronous replication of data.
AFM Use Cases
NAS as a key building block of a cloud storage architecture:
– Enables edge caching in the cloud
– DR support within cloud data repositories
– Peer-to-peer data access among cloud edge sites
– Global wide-area file system spanning multiple sites in the cloud
[Diagram: HPC, Distributed NAS, and Storage Cloud deployments]
WAN Caching: caching across the WAN between SoNAS clusters, or between SoNAS and another NAS vendor
Data Migration: online cross-vendor data migration
Disaster Recovery: multi-site fileset-level replication/failover
Shared Namespace: across SoNAS clusters
Grid computing: allowing data to move transparently during grid workflows
Facilitates content distribution for global enterprises and “follow-the-sun” engineering teams
AFM Architecture
[Diagram: two cache cluster sites (GPFS + Panache, SoNAS layer, gateway nodes) connected via pNFS/NFS over the WAN to a home cluster site (any NAS box or SOFS); data is pulled on cache miss and pushed on write]
A fileset on the home cluster is associated with a fileset on one or more cache clusters.
If data is in cache:
– Cache hit at local disk speeds
– Client sees local GPFS performance if the file or directory is in cache
If data is not in cache:
– Data and metadata (files and directories) are pulled on demand at network line speed and written to GPFS
– Uses NFS/pNFS for WAN data transfer
If data is modified at home:
– Revalidation is done at a configurable timeout
– Close to NFS-style close-to-open consistency across sites
– POSIX strong consistency within the cache site
If data is modified at cache:
– Writes see no WAN latency
– Writes go to the cache (i.e. local GPFS), then are asynchronously pushed home
If the network is disconnected:
– Cached data can still be read, and writes to the cache are written back after reconnection
– There can be conflicts…
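To give a feel for how this cache-to-home association is established, here is a minimal sketch for the cache cluster side. The host name, export path, and fileset names are hypothetical, and the afmTarget/afmMode parameter spellings are assumptions that should be verified against the GPFS 3.5 AFM documentation:

    # Create an AFM cache fileset backed by an NFS export of the home cluster
    # (hypothetical names: file system fs1, fileset webcache, home export homenas:/export/web)
    mmcrfileset fs1 webcache --inode-space new \
        -p afmTarget=homenas:/export/web \
        -p afmMode=single-writer
    # Link the cache fileset into the namespace so clients can use it
    mmlinkfileset fs1 webcache -J /gpfs/fs1/webcache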
Remote site – read caching
[Diagram: IBM Scale Out NAS home site with interface nodes, storage nodes, a policy engine, Tier 1 SAS drives and Tier 2 SATA drives; a remote edge site runs Panache with Scale Out NAS and caches files such as /home/appl/data/web/important_big_spreadsheet.xls, big_architecture_drawing.ppt, and unstructured_big_video.mpg]
– Remote user reads the file from the local edge device
– On a cache miss the file is auto-read from the home site and cached to local disk
– Reads can run disconnected
Remote site – write caching, update home site
[Diagram: same home and edge sites as above, sharing a global namespace]
– Remote user writes the file to the local edge device
– The write is cached to local disk
– Changes are pushed back to the home site periodically, or when the network is reconnected
Panache Modes
Single Writer
– Only the cache can write data; home can’t change it. Other peer caches have to be set up as read-only caches.
Read Only
– The cache can only read data; no data changes are allowed.
Local Update
– Data is cached from home and changes are allowed as in SW mode, but changes are not pushed to home.
– Once data is changed, the relationship is broken, i.e. cache and home are no longer in sync for that file.
Change of Modes
– SW & RO mode caches can be changed to any other mode.
– An LU cache can’t be changed – too many complications/conflicts to deal with.
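A rough sketch of how a mode is assigned and later changed (fileset and export names are hypothetical, and the afmMode values and unlink-before-change procedure are assumptions to be checked against the AFM documentation):

    # Create a read-only cache fileset
    mmcrfileset fs1 rocache --inode-space new \
        -p afmTarget=homenas:/export/data -p afmMode=read-only
    mmlinkfileset fs1 rocache -J /gpfs/fs1/rocache

    # Later, convert it to single-writer (RO -> SW is allowed; LU cannot be converted)
    mmunlinkfileset fs1 rocache
    mmchfileset fs1 rocache -p afmMode=single-writer
    mmlinkfileset fs1 rocache -J /gpfs/fs1/rocache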
Pre-fetching
Policy-based pre-population
Periodically runs a parallel inode scan at home
– Selects files/dirs based on policy criteria
• Includes any user-defined metadata in xattrs or other file attributes
• SQL-like construct to select, e.g.:
• RULE LIST 'prefetchlist' WHERE FILESIZE > 1GB AND MODIFICATION_TIME > CURRENT_TIME - 3600 AND USER_ATTR1 = 'sat-photo' OR USER_ATTR2 = 'classified'
The cache then pre-fetches the selected objects
– Runs asynchronously in the background
– Parallel multi-node prefetch
– Can issue a callout when completed
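A hedged sketch of driving such a prefetch from the command line (the policy file, list-file names, and fileset names are hypothetical, and the exact mmapplypolicy/mmafmctl prefetch options should be verified against the documentation for the installed release):

    # Build the candidate list from a policy file containing the LIST rule above,
    # then ask AFM to prefetch those objects into the cache fileset
    mmapplypolicy fs1 -P /tmp/prefetch.pol -I defer -f /tmp/prefetch
    mmafmctl fs1 prefetch -j webcache --list-file /tmp/prefetch.list.prefetchlist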
Expiration of Data
Staleness Control
– Defined based on time since disconnection
– Once the cache is expired, no access to the cache is allowed
– Manual expire/unexpire option for the admin
• mmafmctl –expire/unexpire (ctlcache in SoNAS)
– Allowed only for RO mode caches
– Disabled for SW & LU, as they are sources of data themselves
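A minimal sketch of the manual control mentioned above (file system and fileset names are hypothetical; verify the subcommand syntax against the mmafmctl man page):

    # Mark a read-only cache fileset as expired (no access until it is unexpired)
    mmafmctl fs1 expire -j rocache
    # Re-enable access to the cached data
    mmafmctl fs1 unexpire -j rocache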
Use Case: Central/Branch Office
The central site is where data is created, maintained, and updated/changed.
This is typical of customer situations where data is ingested via satellite, data warehousing, etc.
Branch/edge sites can periodically prefetch (via policy) or pull on demand.
Data is revalidated when accessed.
A typical scenario for this is:
– Music sites, where data is maintained at a central location and sites in various other locations pull data into cache and serve it locally.
Customer use cases:
– BofA, NGA, US Army
[Diagram: HQ primary site (writer) feeding an edge site (reader) via periodic prefetch or on-demand pull]
Use Case: Independent Writers
Local sites dedicated to researchers within a campus, sharing a dedicated home file system but with individual home directories.
Each site/system has its own fileset on its local cluster.
This use case often has a central system that holds all home directories; backup/HSM is managed from there.
Another example: a company spread across various countries maintains logs of phone calls/SMS as needed. The headquarters needs to maintain logs/records of all calls/SMS from all countries, and per-country requirements dictate that the data be maintained to process any queries received from customers, government, etc.
Typically the company headquarters contains all records from all countries/branches of its office, and each location maintains logs/records for its own country and any country it requires.
[Diagram: User A's home directory (writer) and User B's home directory (writer) replicated to a backup site]
Global Namespace
Clients on every cluster access the same paths: /global/data1, /global/data2, /global/data3, /global/data4, /global/data5, /global/data6
Three file systems (store1, store2, store3) each hold two of the six filesets as local filesets and the other four as cache filesets.
See all data from any cluster.
Cache as much data as required, or fetch data on demand.
Why build GPFS Native RAID?
Disk rebuilding is a fact of life at the petascale level
– With 100,000 disks and a per-disk MTBF of 600 Khrs, a rebuild is triggered about four times a day
– A 24-hour rebuild implies four concurrent, continuous rebuilds at all times
– With larger disks, rebuild times grow, increasing the risk of a second disk failure
Disk integrity issues
– Silent I/O drops, etc.
What is GPFS Native RAID?
Software RAID on the I/O servers
– SAS-attached JBOD
– Special JBOD storage drawer for very dense drive packing
– Solid-state drives (SSDs) for metadata storage
Features
• Auto rebalancing
• Only 2% rebuild performance hit
• Reed-Solomon erasure code, “8 data + 3 parity”
• ~10^5-year MTTDL for a 100-PB file system
• End-to-end, disk-to-GPFS-client data checksums
[Diagram: NSD servers with SAS-attached JBODs exporting vdisks over the local area network (LAN)]
Declustered RAID
– Data, parity, and spare strips are uniformly and independently distributed across the disk array
– Supports an arbitrary number of disks per array
• Not restricted to an integral number of RAID track widths
[Diagram: conventional vs. declustered RAID layout]
GPFS Native RAID algorithm
Each block of each file is striped
Two types of RAID codes
– 2-fault and 3-fault tolerant codes (‘RAID-D2’, ‘RAID-D3’)
– 3- or 4-way replication
– 8 + 2 or 8 + 3 parity
[Diagram: 2-fault tolerant codes are 3-way replication (1+2) or 8+2p Reed-Solomon; 3-fault tolerant codes are 4-way replication (1+3) or 8+3p Reed-Solomon; a GPFS block is split into 8 strips plus 2 or 3 redundancy strips (replication stores 1 strip plus 2 or 3 replicated strips)]
Component Hierarchy
A recovery group can have
– A maximum of 512 disks
– 16 declustered arrays
– At least 1 SSD log vdisk
– A maximum of 64 vdisks
A declustered array can
– Contain up to 128 pdisks
– Smallest is 4 disks
– Must have one large declustered array (>= 11 disks)
– Needs 1 or more pdisks' worth of spare space
Vdisks
– Block size: 1 MiB, 2 MiB, 4 MiB, 8 MiB, or 16 MiB
[Diagram: recovery groups contain declustered arrays (DA) of pdisks; vdisks, exposed as NSDs, are carved from the declustered arrays]
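To make the hierarchy concrete, here is a rough sketch of how recovery groups and vdisks are typically defined on a GPFS Native RAID building block. Node names, stanza values, and sizes are hypothetical, and the stanza keywords should be verified against the mmcrrecoverygroup/mmcrvdisk documentation:

    # Example vdisk stanza (one line of a stanza file, hypothetical values):
    # %vdisk: vdiskName=rg1_data_8m rg=rg1 da=DA1 blocksize=8m size=50t raidCode=8+3p diskUsage=dataOnly pool=data

    # Create the recovery group from a pdisk stanza file, served by a pair of NSD servers
    mmcrrecoverygroup rg1 -F rg1.pdisk.stanza --servers io1,io2
    # Create the vdisks described in the vdisk stanza file, then register them as NSDs
    mmcrvdisk -F rg1.vdisk.stanza
    mmcrnsd -F rg1.vdisk.stanza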
Independent Filesets
– Own inode space; dynamic expansion of inodes
– Efficient file management operations
– Fileset-level snapshots
– Per-user/group quotas per fileset
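A brief sketch of creating an independent fileset and taking a fileset-level snapshot (file system, fileset, and junction names are hypothetical; the flags follow the GPFS 3.5 command documentation and should be double-checked):

    # Create an independent fileset with its own inode space and an initial inode limit
    mmcrfileset fs1 proj1 --inode-space new --inode-limit 1000000
    # Link it into the file system namespace
    mmlinkfileset fs1 proj1 -J /gpfs/fs1/proj1
    # Snapshot just this fileset
    mmcrsnapshot fs1 proj1_snap1 -j proj1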
What's new in 3.5?
New event callbacks (see the sketch after this list)
– Tiebreaker callback lets the customer decide which side survives in case of a network partition
– diskDown ensures the desired action is taken when a disk goes down
Performance enhancements
– NSD multi-queue
• Provides more pipelining and parallelism in I/O scheduling
• Better I/O performance in large SMP configurations
– Data in inode for small files
– Striped log files – provide balanced disk usage in small clusters
ILM enhancements
– Scope option allows scans to be limited to a fileset, file system, or inode space
– choice-algorithm (best, exact, fast)
– split-margin specifies how much deviation is allowed when using the fast choice algorithm, in terms of THRESHOLD usage etc.
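As referenced above, a minimal sketch of registering an event callback with mmaddcallback. The script path is hypothetical and the event name for the tiebreaker decision is assumed to be tiebreakerCheck; confirm the exact event names in the 3.5 mmaddcallback documentation:

    # Run a site-provided script when GPFS must decide which side of a network partition survives
    mmaddcallback clusterTiebreaker \
        --command /usr/local/sbin/decide_survivor.sh \
        --event tiebreakerCheck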
Misc
Snapshot Clones
– Quick, efficient way of making a file copy by creating a clone
• Doesn’t copy data blocks (e.g. fits well with VM images)
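A short sketch of the clone workflow with mmclone (file names are hypothetical; the snap/copy subcommands follow the GPFS documentation and should be verified for the release in use):

    # Turn the source file into a read-only clone parent, then create space-efficient copies
    mmclone snap vm_master.img vm_master.snap
    mmclone copy vm_master.snap vm01.img
    mmclone copy vm_master.snap vm02.img
    # Display clone relationships
    mmclone show vm01.img vm02.img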
IPv6 support
Windows
– GPFS daemon no longer needs SUA (SUA is still required for GPFS admin commands)
SELinux support
API to access the xattrs of a file