© 2012 IBM Corporation
IBM General Parallel File System (GPFS™) 3.5 and SoNAS: Scalable high-performance file system
Klaus Gottschalk, HPC Architect, IBM Germany
Anselm Hruschka, IT Specialist, IBM Germany
Agenda – What's New in GPFS 3.5
Active File Management
GPFS Native RAID
Independent Filesets
Improved Quotas
ILM Improvements
Evolution of the global namespace: GPFS Active File Management (AFM)
– 1993: GPFS introduced concurrent file system access from multiple nodes.
– 2005: Multi-cluster expands the global namespace by connecting multiple sites.
– 2011: AFM takes the global namespace truly global by automatically managing asynchronous replication of data.
AFM Use Cases
NAS as a key building block of a cloud storage architecture:
– Enables edge caching in the cloud
– DR support within cloud data repositories
– Peer-to-peer data access among cloud edge sites
– Global wide-area file system spanning multiple sites in the cloud
[Diagram: HPC, Distributed NAS, and Storage Cloud deployments]
WAN Caching: caching across the WAN between SoNAS clusters, or between SoNAS and another NAS vendor
Data Migration: online cross-vendor data migration
Disaster Recovery: multi-site fileset-level replication/failover
Shared Namespace: across SoNAS clusters
Grid computing: allowing data to move transparently during grid workflows
Facilitates content distribution for global enterprises and “follow-the-sun” engineering teams
AFM Architecture
[Diagram: two cache cluster sites (GPFS + Panache, SoNAS layer, gateway nodes) connected via pNFS/NFS over the WAN to a home cluster site (any NAS box or SOFS); data is pulled on cache miss and pushed on write]
A fileset on the home cluster is associated with a fileset on one or more cache clusters.
If data is in cache:
– Cache hit at local disk speeds
– Client sees local GPFS performance if the file or directory is in cache
If data is not in cache:
– Data and metadata (files and directories) are pulled on demand at network line speed and written to GPFS
– Uses NFS/pNFS for WAN data transfer
If data is modified at home:
– Revalidation is done at a configurable timeout
– Close to NFS-style close-to-open consistency across sites
– POSIX strong consistency within the cache site
If data is modified at cache:
– Writes see no WAN latency
– Writes go to the cache (i.e. local GPFS), then are asynchronously pushed home
If the network is disconnected:
– Cached data can still be read, and writes to the cache are written back after reconnection
– There can be conflicts…
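To give a feel for how this cache-to-home association is established, here is a minimal sketch for the cache cluster side. The host name, export path, and fileset names are hypothetical, and the afmTarget/afmMode parameter spellings are assumptions that should be verified against the GPFS 3.5 AFM documentation:

    # Create an AFM cache fileset backed by an NFS export of the home cluster
    # (hypothetical names: file system fs1, fileset webcache, home export homenas:/export/web)
    mmcrfileset fs1 webcache --inode-space new \
        -p afmTarget=homenas:/export/web \
        -p afmMode=single-writer
    # Link the cache fileset into the namespace so clients can use it
    mmlinkfileset fs1 webcache -J /gpfs/fs1/webcache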
Remote site – read caching
[Diagram: IBM Scale Out NAS home site with interface nodes, storage nodes, a policy engine, Tier 1 SAS drives and Tier 2 SATA drives; a remote edge site runs Panache with Scale Out NAS and caches files such as /home/appl/data/web/important_big_spreadsheet.xls, big_architecture_drawing.ppt, and unstructured_big_video.mpg]
– Remote user reads the file from the local edge device
– On a cache miss the file is auto-read from the home site and cached to local disk
– Reads can run disconnected
Remote site – write caching, update home site
[Diagram: same home and edge sites as above, sharing a global namespace]
– Remote user writes the file to the local edge device
– The write is cached to local disk
– Changes are pushed back to the home site periodically, or when the network is reconnected
Panache Modes
Single Writer
– Only the cache can write data; home can’t change it. Other peer caches have to be set up as read-only caches.
Read Only
– The cache can only read data; no data changes are allowed.
Local Update
– Data is cached from home and changes are allowed as in SW mode, but changes are not pushed to home.
– Once data is changed, the relationship is broken, i.e. cache and home are no longer in sync for that file.
Change of Modes
– SW & RO mode caches can be changed to any other mode.
– An LU cache can’t be changed – too many complications/conflicts to deal with.
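A rough sketch of how a mode is assigned and later changed (fileset and export names are hypothetical, and the afmMode values and unlink-before-change procedure are assumptions to be checked against the AFM documentation):

    # Create a read-only cache fileset
    mmcrfileset fs1 rocache --inode-space new \
        -p afmTarget=homenas:/export/data -p afmMode=read-only
    mmlinkfileset fs1 rocache -J /gpfs/fs1/rocache

    # Later, convert it to single-writer (RO -> SW is allowed; LU cannot be converted)
    mmunlinkfileset fs1 rocache
    mmchfileset fs1 rocache -p afmMode=single-writer
    mmlinkfileset fs1 rocache -J /gpfs/fs1/rocache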
Pre-fetching
Policy-based pre-population
Periodically runs a parallel inode scan at home
– Selects files/dirs based on policy criteria
• Includes any user-defined metadata in xattrs or other file attributes
• SQL-like construct to select, e.g.:
• RULE LIST 'prefetchlist' WHERE FILESIZE > 1GB AND MODIFICATION_TIME > CURRENT_TIME - 3600 AND USER_ATTR1 = 'sat-photo' OR USER_ATTR2 = 'classified'
The cache then pre-fetches the selected objects
– Runs asynchronously in the background
– Parallel multi-node prefetch
– Can issue a callout when completed
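A hedged sketch of driving such a prefetch from the command line (the policy file, list-file names, and fileset names are hypothetical, and the exact mmapplypolicy/mmafmctl prefetch options should be verified against the documentation for the installed release):

    # Build the candidate list from a policy file containing the LIST rule above,
    # then ask AFM to prefetch those objects into the cache fileset
    mmapplypolicy fs1 -P /tmp/prefetch.pol -I defer -f /tmp/prefetch
    mmafmctl fs1 prefetch -j webcache --list-file /tmp/prefetch.list.prefetchlist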
Expiration of Data
Staleness Control
– Defined based on time since disconnection
– Once the cache is expired, no access to the cache is allowed
– Manual expire/unexpire option for the admin
• mmafmctl –expire/unexpire (ctlcache in SoNAS)
– Allowed only for RO mode caches
– Disabled for SW & LU, as they are sources of data themselves
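A minimal sketch of the manual control mentioned above (file system and fileset names are hypothetical; verify the subcommand syntax against the mmafmctl man page):

    # Mark a read-only cache fileset as expired (no access until it is unexpired)
    mmafmctl fs1 expire -j rocache
    # Re-enable access to the cached data
    mmafmctl fs1 unexpire -j rocache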
Use Case: Central/Branch Office
The central site is where data is created, maintained, and updated/changed.
This is typical of customer situations where data is ingested via satellite, data warehousing, etc.
Branch/edge sites can periodically prefetch (via policy) or pull on demand.
Data is revalidated when accessed.
A typical scenario for this is:
– Music sites, where data is maintained at a central location and sites in various other locations pull data into cache and serve it locally.
Customer use cases:
– BofA, NGA, US Army
[Diagram: HQ primary site (writer) feeding an edge site (reader) via periodic prefetch or on-demand pull]
Use Case: Independent Writers
Local sites dedicated to researchers within a campus, sharing a dedicated home file system but with individual home directories.
Each site/system has its own fileset on its local cluster.
This use case often has a central system that holds all home directories; backup/HSM is managed from there.
Another example: a company spread across various countries maintains logs of phone calls/SMS as needed. The headquarters needs to maintain logs/records of all calls/SMS from all countries, and per-country requirements dictate that the data be maintained to process any queries received from customers, government, etc.
Typically the company headquarters contains all records from all countries/branches of its office, and each location maintains logs/records for its own country and any country it requires.
[Diagram: User A's home directory (writer) and User B's home directory (writer) replicated to a backup site]
Global Namespace
Clients on every cluster access the same paths: /global/data1, /global/data2, /global/data3, /global/data4, /global/data5, /global/data6
Three file systems (store1, store2, store3) each hold two of the six filesets as local filesets and the other four as cache filesets.
See all data from any cluster.
Cache as much data as required, or fetch data on demand.
Why build GPFS Native RAID?
Disk rebuilding is a fact of life at the petascale level
– With 100,000 disks and a per-disk MTBF of 600 Khrs, a rebuild is triggered about four times a day
– A 24-hour rebuild implies four concurrent, continuous rebuilds at all times
– With larger disks, rebuild times grow, increasing the risk of a second disk failure
Disk integrity issues
– Silent I/O drops, etc.
What is GPFS Native RAID?
Software RAID on the I/O servers
– SAS-attached JBOD
– Special JBOD storage drawer for very dense drive packing
– Solid-state drives (SSDs) for metadata storage
Features
• Auto rebalancing
• Only 2% rebuild performance hit
• Reed-Solomon erasure code, “8 data + 3 parity”
• ~10^5-year MTTDL for a 100-PB file system
• End-to-end, disk-to-GPFS-client data checksums
[Diagram: NSD servers with SAS-attached JBODs exporting vdisks over the local area network (LAN)]
Declustered RAID
– Data, parity, and spare strips are uniformly and independently distributed across the disk array
– Supports an arbitrary number of disks per array
• Not restricted to an integral number of RAID track widths
[Diagram: conventional vs. declustered RAID layout]
GPFS Native RAID algorithm
Each block of each file is striped
Two types of RAID codes
– 2-fault and 3-fault tolerant codes (‘RAID-D2’, ‘RAID-D3’)
– 3- or 4-way replication
– 8 + 2 or 8 + 3 parity
[Diagram: 2-fault tolerant codes are 3-way replication (1+2) or 8+2p Reed-Solomon; 3-fault tolerant codes are 4-way replication (1+3) or 8+3p Reed-Solomon; a GPFS block is split into 8 strips plus 2 or 3 redundancy strips (replication stores 1 strip plus 2 or 3 replicated strips)]
Component Hierarchy
A recovery group can have
– A maximum of 512 disks
– 16 declustered arrays
– At least 1 SSD log vdisk
– A maximum of 64 vdisks
A declustered array can
– Contain up to 128 pdisks
– Smallest is 4 disks
– Must have one large declustered array (>= 11 disks)
– Needs 1 or more pdisks' worth of spare space
Vdisks
– Block size: 1 MiB, 2 MiB, 4 MiB, 8 MiB, or 16 MiB
[Diagram: recovery groups contain declustered arrays (DA) of pdisks; vdisks, exposed as NSDs, are carved from the declustered arrays]
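To make the hierarchy concrete, here is a rough sketch of how recovery groups and vdisks are typically defined on a GPFS Native RAID building block. Node names, stanza values, and sizes are hypothetical, and the stanza keywords should be verified against the mmcrrecoverygroup/mmcrvdisk documentation:

    # Example vdisk stanza (one line of a stanza file, hypothetical values):
    # %vdisk: vdiskName=rg1_data_8m rg=rg1 da=DA1 blocksize=8m size=50t raidCode=8+3p diskUsage=dataOnly pool=data

    # Create the recovery group from a pdisk stanza file, served by a pair of NSD servers
    mmcrrecoverygroup rg1 -F rg1.pdisk.stanza --servers io1,io2
    # Create the vdisks described in the vdisk stanza file, then register them as NSDs
    mmcrvdisk -F rg1.vdisk.stanza
    mmcrnsd -F rg1.vdisk.stanza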
Independent Filesets
– Own inode space; dynamic expansion of inodes
– Efficient file management operations
– Fileset-level snapshots
– Per-user/group quotas per fileset
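A brief sketch of creating an independent fileset and taking a fileset-level snapshot (file system, fileset, and junction names are hypothetical; the flags follow the GPFS 3.5 command documentation and should be double-checked):

    # Create an independent fileset with its own inode space and an initial inode limit
    mmcrfileset fs1 proj1 --inode-space new --inode-limit 1000000
    # Link it into the file system namespace
    mmlinkfileset fs1 proj1 -J /gpfs/fs1/proj1
    # Snapshot just this fileset
    mmcrsnapshot fs1 proj1_snap1 -j proj1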
What's new in 3.5?
New event callbacks (see the sketch after this list)
– Tiebreaker callback lets the customer decide which side survives in case of a network partition
– diskDown ensures the desired action is taken when a disk goes down
Performance enhancements
– NSD multi-queue
• Provides more pipelining and parallelism in I/O scheduling
• Better I/O performance in large SMP configurations
– Data in inode for small files
– Striped log files – provide balanced disk usage in small clusters
ILM enhancements
– Scope option allows scans to be limited to a fileset, file system, or inode space
– choice-algorithm (best, exact, fast)
– split-margin specifies how much deviation is allowed when using the fast choice algorithm, in terms of THRESHOLD usage etc.
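As referenced above, a minimal sketch of registering an event callback with mmaddcallback. The script path is hypothetical and the event name for the tiebreaker decision is assumed to be tiebreakerCheck; confirm the exact event names in the 3.5 mmaddcallback documentation:

    # Run a site-provided script when GPFS must decide which side of a network partition survives
    mmaddcallback clusterTiebreaker \
        --command /usr/local/sbin/decide_survivor.sh \
        --event tiebreakerCheck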
Misc
Snapshot Clones
– Quick, efficient way of making a file copy by creating a clone
• Doesn’t copy data blocks (e.g. fits well with VM images)
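A short sketch of the clone workflow with mmclone (file names are hypothetical; the snap/copy subcommands follow the GPFS documentation and should be verified for the release in use):

    # Turn the source file into a read-only clone parent, then create space-efficient copies
    mmclone snap vm_master.img vm_master.snap
    mmclone copy vm_master.snap vm01.img
    mmclone copy vm_master.snap vm02.img
    # Display clone relationships
    mmclone show vm01.img vm02.img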
IPv6 support
Windows
– GPFS daemon no longer needs SUA (SUA is still required for GPFS admin commands)
SELinux support
API to access the xattrs of a file