Cluster File Systems, Inc.
Peter J. Braam, Tim Reddin
[email protected], [email protected]
http://www.clusterfs.com
The Lustre Storage Architecture
Linux Clusters for Super Computing
Linköping 2003
2 - NSC 2003
Topics
History of project
High level picture
Networking
Devices and fundamental APIs
File I/O
Metadata & recovery
Project status
3 - NSC 2003
Lustre’s History
4 - NSC 2003
Project history
1999: CMU & Seagate
Worked with Seagate for one year
Storage management, clustering
Built prototypes, much design
Much survives today
5 - NSC 2003
2000-2002 File system challenge
First put forward Sep 1999 in Santa Fe
New architecture for the National Labs
Characteristics:
100's of GB/sec of I/O throughput
Trillions of files
10,000's of nodes
Petabytes
From the start, Garth & Peter were in the running
6 - NSC 2003
2002 – 2003: fast lane
3-year ASCI PathForward contract with HP and Intel
MCR & ALC: two 1000-node Linux clusters
PNNL: HP IA64, 1000-node Linux cluster
Red Storm, Sandia (8000 nodes, Cray)
Lustre Lite 1.0
Many partnerships (HP, Dell, DDN, …)
7 - NSC 2003
2003 – Production, performance
Spring and summer:
LLNL MCR went from no, to partial, to full time use
PNNL similar
Stability much improved
Performance, summer 2003:
I/O problems tackled
Metadata much faster
Dec/Jan: Lustre 1.0
8 - NSC 2003
High level picture
9 - NSC 2003
Lustre Systems – Major Components
Clients: have access to the file system; typical role: compute server
OST: Object Storage Targets; handle (stripes of, references to) file data
MDS: Meta-Data Server; metadata request transaction engine
Also: LDAP, Kerberos, routers, etc.
10 - NSC 2003
[Diagram: cluster layout. Up to 10,000's of Lustre clients (1,000 running Lustre Lite) connect over GigE and a QSW net to MDS 1 (active) / MDS 2 (failover) and to the Lustre Object Storage Targets (OST 1 – OST 7): Linux OST servers with disk arrays or 3rd party OST appliances, attached via SAN.]
11 - NSC 2003
[Diagram: component interactions. Clients perform file I/O & file locking against the Object Storage Targets (OST), and directory operations, meta-data & concurrency against the Meta-data Server (MDS); recovery, file status & file creation pass between the MDS and the OSTs; an LDAP Server holds configuration information, network connection details, & security management.]
12 - NSC 2003
Networking
13 - NSC 2003
Lustre Networking
Currently runs over: TCP, Quadrics Elan 3 & 4
Lustre can route & can use heterogeneous nets
Beta: Myrinet, SCI
Under development: SAN (FC/iSCSI), InfiniBand
Planned: SCTP, some special NUMA and other nets
14 - NSC 2003
Lustre Network Stack - Portals
[Stack diagram, bottom to top:]
Device Library (Elan, Myrinet, TCP, ...)
Portal NALs: Network Abstraction Layer for TCP, QSW, etc.; small & hard; includes routing API
Portal Library: Sandia's API, CFS improved implementation
NIO API: move small & large buffers, remote DMA handling, generate events
Lustre Request Processing: 0-copy marshalling libraries, service framework, client request dispatch, connection & address naming, generic recovery infrastructure
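To make the layering concrete, here is a minimal sketch of what a per-transport operations table might look like, so that TCP, Elan or Myrinet can sit under one portal library. Names and signatures are assumptions for illustration, not the real Portals NAL interface.

```c
/* Minimal sketch of a hypothetical NAL (Network Abstraction Layer)
 * operations table.  Names and signatures are illustrative only; the
 * real Portals NAL interface differs. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t nal_nid_t;              /* network-wide node id */

struct nal_ops {
    int  (*startup)(void *nal, nal_nid_t self);
    void (*shutdown)(void *nal);
    /* Move a small or large buffer to a peer; completion is reported
     * asynchronously as an event. */
    int  (*send)(void *nal, nal_nid_t peer, const void *buf, size_t len,
                 void *event_cookie);
    /* Post a buffer for receive / remote DMA. */
    int  (*post_recv)(void *nal, void *buf, size_t len, void *event_cookie);
};

struct nal {
    const char           *name;   /* "tcp", "elan", "myrinet", ... */
    const struct nal_ops *ops;    /* transport-specific implementation */
    void                 *priv;   /* driver-private state */
};
```

Because every transport hides behind the same small table, the upper layers (and routing between heterogeneous nets) only have to deal with one interface.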
15 - NSC 2003
Devices and APIs
16 - NSC 2003
Lustre Devices & APIs
Lustre has numerous driver modules
One API, very different implementations
A driver binds to a named device
Stacking devices is key
Generalized “object devices”
Drivers currently export several APIs:
Infrastructure (a mandatory API)
Object Storage
Metadata Handling
Locking
Recovery
17 - NSC 2003
Lustre Clients & APIs
[Diagram: client driver stack. The Lustre File System (Linux) or Lustre Library (Win, Unix, microkernels) sits on a Logical Object Volume (LOV driver), which fans out to OSC1 … OSCn for data object & lock traffic, and on a clustered MD driver with one or more MDCs for metadata & lock traffic.]
18 - NSC 2003
Object Storage API
Objects are (usually) unnamed files
Improves on the block device API
Operations: create, destroy, setattr, getattr, read, write
OBD driver does block/extent allocation
Implementation: Linux drivers, using a file system backend
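To make the shape of this API concrete, here is a minimal sketch under assumed names (not the actual Lustre OBD structures): an object device is a table of the operations listed above, acting on unnamed objects identified by an id.

```c
/* Minimal sketch of an object-storage driver interface, modelled on the
 * operations named above.  Structure and names are hypothetical, not the
 * actual Lustre OBD definitions. */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>          /* ssize_t */

typedef uint64_t obj_id_t;      /* identifier of an (unnamed) object */

struct obj_attr {
    uint64_t size;              /* object size in bytes */
    uint32_t mode;              /* permission bits */
    uint64_t mtime;             /* last modification time */
};

/* One "object device"; the driver behind it decides how blocks/extents
 * are allocated (e.g. one backend file system file per object). */
struct obd_ops {
    int     (*create) (void *dev, obj_id_t *new_id);
    int     (*destroy)(void *dev, obj_id_t id);
    int     (*setattr)(void *dev, obj_id_t id, const struct obj_attr *attr);
    int     (*getattr)(void *dev, obj_id_t id, struct obj_attr *attr);
    ssize_t (*read)   (void *dev, obj_id_t id, void *buf, size_t len,
                       uint64_t offset);
    ssize_t (*write)  (void *dev, obj_id_t id, const void *buf, size_t len,
                       uint64_t offset);
};
```

Stacking then falls out naturally: a logical driver such as the LOV can export this same table while forwarding each call to one or more lower object devices.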
19 - NSC 2003
Bringing it all together
[Diagram: the full picture. A Lustre client (client file system, metadata WB cache, OSCs, MDC, lock client, recovery) runs over the Portals stack (device: Elan, TCP, …; Portal NALs; Portal Library; NIO API; request processing). It performs system & parallel file I/O and file locking against the OSTs (object-based disk server, lock server, recovery, load balancing, networking, with an ext3/Reiser/XFS backend over Fibre Channel), and directory, metadata & concurrency operations against the MDS (MDS server, lock server, recovery, networking, ext3/Reiser/XFS backend over Fibre Channel); recovery, file status & file creation pass between the MDS and the OSTs.]
20 - NSC 2003
File I/O
21 - NSC 2003
File I/O – Write Operation
Open the file on the meta-data server
Get information on all objects that are part of the file:
Object IDs
Which storage controllers (OSTs)
Which part of the file (offset)
Striping pattern
Create LOV, OSC drivers
Use the connection to the OST
Object writes go to the OSTs; no MDS involvement at all
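The striping step itself is simple arithmetic. The sketch below is illustrative only (the stripe size, stripe count and round-robin layout are assumptions, not taken from the slides): it maps a logical file offset to an object index and an offset within that object, which is what lets the client write to the right OST without asking the MDS again.

```c
/* Illustrative RAID-0 style striping arithmetic.  The stripe size, stripe
 * count and round-robin layout here are assumptions for the example; in
 * Lustre the layout comes from the MDS. */
#include <stdint.h>
#include <stdio.h>

struct stripe_map {
    uint32_t stripe_count;   /* number of objects, one per OST */
    uint64_t stripe_size;    /* bytes per object before moving to the next */
};

/* Map a logical file offset to (object index, offset inside that object). */
static void map_offset(const struct stripe_map *m, uint64_t file_off,
                       uint32_t *obj_idx, uint64_t *obj_off)
{
    uint64_t stripe_no = file_off / m->stripe_size;       /* which stripe unit */
    *obj_idx = (uint32_t)(stripe_no % m->stripe_count);   /* which object/OST  */
    *obj_off = (stripe_no / m->stripe_count) * m->stripe_size
             + file_off % m->stripe_size;
}

int main(void)
{
    struct stripe_map m = { .stripe_count = 2, .stripe_size = 1 << 20 };
    uint32_t idx;
    uint64_t off;

    map_offset(&m, (uint64_t)3 * (1 << 20) + 4096, &idx, &off);
    printf("file offset 3 MiB + 4 KiB -> object %u, offset %llu\n",
           idx, (unsigned long long)off);
    return 0;
}
```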
22 - NSC 2003
[Diagram: write operation. The Lustre client (file system, LOV, OSC 1, OSC 2, MDC) sends a file open request to the meta-data server (MDS) and receives the file meta-data, e.g. Inode A {(O1, obj1), (O3, obj2)}; it then issues Write (obj 1) and Write (obj 2) directly to the corresponding OSTs (OST 1, OST 2, OST 3).]
23 - NSC 2003
I/O bandwidth
100's of GB/sec => saturate many 100's of OSTs
OSTs:
Do ext3 extent allocation, non-caching direct I/O
Lock management spread over the cluster
Achieve 90-95% of network throughput
Single client, single thread, Elan3: 269 MB/sec writes
OSTs handle up to 260 MB/sec writes without the extent code, on a 2-way 2.4 GHz Xeon
24 - NSC 2003
Metadata
25 - NSC 2003
Intent locks & Write Back caching
Clients – MDS: protocol adaptation
Low concurrency: write back caching
Client makes in-memory updates, with delayed replay to the MDS
High concurrency (mostly merged in 2.6):
Single network request per transaction
No lock revocations to clients
Intent based lock includes the complete request
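A minimal sketch of what "intent based lock includes the complete request" might look like on the wire, with hypothetical structure and field names (the real Lustre formats differ): the lock RPC carries the whole operation, so the server can take the lock, execute the intent and reply in a single round trip.

```c
/* Hypothetical wire format for an intent-carrying lock request.  Names
 * and fields are illustrative; real Lustre structures differ. */
#include <stdint.h>

enum intent_op { INTENT_LOOKUP, INTENT_MKDIR, INTENT_CREATE, INTENT_OPEN };

struct intent {
    enum intent_op op;       /* what the client wants to do            */
    uint64_t parent_ino;     /* directory the operation applies to     */
    char     name[256];      /* last path component, e.g. "newdir"     */
    uint32_t mode;           /* permissions for mkdir/create           */
};

struct lock_request {
    uint64_t resource;       /* resource (e.g. parent inode) to lock   */
    uint32_t lock_mode;      /* requested lock mode                    */
    struct intent it;        /* the embedded intent                    */
};

struct lock_reply {
    int      intent_status;  /* result of the lookup/mkdir/...         */
    uint64_t lock_handle;    /* granted lock, if any                   */
    uint64_t new_ino;        /* inode created by the intent, if any    */
};
```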
26 - NSC 2003
[Diagram: a) Conventional mkdir: the client sends a lookup request and then a mkdir request across the network to the file server, which does the lookup and creates the directory (two round trips). b) Lustre mkdir: the Lustre client sends a single lookup with an mkdir intent; the meta-data server's lock module exercises the intent and performs mds_mkdir (one round trip).]
27 - NSC 2003
Lustre 1.0
Only has the high concurrency model
Aggregate throughput (1,000 clients):
Achieve ~5,000 file creations (open/close) per second
Achieve ~7,800 stats in 10 x 1M-file directories
Single client: around 1,500 creations or stats per second
Handling 10M-file directories is effortless
Many changes to ext3 (all merged in 2.6)
28 - NSC 2003
Metadata Future
Lustre 2.0 – 2004
Metadata clustering: common operations will parallelize
100% WB caching, in memory or on disk (like AFS)
30 - NSC 2003
Recovery
31 - NSC 2003
Recovery approach
Keep it simple!
Based on failover circles:
Use existing failover software
Your left working neighbor is your failover node
At HP we use failover pairs
Simplifies storage connectivity
I/O failure triggers:
The peer node serves the failed OST
The retry from the client is routed to the new OST node
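A small sketch of the failover-circle idea, with hypothetical names (illustrative only, not Lustre's actual failover logic): an OST's failover target is its nearest working neighbor to the left, wrapping around the circle.

```c
/* Illustrative sketch of the failover-circle idea: an OST's failover
 * target is its nearest working neighbor to the left, wrapping around.
 * Hypothetical code, not Lustre's actual failover logic. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_OST 7

/* Return the node that takes over for 'failed', or -1 if none is alive. */
static int failover_target(const bool alive[NUM_OST], int failed)
{
    for (int step = 1; step < NUM_OST; step++) {
        int candidate = (failed - step + NUM_OST) % NUM_OST;  /* step left */
        if (alive[candidate])
            return candidate;
    }
    return -1;
}

int main(void)
{
    bool alive[NUM_OST] = { true, true, false, true, true, true, true };
    printf("OST %d is served by OST %d\n", 2, failover_target(alive, 2));
    return 0;
}
```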
32 - NSC 2003
OST Server – redundant pair
[Diagram: OST server redundant pair. OST 1 and OST 2 are each connected through two FC switches to shared storage (C1, C2), so either server can take over the other's OST.]
33 - NSC 2003
Configuration
34 - NSC 2003
Lustre 1.0
Good tools to build the configuration
Configuration is recorded on the MDS
Or on a dedicated management server
Configuration can be changed
1.0 requires downtime for changes
Clients auto configure:
mount -t lustre -o … mds://fileset/sub/dir /mnt/pt
SNMP support
35 - NSC 2003
Futures
36 - NSC 2003
Advanced Management
Snapshots: all the features you might expect
Global namespace: combine the best of AFS & autofs4
HSM, hot migration: driven by customer demand (we plan XDSM)
Online 0-downtime re-configuration: part of Lustre 2.0
38 - NSC 2003
Security
Authentication
POSIX style authorization
NASD style OST authorization
Refinement: use OST ACLs and cookies
File encryption with a group key service
STK secure file system
43 - NSC 2003
Project status
44 - NSC 2003
Lustre Feature Roadmap
Lustre (Lite) 1.0 (Linux 2.4 & 2.6), 2003:
Failover MDS
Basic Unix security
File I/O very fast (~100's of OSTs)
Intent based scalable metadata
POSIX compliant

Lustre 2.0 (Linux 2.6), 2004:
Metadata cluster
Basic Unix security
Collaborative read cache
Write back metadata
Parallel I/O

Lustre 3.0, 2005:
Metadata cluster
Advanced security
Storage management
Load balanced MD
Global namespace
45 - NSC 2003
Cluster File Systems, Inc.
46 - NSC 2003
Cluster File Systems
Small service company: 20-30 people
Software development & service (95% Lustre)
Contract work for government labs
OSS, but defense contracts
Extremely specialized, with deep expertise: we only do file systems and storage
Investments: not needed; profitable
Partners: HP, Dell, DDN, Cray
47 - NSC 2003
Lustre – conclusions
Great vehicle for advanced storage software
Things are done differently:
Protocols & design from Coda & InterMezzo
Stacking & DB recovery theory applied
Leverage existing components
Initial signs promising
48 - NSC 2003
HP & Lustre
Two projects:
ASCI PathForward (Hendrix)
Lustre Storage product
Field trial in Q1 of '04
49 - NSC 2003
Questions?