The Lustre Storage Architecture
Linux Clusters for Super Computing, Linköping 2003



Page 1: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Peter J. Braam ([email protected]) and Tim Reddin ([email protected])

http://www.clusterfs.com

The Lustre Storage Architecture

Linux Clusters for Super Computing

Linköping 2003

Page 2: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Topics

History of project
High level picture
Networking
Devices and fundamental APIs
File I/O
Metadata & recovery
Project status

Page 3: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Lustre’s History

Page 4: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Project history

1999: CMU & Seagate
Worked with Seagate for one year
Storage management, clustering
Built prototypes, much design
Much of that work survives today

Page 5: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

2000-2002 File system challenge

First put forward Sep 1999 in Santa Fe
A new architecture for the National Labs
Characteristics:

Hundreds of GB/sec of I/O throughput
Trillions of files
Tens of thousands of nodes
Petabytes of capacity

From the start, Garth & Peter in the running

Page 6: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

2002 – 2003 fast lane

A 3-year ASCI Path Forward contract with HP and Intel

MCR & ALC: two 1000-node Linux clusters at LLNL
PNNL: an HP IA64, 1000-node Linux cluster
Red Storm, Sandia (8000 nodes, Cray)
Lustre Lite 1.0
Many partnerships (HP, Dell, DDN, …)

Page 7: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

2003 – Production, performance

Spring and summer:
LLNL MCR went from no use, to partial, to full-time use
PNNL similar
Stability much improved

Performance, summer 2003:
I/O problems tackled
Metadata much faster

Dec/Jan: Lustre 1.0

Page 8: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

High level picture

Page 9: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Lustre Systems – Major Components

Clients: have access to the file system; typical role: compute server

OST: object storage targets; handle (stripes of, references to) file data

MDS: metadata request transaction engine

Also: LDAP, Kerberos, routers, etc.

Page 10: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

[Diagram: Lustre clients (1,000 running Lustre Lite, scaling up to tens of thousands) connect over GigE and the QSW net to an MDS failover pair (MDS 1 active, MDS 2 failover) and to the Lustre Object Storage Targets (OST 1 through OST 7): Linux OST servers with disk arrays and 3rd-party OST appliances, attached to a SAN.]

Page 11: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

[Diagram: Clients use the Object Storage Targets (OSTs) for file I/O & file locking, and the Metadata Server (MDS) for recovery, file status, & file creation; the MDS handles directory operations, metadata, & concurrency; an LDAP server holds configuration information, network connection details, & security management.]

Page 12: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Networking

Page 13: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Lustre Networking

Currently runs over: TCP, Quadrics Elan 3 & 4
Lustre can route & can use heterogeneous nets

Beta: Myrinet, SCI

Under development: SAN (FC/iSCSI), InfiniBand

Planned: SCTP, some special NUMA and other nets

Page 14: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Lustre Network Stack - Portals

The stack, top to bottom, with what each layer provides:

Lustre Request Processing: 0-copy marshalling libraries, service framework, client request dispatch, connection & address naming, generic recovery infrastructure

NIO API: moves small & large buffers, handles remote DMA, generates events

Portal Library: Sandia's API, CFS improved implementation

Portal NALs: network abstraction layer for TCP, QSW, etc.; small & hard; includes the routing API

Device Library (Elan, Myrinet, TCP, ...)
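
Since the slide gives only the layer names, here is a minimal sketch in C of the shape a NAL could take: one fixed vtable that very different transports implement, so everything above it stays transport-agnostic. All names and signatures below are hypothetical illustrations, not the actual Portals or Lustre API.

```c
/* Illustrative sketch only: a plausible shape for a Portals NAL
 * (Network Abstraction Layer) vtable. Hypothetical names, not the
 * real Portals/Lustre API. */
#include <stddef.h>
#include <stdio.h>

typedef struct nal_ops {
    const char *name;                                       /* "tcp", "qswnal", ... */
    int (*send)(void *nid, const void *buf, size_t n);      /* small & large buffers */
    int (*rdma_put)(void *nid, const void *buf, size_t n);  /* remote DMA handling */
    void (*event)(int type);                                /* generate events */
} nal_ops_t;

static int tcp_send(void *nid, const void *buf, size_t n) {
    (void)nid; (void)buf;
    printf("tcp: sending %zu bytes\n", n);
    return 0;
}
static int tcp_rdma_put(void *nid, const void *buf, size_t n) {
    (void)nid; (void)buf;
    printf("tcp: emulating RDMA put of %zu bytes\n", n);
    return 0;
}
static void tcp_event(int type) { printf("tcp: event %d\n", type); }

static const nal_ops_t tcp_nal = { "tcp", tcp_send, tcp_rdma_put, tcp_event };

int main(void) {
    char payload[64] = {0};
    /* The portal library above this layer never sees transport details. */
    tcp_nal.send(NULL, payload, sizeof payload);
    tcp_nal.rdma_put(NULL, payload, sizeof payload);
    tcp_nal.event(1);
    return 0;
}
```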

Page 15: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Devices and APIs

Page 16: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Lustre Devices & APIs

Lustre has numerous driver modules
One API, very different implementations
A driver binds to a named device
Stacking devices is key (see the sketch below)
Generalized "object devices"

Drivers currently export several APIs:
Infrastructure (a mandatory API)
Object Storage
Metadata Handling
Locking
Recovery
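
A minimal C sketch of the stacking idea, assuming hypothetical names rather than real Lustre structures: a logical driver (here, an LOV-like striping layer) sits on top of lower object devices that export the same API, so one call cascades down the stack.

```c
/* Illustrative sketch only: "stacked" object devices. */
#include <stdio.h>

struct obd_device;

typedef struct obd_ops {
    int (*connect)(struct obd_device *dev);
} obd_ops_t;

typedef struct obd_device {
    const char *name;           /* a driver binds to a named device */
    const obd_ops_t *ops;       /* one API, different implementations */
    struct obd_device **lower;  /* stacking: devices below this one */
    int nlower;
} obd_device_t;

static int osc_connect(obd_device_t *dev) {
    printf("%s: connecting to its OST\n", dev->name);
    return 0;
}

static int lov_connect(obd_device_t *dev) {
    /* the logical volume just fans out to the stacked lower devices */
    for (int i = 0; i < dev->nlower; i++)
        dev->lower[i]->ops->connect(dev->lower[i]);
    return 0;
}

static const obd_ops_t osc_ops = { osc_connect };
static const obd_ops_t lov_ops = { lov_connect };

int main(void) {
    obd_device_t osc1 = { "OSC1", &osc_ops, NULL, 0 };
    obd_device_t osc2 = { "OSC2", &osc_ops, NULL, 0 };
    obd_device_t *oscs[] = { &osc1, &osc2 };
    obd_device_t lov = { "LOV", &lov_ops, oscs, 2 };
    lov.ops->connect(&lov);  /* one call cascades down the stack */
    return 0;
}
```

The point the slide makes is exactly this: because every layer speaks the same object API, drivers compose freely.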

Page 17: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

[Diagram: Lustre Clients & APIs. On top sits the Lustre file system (Linux) or the Lustre library (Windows, Unix, microkernels). Data objects & locks flow through the Logical Object Volume (LOV driver), which fans out to OSC1 … OSCn; metadata & locks flow through MDCs, which sit under a clustered MD driver.]

Page 18: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Object Storage API

Objects are (usually) unnamed files
Improves on the block device API:
create, destroy, setattr, getattr, read, write
The OBD driver does block/extent allocation
Implementation: Linux drivers, using a file system backend
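
To make the API surface concrete, here is a compile-clean C sketch of those six methods as an interface. The signatures are assumptions for illustration; real OBD drivers are Linux kernel modules with their own types.

```c
/* Illustrative sketch only: the object storage methods the slide lists,
 * expressed as a C interface with hypothetical signatures. */
#include <stdint.h>
#include <sys/types.h>

typedef uint64_t objid_t;

struct obd_attr { uint64_t size; uint32_t mode; };

typedef struct obd_storage_ops {
    int (*create)(objid_t *out);        /* allocate a new, unnamed object */
    int (*destroy)(objid_t id);
    int (*setattr)(objid_t id, const struct obd_attr *a);
    int (*getattr)(objid_t id, struct obd_attr *a);
    /* unlike a block device, the driver does block/extent allocation
     * itself, so callers address (object, offset), not disk blocks */
    ssize_t (*read)(objid_t id, void *buf, size_t n, off_t off);
    ssize_t (*write)(objid_t id, const void *buf, size_t n, off_t off);
} obd_storage_ops_t;
```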

Page 19: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Bringing it all together

[Diagram: An OST node runs networking, an object-based disk server (OBD server), a lock server, recovery, and load balancing over an ext3, Reiser, XFS, or other file system backend, attached via Fibre Channel. The MDS node runs networking, the MDS server, a lock server, and recovery over the same choice of backends and Fibre Channel storage. The Lustre client file system combines OSCs, an MDC, a lock client, a metadata write-back cache, recovery, and the network stack (device layer for Elan, TCP, etc., Portal NALs, Portal library, NIO API, request processing). Clients do system & parallel file I/O and file locking against the OSTs, and recovery, file status, and file creation against the MDS; directory metadata & concurrency are handled with the MDS.]

Page 20: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

File I/O

Page 21: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

File I/O – Write Operation

Open the file on the metadata server
Get information on all objects that are part of the file:
object IDs
which storage controllers (OSTs)
what part of the file (offset)
striping pattern

Create LOV, OSC drivers; use the connection to the OST
Object writes go to the OSTs, with no MDS involvement at all (see the sketch below)
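
To make the division of labor concrete, here is a minimal C sketch of that control flow. Every function and structure is a hypothetical stand-in for illustration, not the real client API.

```c
/* Illustrative sketch only: the control flow of a Lustre file write as
 * the slide describes it: open via the MDS, then write straight to OSTs. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct stripe { int ost; uint64_t objid; };
struct layout { struct stripe stripes[2]; int nstripes; uint64_t stripe_size; };

/* Step 1: open on the metadata server; the MDS returns the object layout. */
static struct layout mds_open(const char *path) {
    printf("MDS: open %s\n", path);
    struct layout l = { { {1, 101}, {3, 102} }, 2, 1 << 20 };
    return l;
}

/* Step 2: write each stripe directly to its OST; the MDS is not involved. */
static void ost_write(int ost, uint64_t objid, const void *buf, size_t n) {
    (void)buf;
    printf("OST %d: write %zu bytes to object %llu\n",
           ost, n, (unsigned long long)objid);
}

int main(void) {
    char data[4096] = {0};
    struct layout l = mds_open("/mnt/lustre/file");
    for (int i = 0; i < l.nstripes; i++)
        ost_write(l.stripes[i].ost, l.stripes[i].objid, data, sizeof data);
    return 0;
}
```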

Page 22: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

[Diagram: The Lustre client stacks a file system over an LOV with OSC 1 and OSC 2, plus an MDC. The file open request goes from the MDC to the metadata server (MDS), whose file system stores the file metadata: inode A = {(O1, obj1), (O3, obj2)}. The client then sends write(obj 1) and write(obj 2) directly to the corresponding OSTs (OST 1 and OST 3 here; OST 2 holds no stripe of this file).]
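
The layout {(O1, obj1), (O3, obj2)} implies a mapping from file offsets to (object, object offset) pairs. A minimal sketch of the usual RAID-0-style round-robin arithmetic, assuming a 1 MiB stripe size (the slides do not specify one):

```c
/* Illustrative sketch only: round-robin striping arithmetic for mapping
 * a file offset to (stripe index, object offset). Stripe size assumed. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t stripe_size  = 1 << 20;  /* assumed 1 MiB stripes */
    const uint64_t stripe_count = 2;        /* inode A: (O1,obj1),(O3,obj2) */
    uint64_t file_off = 3u << 20;           /* byte 3 MiB into the file */

    uint64_t chunk      = file_off / stripe_size;   /* which stripe chunk */
    uint64_t stripe_idx = chunk % stripe_count;     /* which object */
    uint64_t obj_off    = (chunk / stripe_count) * stripe_size
                        + file_off % stripe_size;   /* offset in the object */

    printf("file offset %llu -> stripe %llu, object offset %llu\n",
           (unsigned long long)file_off,
           (unsigned long long)stripe_idx,
           (unsigned long long)obj_off);
    /* prints: file offset 3145728 -> stripe 1, object offset 1048576 */
    return 0;
}
```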

Page 23: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

I/O bandwidth

100s of GB/sec implies saturating many hundreds of OSTs
OSTs:
do ext3 extent allocation with non-caching direct I/O
spread lock management over the cluster

Achieve 90-95% of network throughput
Single client, single thread, Elan3: writes at 269 MB/sec
OSTs handle up to 260 MB/sec writes, without the extent code, on a 2-way 2.4 GHz Xeon
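
A sanity check on those figures: at roughly 260 MB/sec per OST, 100 GB/sec of aggregate bandwidth works out to about 100,000 / 260 ≈ 385 OSTs, which is indeed "many hundreds".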

Page 24: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Metadata

Page 25: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Intent locks & Write Back caching

Clients – MDS: protocol adaptation

Low concurrency: write-back caching
The client makes in-memory updates, with delayed replay to the MDS

High concurrency (mostly merged in 2.6):
Single network request per transaction
No lock revocations to clients
The intent-based lock includes the complete request

Page 26: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

[Diagram: (a) Conventional mkdir: the client's lookup goes over the network to the file server, then a separate mkdir request follows to create the directory: two round trips. (b) Lustre mkdir: the Lustre client sends one lookup carrying an intent mkdir; on the metadata server, the lock module exercises the intent and calls mds_mkdir: one round trip.]
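
As an illustration of why the intent collapses two round trips into one, here is a hedged C sketch: the lookup request carries the complete mkdir arguments, so the server can execute the operation during the lookup. The structures are hypothetical, not the real Lustre wire format.

```c
/* Illustrative sketch only: an intent-carrying lookup request. */
#include <stdio.h>
#include <string.h>

enum intent_op { INTENT_NONE, INTENT_MKDIR };

struct lookup_req {
    char name[256];
    enum intent_op it_op;   /* what the client intends to do next */
    unsigned it_mode;       /* complete mkdir arguments ride along */
};

/* Server side: one message, one reply; no second RPC needed. */
static int server_lookup(const struct lookup_req *req) {
    printf("server: lookup %s\n", req->name);
    if (req->it_op == INTENT_MKDIR) {
        printf("server: exercising intent: mkdir %s (mode %o)\n",
               req->name, req->it_mode);
    }
    return 0;
}

int main(void) {
    struct lookup_req req = { "", INTENT_MKDIR, 0755 };
    strcpy(req.name, "newdir");
    return server_lookup(&req);  /* single network request per transaction */
}
```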

Page 27: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Lustre 1.0

Has only the high-concurrency model
Aggregate throughput (1,000 clients):
achieves ~5,000 file creations (open/close) per second
achieves ~7,800 stats per second across 10 directories of 1M files each

Single client: around 1,500 creations or stats per second

Handling 10M-file directories is effortless
Many changes to ext3 (all merged in 2.6)

Page 28: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Metadata Future

Lustre 2.0 (2004):

Metadata clustering: common operations will parallelize

100% write-back caching, in memory or on disk, like AFS

Page 29: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Recovery

Page 30: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Recovery approach

Keep it simple!
Based on failover circles:
use existing failover software
your left working neighbor is your failover node (see the sketch below)

At HP we use failover pairs, to simplify storage connectivity

An I/O failure triggers:
the peer node serves the failed OST
the client's retry is routed to the new OST node
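
The "left working neighbor" rule is easy to state in code. A toy sketch, assuming a simple ring of nodes (the real implementation relies on existing failover software, as the slide says):

```c
/* Illustrative sketch only: the failover-circle rule, where the left
 * working neighbor takes over a failed node's OSTs. */
#include <stdio.h>
#include <stdbool.h>

#define NODES 4

int main(void) {
    bool alive[NODES] = { true, true, false, true };  /* node 2 has failed */
    int failed = 2;

    /* walk left around the ring until a working neighbor is found */
    int takeover = failed;
    for (int step = 1; step < NODES; step++) {
        int cand = (failed - step + NODES) % NODES;
        if (alive[cand]) { takeover = cand; break; }
    }
    printf("node %d serves the OSTs of failed node %d\n", takeover, failed);
    return 0;
}
```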

Page 31: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

OST Server – redundant pair

[Diagram: OST 1 and OST 2 each connect through two FC switches to shared controllers C1 and C2, so either server can reach either controller.]

Page 32: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Configuration

Page 33: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Lustre 1.0

Good tools to build a configuration
Configuration is recorded on the MDS, or on a dedicated management server
Configuration can be changed; 1.0 requires downtime for this

Clients auto-configure:
mount -t lustre -o … mds://fileset/sub/dir /mnt/pt

SNMP support

Page 34: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Futures

Page 35: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Advanced Management

Snapshots: all the features you might expect

Global namespace: combines the best of AFS & autofs4

HSM, hot migration: driven by customer demand (we plan XDSM)

Online 0-downtime re-configuration: part of Lustre 2.0

Page 36: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Security

Authentication
POSIX-style authorization
NASD-style OST authorization
Refinement: use OST ACLs and cookies
File encryption with a group key service
STK secure file system

Page 37: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Project status

Page 38: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Lustre Feature Roadmap

| Lustre (Lite) 1.0 (Linux 2.4 & 2.6), 2003 | Lustre 2.0 (2.6), 2004 | Lustre 3.0, 2005 |
| Failover MDS | Metadata cluster | Metadata cluster |
| Basic Unix security | Basic Unix security | Advanced security |
| File I/O very fast (~100s of OSTs) | Collaborative read cache | Storage management |
| Intent-based scalable metadata | Write-back metadata | Load-balanced MD |
| POSIX compliant | Parallel I/O | Global namespace |

Page 39: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Cluster File Systems, Inc.

Page 40: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Cluster File Systems

A small service company: 20-30 people
Software development & service (95% Lustre)
Contract work for government labs
OSS, but defense contracts

Extremely specialized, extreme expertise: we only do file systems and storage

Investments: not needed; profitable
Partners: HP, Dell, DDN, Cray

Page 41: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Lustre – conclusions

A great vehicle for advanced storage software
Things are done differently:
protocols & design from Coda & InterMezzo
stacking & DB recovery theory applied

Leverages existing components
Initial signs promising

Page 42: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

HP & Lustre

Two projects:
ASCI PathForward (Hendrix)
Lustre storage product

Field trial in Q1 of '04

Page 43: The Lustre Storage Architecture Linux Clusters for Super Computing Linköping 2003

Questions?