Introduction
• Who is Isilon?
• What Problems Are We Solving? (Market Opportunity)
• Who Has These Problems? (Our Customers)
• What Is Our Solution? (Our Product)
• How Does It Work? (The Cool Stuff)
Who is Isilon Systems?
• Founded in 2000
• Located in Seattle (Queen Anne)
• IPO’d in 2006 (ISLN)
• ~400 employees
• Q3 2008 Revenue: $30 million, 40% Y/Y
• Co-founded by Paul Mikesell, UW/CSE
• I’ve been at the company for 6+ years
What Problems Are We Solving?
Structured Data
• Small files
• Modest-size data stores
• I/O intensive
• Transactional
• Steady capacity growth
Unstructured Data
• Larger files
• Very large data stores
• Throughput intensive
• Sequential
• Explosive capacity growth
Traditional Architectures
• Data Organized in Layers of Abstraction
• File System, Volume Manager, RAID
• Server/Storage Architecture: “Head” and “Disk”
• Scale Up (vs Scale Out)
• Islands of Storage
• Hard to Scale
• Performance Bottlenecks
• Not Highly Available
• Overly Complex
• Cost Prohibitive
[Diagram: islands of storage: Storage Device #1, #2, #3]
Who Has These Problems?
[Chart (PB): file-based storage growing at 79.3% CAGR vs. block-based at 31% CAGR]
By 2011, 75% of all storage capacity sold will be for file-based data
* Source: IDC, 2007
• Isilon has over 850 customers today.
What is Our Solution?
• Scales to 96 nodes: 2.3 PB in a single file system, 20 GB/s aggregate throughput
[Diagram: a 3-node Isilon IQ cluster]
• Clustered Storage: Isilon IQ enterprise-class hardware running OneFS™ intelligent software
Clustered Storage Consists Of “Nodes”
• Largely Commodity Hardware
  • Quad-core 2.3 GHz CPU
  • 4 GB memory read cache
  • GbE and 10GbE for front-end network
  • 12 disks per node
  • InfiniBand for intra-cluster communication
  • High-speed NVRAM journal
  • Hot-swappable disks, power supplies, and fans
• NFS, CIFS, HTTP, FTP
• Integrates with Windows and UNIX
• OneFS operating system
Isilon Network Architecture
[Diagram: clients connect to cluster nodes over Ethernet via CIFS, NFS, or either protocol]
• Drop-in replacement for any NAS device
• No client-side drivers required (unlike Andrew FS/Coda or Lustre)
• No application changes (unlike Google FS or Amazon S3)
• No changes required to adopt.
How Does It Work?
• Built on FreeBSD 6.x (originally 5.x)
• New kernel module for OneFS
• Modifications to the kernel proper
• User-space applications
• Leverage open source where possible
• Almost all of the heavy lifting is in the kernel
• Commodity Hardware
• A few exceptions:
  • A high-speed NVRAM journal for data consistency
  • An InfiniBand low-latency cluster interconnect
  • A close-to-commodity SAS card (commodity chips)
  • A custom monitoring board (fans, temps, voltages, etc.)
  • SAS and SATA disks
OneFS architecture
• Fully Distributed
• Top Half (Initiator): network operations (TCP, NFS, CIFS); FEC calculations and block reconstruction; VFS layer, locking, etc.; file-indexed cache
• Bottom Half (Participant): journal and disk operations; block-indexed cache
• The OneFS architecture is basically an InfiniBand SAN
• All data access across the back-end network is block-level
• The participants act as very smart disk drives
• Much of the back-end data traffic can be RDMA
OneFS architecture
• OneFS started from UFS (aka FFS)
  • Generalized for a distributed system
  • Little resemblance in the code today, but the concepts are there
  • Almost all data structures are trees
• OneFS knows everything: no volume manager, no RAID
  • The lack of abstraction lets us do interesting things, but forces the file system to know a lot (everything)
• Cache/Memory Architecture Split
  • “Level 1”: file cache (cached as part of the vnode)
  • “Level 2”: block cache (local or remote disk blocks)
  • Memory also used for a high-speed write coalescer
• Much more resource intensive than a local FS
Atomicity/Consistency Guarantees
• POSIX file system
  • Namespace operations are atomic
  • fsync/sync operations are guaranteed synchronous
• FS data is either mirrored or FEC-protected
  • Metadata is always mirrored, up to 8x
  • User data can be mirrored (up to 8x) or FEC-protected up to +4
  • We use Reed-Solomon codes for FEC
• Protection level can be chosen on a per-file or per-directory basis
  • Some files can be at 1x (no protection) while others can be at +4 (survive four failures)
  • Metadata must be protected at least as highly as anything it refers to
• All writes go to NVRAM first as part of a distributed transaction, guaranteed to commit or abort
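To make the protection model concrete, here is a toy sketch. OneFS uses Reed-Solomon coding for up to +4 protection; the code below shows only the simplest +1 case, where the single parity block reduces to an XOR. The function names and block sizes are illustrative, not Isilon's implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def make_protection_group(data_blocks):
    """Append one parity block to N data blocks (the +1 protection case;
    Reed-Solomon generalizes this to up to 4 parity blocks)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def reconstruct(group, lost_index):
    """Rebuild any single missing block from the survivors."""
    return xor_blocks([b for i, b in enumerate(group) if i != lost_index])

blocks = [b"AAAA", b"BBBB", b"CCCC"]
group = make_protection_group(blocks)
assert reconstruct(group, 1) == b"BBBB"  # any one lost block is recoverable
```

At +1, losing any single data or parity block leaves enough information to rebuild it; higher protection levels trade more parity overhead for surviving more simultaneous failures.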
Group Management
• Transactional way to handle state changes
• All nodes need to agree on their peers
• Group changes: split, merge, add, remove
• Group changes don’t “scale”, but are rare
[Diagram: an “add” group change: node 4 joins the group {1, 2, 3}]
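The agreement property above can be sketched as a two-phase group change: propose the new view to every member, and commit only if all of them can acknowledge it. This is a toy stand-in for OneFS's transactional group management, not its actual protocol; the `Node` class and fields are invented for illustration.

```python
class Node:
    def __init__(self, nid):
        self.nid, self.up, self.view = nid, True, frozenset()

def change_group(members, new_view):
    """Two-phase sketch: all members must ack the proposed view (phase 1)
    before any of them adopts it (phase 2), so every node always agrees
    on its peers."""
    # phase 1: propose; a down node cannot ack, so the change aborts
    if not all(n.up for n in members):
        return False
    # phase 2: commit; all members switch to the same view
    for n in members:
        n.view = frozenset(new_view)
    return True

nodes = [Node(i) for i in (1, 2, 3)]
assert change_group(nodes, {1, 2, 3})
four = Node(4)
assert change_group(nodes + [four], {1, 2, 3, 4})   # an "add" change
assert all(n.view == {1, 2, 3, 4} for n in nodes + [four])
```

Because the change either commits everywhere or aborts, no two nodes can disagree about the group, which is what makes splits, merges, adds, and removes safe even though they don't scale.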
Distributed Lock Manager
• Textbook-ish DLM
  • Anyone requesting a lock is an initiator.
  • The coordinator knows the definitive owner of the lock, and controls access to it.
  • The coordinator is chosen by a hash of the resource.
• Split/Merge behavior
  • Locks are lost at merge time, not split time.
  • Since POSIX has no lock-revoke mechanism, advisory locks are silently dropped.
  • The coordinator renegotiates on split/merge.
• Locking optimizations: “lazy locks”
  • Locks are cached.
  • Lock-lost callbacks.
  • Lock-contention callbacks.
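Choosing the coordinator by a hash of the resource can be sketched as below. The key property is that every node computes the same answer independently, with no communication. The specific hash and indexing scheme here are illustrative assumptions, not Isilon's actual algorithm.

```python
import hashlib

def coordinator_for(resource, group):
    """Deterministically map a lock resource to one node in the current
    group: hash the resource name and index into the sorted member list."""
    digest = hashlib.sha1(resource.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(group)
    return sorted(group)[index]

group = [1, 2, 3, 4]
node = coordinator_for("/ifs/data/file.txt", group)
assert node in group
# every node computes the same coordinator independently
assert coordinator_for("/ifs/data/file.txt", group) == node
```

When the group splits or merges, the member list changes, so some resources hash to a different coordinator; that is why the coordinator must renegotiate on group changes.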
RPC Mechanism
• Uses SDP on InfiniBand
• Batch System
  • Allows you to put dependencies on the remote side.
  • E.g. send 20 messages, checkpoint, send 20 messages.
  • Messages run in parallel, then synchronize, etc.
  • Coalesces errors.
• Async messages (callback)
• Sync messages
• Update messages (no response)
• Used by DLM, RBM, etc. (everything)
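The batch-and-checkpoint pattern can be sketched with ordinary threads: each batch of messages runs in parallel, a checkpoint waits for the whole batch, and errors are coalesced rather than aborting the run. This is a loose analogy for the RPC batch system, not its wire protocol; `send` is a hypothetical message handler.

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(batches, send):
    """Run each batch of messages in parallel; the checkpoint between
    batches synchronizes on the previous batch and coalesces its errors."""
    errors = []
    with ThreadPoolExecutor() as pool:
        for batch in batches:
            futures = [pool.submit(send, msg) for msg in batch]
            for f in futures:            # checkpoint: wait for the batch
                try:
                    f.result()
                except Exception as e:   # coalesce instead of aborting
                    errors.append(e)
    return errors

log = []
errs = run_batch([[1, 2, 3], [4, 5]], lambda m: log.append(m))
assert errs == [] and sorted(log) == [1, 2, 3, 4, 5]
```

Expressing the dependency ("batch 2 starts only after batch 1 completes") on the sending side is what lets the remote side run each batch fully in parallel.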
Writing a file to OneFS
• Writes occur via NFS, CIFS, etc. to a single node
• That node coalesces data and initiates transactions
• Optimizing for write performance is hard
  • Lots of variables
  • Each node might have a different load
  • Unusual scenarios, e.g. degraded writes
• Asynchronous Write Engine
  • Build a directed acyclic graph (DAG)
  • Do work as soon as dependencies are satisfied
  • Prioritize and pipeline work for efficiency
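The "do work as soon as dependencies are satisfied" rule is ordinary DAG execution, sketched below with the plan steps named later in this talk. The scheduler and step names are illustrative; the real engine runs steps asynchronously and in parallel rather than in a loop.

```python
def execute_plan(deps, run):
    """Execute a write plan as a DAG: a step runs as soon as all of its
    prerequisites have completed. deps maps step -> set of prerequisites."""
    done, order = set(), []
    pending = dict(deps)
    while pending:
        ready = [s for s, d in pending.items() if d <= done]
        assert ready, "cycle in plan"
        for step in ready:           # these steps could run in parallel
            run(step)
            done.add(step)
            order.append(step)
            del pending[step]
    return order

plan = {
    "compute layout": set(),
    "allocate blocks": {"compute layout"},
    "compute FEC": {"allocate blocks"},
    "write block": {"allocate blocks"},
    "write FEC": {"compute FEC"},
}
order = execute_plan(plan, lambda s: None)
assert order.index("compute layout") < order.index("allocate blocks")
assert order.index("compute FEC") < order.index("write FEC")
```

Note that "write block" and "compute FEC" become ready at the same time, which is exactly the parallelism the engine exploits.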
Writing a file to OneFS
[Diagram: servers reach the cluster over NFS, CIFS, FTP, and HTTP; front-end and back-end networks, each with an optional 2nd switch]
Writing a file to OneFS
• Break the write into regions
• Regions are protection-group aligned
• For each region:
  • Create a layout
  • Use the layout to generate a plan
  • Execute the plan asynchronously
[Diagram: plan DAG: compute layout → allocate blocks → compute FEC → write blocks and write FEC]
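Splitting a write into protection-group-aligned regions is simple arithmetic, sketched below. The protection-group size is a hypothetical parameter here; the point is only that interior region boundaries land on protection-group edges, so each region's layout and plan cover whole protection groups.

```python
def split_into_regions(offset, length, pg_size):
    """Split a write at [offset, offset+length) into regions whose
    interior boundaries fall on protection-group edges."""
    regions, pos, end = [], offset, offset + length
    while pos < end:
        boundary = (pos // pg_size + 1) * pg_size  # next aligned boundary
        stop = min(boundary, end)
        regions.append((pos, stop - pos))
        pos = stop
    return regions

# a 300 KiB write starting 100 KiB into the file, 128 KiB protection groups
regions = split_into_regions(100 * 1024, 300 * 1024, 128 * 1024)
assert sum(size for _, size in regions) == 300 * 1024
# every interior boundary lands on a protection-group edge
assert all((off + size) % (128 * 1024) == 0 for off, size in regions[:-1])
```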
Writing a file to OneFS
• The plan executes and the transaction commits
• Data and parity blocks are now on disks
[Diagram: two inode mirrors, with data and parity blocks spread across three nodes]
Reading a file from OneFS
[Diagram: the initiator node reads data blocks from participants over the back-end network and returns the file to the client]
Handling Failures
• What could go wrong during a single transaction?
  • A block-level I/O request fails
  • A drive goes down
  • A node runs out of space
  • A node disconnects or crashes
• In a distributed system, things are expected to fail.
• Most of our system calls automatically restart.
• Have to be able to gracefully handle all of the above, plus much more!
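"Automatically restarting" system calls can be sketched as a retry wrapper that distinguishes expected, transient failures (a node dropped, a drive went down) from real errors. The wrapper, its parameters, and the error classification below are illustrative assumptions, not OneFS internals.

```python
def restartable(op, is_transient, max_attempts=5):
    """Retry op when it fails for a transient, expected reason;
    re-raise anything else, or the last error if attempts run out."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as e:
            if not is_transient(e) or attempt == max_attempts - 1:
                raise

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("node went away")   # transient failure
    return "ok"

assert restartable(flaky, lambda e: isinstance(e, ConnectionError)) == "ok"
assert calls["n"] == 3   # failed twice, then succeeded
```

Keeping the retry logic inside the system call means callers see a single successful (or definitively failed) operation even while the cluster reshuffles underneath them.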
Handling Failures
• When a node goes “down”:
  • New files will use effective protection levels (if necessary)
  • Affected files will be reconstructed automatically per request.
  • That node’s IP addresses are migrated to another node.
  • Some data is orphaned and later garbage collected.
• When a node “fails”:
  • New files will use effective protection levels (if necessary)
  • Affected files will be repaired automatically across the cluster.
  • AutoBalance will automatically rebalance data.
• We can safely, proactively SmartFail nodes/drives:
  • Reconstruct data without removing the device.
  • If a multiple-component failure occurs, use the original device: minimizes WOR.
SmartConnect
• Client must connect to a single IP address.
• SmartConnect: a DNS server which runs on the cluster
  • Customer delegates a zone to the cluster DNS server
  • SmartConnect responds to DNS queries with only available nodes
  • SmartConnect can also be configured to respond with nodes based on load, connections, throughput, etc.
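The DNS-based balancing above can be sketched as an answer function: filter out unavailable nodes, then pick one by policy. The node fields and policy names here are invented for illustration and are not SmartConnect's actual configuration options.

```python
def smartconnect_answer(nodes, policy="first_available"):
    """Answer a DNS query with one available node's IP address."""
    up = [n for n in nodes if n["up"]]      # never hand out a down node
    if policy == "least_load":
        return min(up, key=lambda n: n["load"])["ip"]
    return up[0]["ip"]

nodes = [
    {"ip": "10.0.0.1", "up": True,  "load": 0.9},
    {"ip": "10.0.0.2", "up": False, "load": 0.1},
    {"ip": "10.0.0.3", "up": True,  "load": 0.2},
]
assert smartconnect_answer(nodes, "least_load") == "10.0.0.3"
assert smartconnect_answer(nodes) == "10.0.0.1"   # skips the down node
```

Because clients resolve the delegated zone on every connection, steering happens at connect time with no client-side software at all, which is consistent with the "no client-side drivers" point earlier.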
SmartConnect
We've got Lego Pieces
• Accelerator Nodes
  • Top half only
  • Adds CPU and memory: no disks or journal
  • Only has Level 1 cache; high single-stream throughput
• Storage Nodes
  • Run both the top and bottom halves
  • In some workloads, running the bottom half only makes sense
• Storage Expansion Nodes
  • Just a dumb extension of a storage node: adds disks
  • Grow capacity without performance
SmartConnect Zones
[Diagram: one cluster serving multiple zones across interfaces ext-1, ext-2, and 10gige-1]
• IT: it.tx.com. Full access, maintenance interface; corporate DNS, no SmartConnect; static (well-known) IPs required.
• Interpreters: gg.tx.com (10.20). Storage nodes; NFS clients, no failover.
• BizDev: bizz.tx.com (10.10). Renamed sub-domain; CIFS clients (static IP).
• Eng: eng.tx.com. Shared subnet; separate sub-domain; NFS failover.
• Finance: fin.tx.com (10.30). VLAN (confidential traffic, isolated); same physical LAN.
• Processing: hpc.tx.com. Dedicated 10 GigE; Accelerator X nodes; NFS failover required.
ISILON CONFIDENTIAL
Initiator Software Block Diagram
[Diagram: front-end network feeds the protocol heads (NFS, CIFS, FTP, NDMP, HTTP, ?), which sit on BAM; BAM uses the Btree, IFM, DFM, BSW, Layout, LIN, STF, and MDS components plus the initiator cache; RBM connects to the back-end network]
Participant Software Block Diagram
[Diagram: RBM on the back-end network drives LBM, DRV, and the journal; below sit the disk subsystem, NVRAM, and the participant cache]
System Software Block Diagram
[Diagram: a storage node runs the full stack: the initiator half (NFS, CIFS, FTP, NDMP, HTTP, iSCSI over BAM, with Btree, IFM, DFM, BSW, Layout, LIN, STF, MDS, and the initiator cache) plus the participant half (RBM, LBM, DRV, journal, NVRAM, participant cache), joined over the InfiniBand back-end network; an accelerator runs only the initiator half]
Too much to talk about…
• Failed Drive Reconstruction
• Distributed Deadlock Detection
• On-the-fly Filesystem Upgrade
• Dynamic Sector Repair
• Globally Coherent Cache
• Snapshots
• Quotas
• Replication
• Bit Error Protection
• Rebalancing Data
• Handling Slow Drives
• Statistics Gathering
• I/O Scheduling
• Network Failover
• Native Windows Concepts (ACLs, SIDs, etc.)