2013 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Cluster Shared Volume

Vladimir Petter Principal Software Design Engineer

Microsoft

Topics

  CSV Requirements and motivation
  CSV Design
  CSV IO operations
  Scale Out File Server
  Developing For CSV

CSV Requirements: Why did we build CSV in Windows Server 2008 R2?

VM consolidation without sacrificing performance while keeping management sane
  We want to consolidate lots of VMs (1000s) in a cluster
  We need to be able to move a VM between nodes
  To access a VHD file, the VM required a local file system

We need a LUN per VM (1000 LUNs) or we need a clustered file system
  There was no in-box solution that would provide shared LUN access
  SMB-based NAS was not considered a viable solution at that time. This has changed in Windows Server 2012; see the Scale Out File Server slides.

[Figure: shared storage holding several VHD files]

CSV Requirements: Characteristics of the Workloads

Target workloads: VM and SQL
  Keep files open for a long time
  Few metadata operations; mostly reads and writes
  When a file is accessed for both read and write, it is opened from one node
  When files are accessed from multiple nodes, read is the dominant operation
    Multiple VMs need to access the same base VHD and can still run on different cluster nodes

Need to support tens of thousands of open files
  VDI scenario with a cluster of up to 64 nodes and hundreds of VMs per node

[Figure: shared storage holding several VHD files]

CSV Requirements: What we decided to build

Design Objectives
  Multiple workloads can access the same LUN
  Build on the current investments: NTFS, SMB, Clustering
  Same path to the file from any node
  Reads and writes need to go directly to the SAN whenever possible ("Direct I/O")
  All other operations should go to NTFS on the node where NTFS is mounted

Performance Objectives
  Direct IO performance should be as fast as NTFS
  If we need to redirect IO over the network, it should be as fast as IO over SMB when the file is opened for write-through

High Availability Objectives
  Detect and hide storage, network and node failures

[Figure: two cluster nodes attached to shared storage. Node 1: VM, CSV, SMB, Disk.sys. Node 2 (Coordinator): VM, CSV, SMB, NTFS, Volume Manager, Disk.sys. Both nodes issue Direct I/O to the shared storage. The cluster exposes the same paths on every node: C:\ClusterStorage\Volume1\foo.vhd and C:\ClusterStorage\Volume1\boo.vhd. Legend: CSV components, Windows components.]

CSV Design Decisions: Local File System on Windows

Logically, every local file system in Windows has
  An upper part that is responsible for implementing the file system interface. Applications depend on this interface, so it is very sensitive to application compatibility.
  A lower part that manages how the file system lays out data on the disk and how it writes data to the disk. Applications are not supposed to care how it works. It interfaces with the volume for block read/write operations.

CSVFS does not introduce a new on-disk structure, nor does it work directly with the on-disk layout, except for one special case: Direct IO. Consequently, CSVFS has only the upper part.

[Figure: file system stack showing FS Upper Part, FS Lower Part, Volume Manager and Disk.sys]

CSV Design Decisions: Remote vs Local File System

It looks similar to a remote file system, and we almost went in the direction of developing it as an extension of the SMB Redirector

We chose to implement it as a local file system because
  It provides better compatibility with existing applications
  It provides a better management story with existing Disk Management tools
  It provides a better backup story with existing backup tools
  Remote applications can access it

A local file system needs a volume to mount on, so we needed a Volume Manager

[Figure: CSV instances layered over NTFS, Volume Manager and Disk.sys]

CSV Design Decisions: CSV Volume Manager

A CSV volume is created on every cluster node. You can access it through a mount point such as C:\ClusterStorage\Volume1 from any cluster node.
The CSV Volume Manager is controlled by the cluster
CSVFS mounts only on volumes created by the CSV Volume Manager
The CSV volume gives CSVFS a lifetime that does not depend on disk connectivity. Even on a node where the disk is not connected, CSVFS is present and accessible.
The CSV volume is the object that you see in the disk management utilities
The CSV volume is what applications find when they enumerate volumes on the machine using Win32 APIs

[Figure: each CSV instance sits on a CSV Volume Manager; one stack also contains NTFS, Volume Manager and Disk.sys]

CSV Design Decisions: Recipe for Developing CSVFS

  Take FastFat from the DDK
  Make it mountable on a CSV volume
  Remove the lower part of FastFat that manages the on-disk structure and replace it with routines that either forward IO to NTFS or perform Direct IO
  Add support for a bunch of calls that FastFat did not support, to get parity with NTFS and ReFS

Secret ingredients:
  Read the File System Internals book by Rajeev Nagar several times
  Heavily instrument the code with tracing so you can learn how it really works

Optional
  If you care about performance, redesign the locking model

Debug it for at least 3 to 5 years.

CSV Design Decisions: CSV Fault Tolerance. Failure

Storage disconnects
CSVFS on one of the nodes observes an IO failure and tells the cluster
On every node, the cluster tells the CSVFS volume to start draining
  In the draining state CSVFS pends new IO and any failing IO
On every node, the cluster tells the CSVFS volume to pause
  CSVFS cancels ongoing IO and waits for completion of all IOs
  Once all IO has completed, CSVFS closes the internal files it has open on NTFS
The cluster observes the disk failure
The cluster tears down the volume stack

[Figure: CSV instances on CSV Volume Managers; the stack with NTFS, Volume Manager and Disk.sys loses its disk]

CSV Design Decisions: CSV Fault Tolerance. Recovery

Applications still have files open on CSVFS and are not aware of the failure
The cluster finds a node where the disk is still connected and mounts NTFS on that node
On every node, the cluster tells CSVFS to reopen its internal handles on NTFS
On every node, the cluster tells CSVFS to resume IO
  CSVFS reissues all paused IO and stops pending new IOs

[Figure: CSV instances on CSV Volume Managers; NTFS, Volume Manager and Disk.sys are mounted on a node that still has disk connectivity]
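
The failure and recovery sequence can be read as a small state machine on each node's CSVFS volume. The following is a minimal sketch of that reading, not the driver's actual logic: the class, method and state names are hypothetical, and cancellation of in-flight IO and the closing and reopening of NTFS handles are left out.

```python
from enum import Enum


class VolumeState(Enum):
    ACTIVE = "active"
    DRAINING = "draining"
    PAUSED = "paused"


class CsvVolumeModel:
    """Toy model of one node's CSVFS volume (hypothetical names)."""

    def __init__(self):
        self.state = VolumeState.ACTIVE
        self.pended = []      # IO held back so the application sees no error
        self.completed = []   # IO that reached storage

    def submit(self, io, storage_ok=True):
        # IO completes only when the volume is active and storage responds.
        if self.state is VolumeState.ACTIVE and storage_ok:
            self.completed.append(io)
            return "completed"
        # New IO on a draining/paused volume, and any failing IO, is pended.
        self.pended.append(io)
        return "pended"

    def start_draining(self):
        self.state = VolumeState.DRAINING

    def pause(self):
        self.state = VolumeState.PAUSED

    def resume(self):
        # Reissue everything that was pended, then stop pending new IO.
        self.state = VolumeState.ACTIVE
        reissued, self.pended = self.pended, []
        self.completed.extend(reissued)
        return reissued
```

In this model an application that submits IO during the outage only observes a delay: the request is returned as pended and completes once `resume()` runs.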

CSV Operation: File Open

[Sequence diagram "Create". Labels: Node 1, Node 2, Application, IO Manager, Csv Fs, MUP/RDR/SMB, SRV/SMB, IO Manager, Csv Flt, NTFS]
  1: CreateFile(FileName)
  1.1: IRP_MJ_CREATE(FileName)
  1.2: Create CLIUSR for the user token()
  1.3: Serialize User Token to a BLOB()
  1.4: Impersonate CLIUSR()
  1.5: IoCreateFile(FileName, EA: BLOB)
  1.6: SMB Create(FileName, EA: BLOB)
  1.7: Authenticate And Impersonate CLIUSR()
  1.8: Validate CLIUSR has access to the share()
  1.9: IoCreateFile(FileName, EA: BLOB)
  1.10: IRP_MJ_CREATE(FileName, EA: BLOB)
  1.11: Validate BLOB's Signature()
  1.12: Recreate Original Token From BLOB()
  1.13: Impersonate the original user()
  1.14: IRP_MJ_CREATE(FileName)
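
Steps 1.3, 1.11 and 1.12 amount to a sign, verify and rebuild round trip for the caller's identity. The sketch below illustrates that idea only. The slides do not say how the blob is encoded or signed, so the JSON encoding, the HMAC-SHA256 signature, the shared key and both function names are assumptions made for illustration.

```python
import hashlib
import hmac
import json

# Assumption: a secret shared by the cluster nodes. The real signing scheme
# is not described in the slides; HMAC-SHA256 is a stand-in.
CLUSTER_KEY = b"hypothetical-cluster-secret"
SIGNATURE_SIZE = hashlib.sha256().digest_size  # 32 bytes


def serialize_token(token: dict) -> bytes:
    """Step 1.3 (sketch): pack the caller's identity and sign it."""
    payload = json.dumps(token, sort_keys=True).encode("utf-8")
    signature = hmac.new(CLUSTER_KEY, payload, hashlib.sha256).digest()
    return signature + payload


def recreate_token(blob: bytes) -> dict:
    """Steps 1.11 and 1.12 (sketch): verify the signature, then rebuild."""
    signature, payload = blob[:SIGNATURE_SIZE], blob[SIGNATURE_SIZE:]
    expected = hmac.new(CLUSTER_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("blob signature mismatch")
    return json.loads(payload.decode("utf-8"))
```

The point of the signature check is that the coordinator only trusts an identity that was produced by another cluster node; a blob altered in transit is rejected before any token is recreated.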

CSV Operations: CSV Uses Oplocks for Cache Coherency

Oplocks have existed in Windows for a long time and are used by SMB clients for cache coherency
Like the SMB client, CSV uses oplocks for cache coherency
CSV also uses oplocks to decide when it is safe to perform Direct IO
  We can perform Direct IO on read if we have a read-containing oplock level (RWH, RW, RH or R)
  We can perform Direct IO on write if we have a write-containing oplock level (RWH or RW)
A write-containing oplock is granted only if the file is opened from a single node
  When another node opens the file, the open operation revokes the write oplock level. The open is held until the client that lost the oplock acknowledges the oplock break.
On files opened from multiple nodes CSV can get a read-containing oplock
  Operations that modify data cause a read oplock break
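
The oplock rules above reduce to two set-membership checks. Here is a minimal sketch of them; the function names are hypothetical, and the level that remains after a break (the held level minus W) is an assumption, since the slides only say that the write level is revoked.

```python
from typing import Optional

READ_LEVELS = {"RWH", "RW", "RH", "R"}   # read-containing oplock levels
WRITE_LEVELS = {"RWH", "RW"}             # write-containing oplock levels


def can_direct_io(oplock: Optional[str], is_write: bool) -> bool:
    """Return True if the held oplock level permits Direct IO."""
    if oplock is None:
        return False
    return oplock in (WRITE_LEVELS if is_write else READ_LEVELS)


def oplock_after_remote_open(oplock: Optional[str]) -> Optional[str]:
    """Model another node opening the file: the write level is revoked.

    Assumption: the remaining level is the old level without W.
    """
    if oplock is None:
        return None
    return oplock.replace("W", "") or None
```

For example, a node holding RWH may write directly; after a second node opens the file the first node is left with RH in this model, so its reads can still go direct while its writes cannot.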

CSV Scenarios: When Is Direct IO Possible?

We understand the on-disk file format
There are no file system filters that might modify the file layout
There are no file system filters that object to Direct IO on the stream
We were able to make sure NTFS will not change the location of the file data on the volume
There are no applications that need to make sure IO is observed by the NTFS stack
We have oplocks (cross-node cache coherency)
  RWH, RH, RW or R for reads
  RWH or RW for writes
We were able to purge the cache on NTFS, so there is no stale cache on NTFS
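
Taken together, the conditions above form a checklist in which any single failure sends the IO down the redirected path. A minimal sketch of such a checklist follows; the `StreamState` fields and the reason strings are hypothetical names for the conditions on the slide, not CSVFS data structures.

```python
from dataclasses import dataclass
from typing import List, Optional

READ_LEVELS = {"RWH", "RW", "RH", "R"}
WRITE_LEVELS = {"RWH", "RW"}


@dataclass
class StreamState:
    """Hypothetical per-stream facts, one field per condition."""
    layout_understood: bool = True
    filter_may_modify_layout: bool = False
    filter_objects_to_direct_io: bool = False
    data_location_pinned: bool = True
    app_requires_ntfs_stack: bool = False
    oplock: Optional[str] = "RWH"
    ntfs_cache_purged: bool = True


def redirect_reasons(s: StreamState, is_write: bool) -> List[str]:
    """List why Direct IO is not possible; an empty list means it is."""
    reasons = []
    if not s.layout_understood:
        reasons.append("unknown on-disk format")
    if s.filter_may_modify_layout:
        reasons.append("filter may modify layout")
    if s.filter_objects_to_direct_io:
        reasons.append("filter objects to Direct IO")
    if not s.data_location_pinned:
        reasons.append("NTFS may move file data")
    if s.app_requires_ntfs_stack:
        reasons.append("IO must be observed by NTFS")
    needed = WRITE_LEVELS if is_write else READ_LEVELS
    if s.oplock not in needed:
        reasons.append("missing oplock")
    if not s.ntfs_cache_purged:
        reasons.append("stale NTFS cache possible")
    return reasons
```

Returning the reasons rather than a bare boolean is a design choice for the sketch: it makes it easy to see which condition forced redirected IO.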

CSV Operations: Unbuffered Write

Data are written to the disk using Direct IO
If the write extends ValidDataLength (VDL), we also update VDL on NTFS

[Sequence diagram "DirectIO. No Buffering." Labels: MDS, DS, Application, Csv Fs, NTFS, CC, Storage. CC is the Windows Cache Manager.]
  1: Write()
  1.1: Write Data()
  1.2: Set Vdl()
  1.3: Update Vdl in Meta-Data Stream()
  1.4: Flush Meta-Data Stream()
  1.5: Write(FUA bit set)
  1.6: Write Data(FUA bit set)
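
The rule that only VDL-extending writes touch NTFS metadata can be shown with a few lines of bookkeeping. This is a minimal sketch with hypothetical names; it tracks only the VDL value and how many metadata updates were needed, and does not model the data transfer, the metadata flush or the FUA write.

```python
class FileModel:
    """Toy model of one file's ValidDataLength (VDL) bookkeeping."""

    def __init__(self):
        self.vdl = 0               # bytes known to hold valid data
        self.metadata_updates = 0  # how often NTFS metadata had to change

    def direct_write(self, offset: int, length: int) -> bool:
        """Apply a Direct IO write; return True if VDL had to be updated."""
        end = offset + length
        # The data itself goes straight to disk (not modeled here).
        if end > self.vdl:
            # The write extends VDL, so NTFS must also record the new value.
            self.vdl = end
            self.metadata_updates += 1
            return True
        return False
```

Writes that stay inside the already valid range return False here, which corresponds to the case where Direct IO needs no metadata round trip at all.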

CSV Operations: Buffered Write

Data are written to the disk using Direct IO
If the write extends ValidDataLength (VDL), we also update VDL on NTFS

[Sequence diagram "Shadow FO: no write-through. Down-level FO has write-through." Labels: MDS, DS, Application, Csv Fs, CC, NTFS, CC, Storage. CC is the Windows Cache Manager.]
  1: Top-Level-Write()
  1.1: CCWrite()
  2: Write Behind()
  2.1: Write Data()
  2.2: Move VDL To Disk()
  2.3: Set Vdl()
  2.4: Update Vdl in Meta-Data Stream()
  2.5: Flush Meta-Data Stream()
  2.6: Write(FUA bit set)
  2.7: Write Data(FUA bit set)

CSV: Scale Out File Server

What is it?
  Store Hyper-V files in shares over the SMB 3.0 protocol (including VM configuration, VHD files, snapshots)

Highlights
  Increases flexibility
  Eases provisioning, management and migration
  Leverages converged network
  Reduces capital and operational expenses

Supporting Features
  SMB Transparent Failover: continuous availability
  SMB Scale-Out: active/active file server clusters
  SMB Direct (SMB over RDMA): low latency, low CPU use
  SMB Multichannel: network throughput and failover
  SMB Encryption: security
  VSS for SMB File Shares: backup and restore
  SMB PowerShell and VMM Support: manageability

[Figure: SQL Server, IIS and VDI Desktop workloads connected to file servers backed by shared storage]

SMB Transparent Failover

Failover is transparent to the server application
  Zero downtime; small IO delay during failover
Supports planned and unplanned failovers
  Hardware/software maintenance
  Hardware/software failures
  Load rebalancing
Resilient for both file and directory operations
Requires:
  File servers configured as a Windows Failover Cluster
  Windows Server 2012 on both the servers running the application and the file server cluster nodes
  Shares enabled for "continuous availability" (default configuration for clustered file shares)
Works for both classic file server clusters (cluster disks) and scale-out file server clusters (CSV)

[Figure: Hyper-V host connected to \\fs\share on a file server cluster]
  1: Normal operation
  2: Failover share: connections and handles lost, temporary stall of IO
  3: Connections and handles auto-recovered; application IO continues with no errors

SMB Scale-Out

Targeted for server app storage
  Example: Hyper-V and SQL Server
  Increase available bandwidth by adding nodes
  Leverages Cluster Shared Volumes (CSV)

Key capabilities:
  Active/Active file shares
  Fault tolerance with zero downtime
  Fast failure recovery
  CHKDSK with zero downtime
  Support for app consistent snapshots
  Support for RDMA enabled networks
  Optimization for server apps
  Simple management

[Figure: Hyper-V Cluster (up to 64 nodes) connected over the datacenter network (Ethernet, InfiniBand or combination) to a File Server Cluster (up to 8 nodes)]

CSV Component Overview

CSV Filter Driver (CSVFLT.sys)
  Mounted on the Metadata Coordinator Node
  Blocks access to the NTFS file system
  Coordinates metadata operations over SMB
  Filter altitude: 404800

CSV Proxy File System (CSVFS.sys)
  Proxy file system on top of an underlying NTFS file system
  Mounted on every node, including the Coordinator
  Performs Direct I/O to the physical disk

CSV Volume Manager (CSVvBUS.sys)
  Responsible for CSV pseudo/virtual volumes
  Block-level IO redirector

[Figure: two cluster nodes attached to shared storage. Node 1: VM, CSVFS.sys, CSVvBUS.sys, SMB, Disk.sys. Node 2 (Coordinator): VM, CSVFS.sys, CSVvBUS.sys, SMB, CSVFLT.sys, NTFS, Volume Manager, Disk.sys. Direct I/O goes to the shared storage. Legend: CSV components, Windows components.]

CSV: Filtering Options

Different options available to filter drivers

File system filters:
  Attach on top of CSVFS.sys
  Attach to NTFS
  Attach to SMB

Volume filters:
  Filters attach to CSVvBus.sys
  Filters attach to Volmgr.sys

Attaching legacy filters to the NTFS stack
  CSV safeguards with Redirect IO mode

If attaching to MUP, ignore CSV traffic to the coordinator node
  Extended create parameters

[Figure: the two-node component diagram (Node 1 and Node 2 (Coordinator), with CSVFS.sys, CSVvBUS.sys, CSVFLT.sys, NTFS, Volume Manager, Disk.sys, SMB and shared storage) annotated with third-party attachment points: File System Filters, CSV Volume Filters, MUP File System Filters and Volume Filters. Legend: CSV components, Windows components, 3rd party.]

CSV: Filtering Options

Different options available to filter drivers

[Figure: filter attachment points on the CSV stack, marked 1 to 5. Legend entries: NTFS Aware File System Filters; CSVFS Aware FS Filters; CSVFS Aware Volume Filter; MUP Aware Volume Filter; Traditional Volume Filters.]

Considerations for Filter Drivers Attached to NTFS

NTFS is hidden and locked for user mode access
  Mount Manager will not assign a volume GUID
  Use Disk Number and Partition Number instead to identify the instance of your filter on a stack

If you are an anti-virus filter, avoid attaching to NTFS to prevent double scanning
  Use IOCTL_DISK_GET_CLUSTER_INFO to detect the CSV metadata stack
  Avoid attaching to the NTFS stack if the output has the IsCsv bit set

Filters should not take a long time to process IO
  IO is subject to a 60 second timeout from RDR
  IO is subject to a 2 minute timeout from CsvFs on volume state transitions
  IO cannot be pended for an indefinite time by the filter on NTFS
  Violation: extra pause/resume transitions of the CSV volume, volume invalidation, failure of workloads

Considerations for Filter Drivers Attached to NTFS

Do not modify file sizes while attached to NTFS (allocation size, file size or valid data length while below NTFS)
  Violation: volume corruption and file invalidation by CsvFs
Avoid causing files to get unpinned or allocated blocks to get moved
  Violation: volume corruption and file invalidation by CsvFs
Avoid building memory sections
  Violation: stale cache and stream corruption

Considerations for Filter Drivers: Unsupported CSV Functionalities

Enabling compression on a directory
Enabling NTFS file level encryption
  BitLocker on CSVFS is supported
Transactions on CSVFS
Name grafting reparse points on CSVFS volumes (mount points and symbolic links)
Defragmenting files while in CSV Direct IO mode
Direct IO on a sparse or compressed file; Redirected I/O is used instead
DASD IO in File System Redirect-IO mode