2013 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.
Cluster Shared Volume
Vladimir Petter Principal Software Design Engineer
Microsoft
Topics
CSV Requirements and motivation
CSV Design
CSV IO operations
Scale Out File Server
Developing For CSV
CSV Requirements. Why did we build CSV in Windows 2008 R2?
VM consolidation without sacrificing performance while keeping management sane
  We want to consolidate lots of VMs (1000s) in a cluster
  We need to be able to move a VM between nodes
  To access a VHD file, the VM required a local file system
We need a LUN per VM (1000 LUNs) or we need a Clustered File System
  No in-box solution provided shared LUN access.
  SMB-based NAS was not considered a viable solution at that time. This has changed in Windows 2012. See slides 18 - 20.
[Figure: shared storage holding multiple VHD files]
CSV Requirements. Characteristics of the Workloads
Target workloads: VM and SQL
  Keep files open for a long time
  Few metadata operations. Mostly reads and writes
  When a file is accessed for both read and write, it is opened from one node
  When files are accessed from multiple nodes, read is the dominant operation
    Multiple VMs need to access the same base VHD and can still run on different cluster nodes.
  Need to support tens of thousands of open files
    VDI scenario with up to a 64-node cluster and hundreds of VMs per node
CSV Requirements. What we've decided to build
Design objectives:
  Multiple workloads can access the same LUN
  Build upon the current investments: NTFS, SMB, Clustering
  Same path to the file from any node
  Reads and writes need to go directly to the SAN whenever possible ("Direct I/O")
  All other operations should go to NTFS on the node where NTFS is mounted.
Performance objectives:
  Direct IO performance should be as fast as NTFS
  If we need to redirect IO over the network, it should be as fast as IO over SMB when the file is opened for write-through.
High availability objectives:
  Detect and hide storage, network and node failures.
[Figure: two-node cluster over shared storage. Node 1 and Node 2 (Coordinator) each run a VM above CSV, SMB and Disk.sys; NTFS and the Volume Manager appear on the coordinator. Direct I/O arrows lead to shared storage. Example paths: C:\ClusterStorage\Volume1\foo.vhd and C:\ClusterStorage\Volume1\boo.vhd. Legend: CSV components, Windows components.]
CSV Design Decisions
Local File System on Windows
Logically, every local file system in Windows has:
  An upper part that is responsible for implementing the File System interface. Applications depend on this interface, so it is very sensitive to application compatibility.
  A lower part that manages how the File System lays out data on the disk and how it writes data to the disk. Applications are not supposed to care how it works. It interfaces with the volume for block read/write operations.
CSVFS does not introduce a new on-disk structure, nor does it work directly with the on-disk layout except in one special case: Direct IO. Consequently, CSVFS has only the upper part.
[Figure: a local file system split into FS Upper Part and FS Lower Part, shown with Volume Manager and Disk.sys]
CSV Design Decisions
Remote vs Local File System
It looks similar to a remote file system, and we almost went in the direction of developing it as an extension of the SMB Redirector
We chose to implement it as a local file system because:
  It provides better compatibility with existing applications
  It provides a better management story with existing Disk Management tools
  It provides a better backup story with existing backup tools
  Remote applications can access it
A local file system needs a volume to mount on, so we needed a Volume Manager
CSV Design Decisions
CSV Volume Manager
A CSV volume is created on every cluster node. You can access it through a mountpoint like C:\ClusterStorage\Volume1 from any cluster node.
The CSV Volume Manager is controlled by the cluster
CSVFS mounts only on volumes created by the CSV Volume Manager
The CSV Volume gives CSVFS a lifetime that does not depend on disk connectivity. Even on a node where the disk is not connected, CSVFS is present and accessible.
The CSV Volume is the object that you see in the disk management utilities.
The CSV Volume is what applications find when they enumerate volumes on the machine using Win32 APIs
CSV Design Decisions
Recipe: how to develop CSVFS
  Take FastFat from the DDK
  Make it mountable on a CSV Volume
  Remove the lower part of FastFat that manages the on-disk structure and replace it with routines that either forward IO to NTFS or perform Direct IO.
  Add support for a bunch of calls that FastFat did not support, to get parity with NTFS and ReFS
Secret ingredients:
  Read the File System Internals book by Rajeev Nagar several times
  Heavily instrument the code with tracing so you can learn how it really works
Optional:
  If you care about performance, redesign the locking model
Debug it for at least 3 to 5 years.
CSV Design Decisions
CSV Fault Tolerance. Failure
Storage disconnects
CSVFS on one of the nodes observes an IO failure and tells the cluster
On every node, the cluster tells the CSVFS volume to start draining
  In the draining state CSVFS pends new IO and any failing IO
On every node, the cluster tells the CSVFS volume to pause
  CSVFS cancels ongoing IO and waits for completion of all IOs
  Once all IO has completed, CSVFS closes its internal files opened on NTFS
The cluster will observe the disk failure
The cluster will tear down the volume stack
CSV Design Decisions
CSV Fault Tolerance. Recovery
Applications still have files opened on CSVFS and are not aware of the failure
The cluster finds a node where the disk is still connected and mounts NTFS on that node
On every node, the cluster tells CSVFS to reopen its internal handles on NTFS
On every node, the cluster tells CSVFS to resume IO. CSVFS reissues all paused IO and stops pending any new IOs
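The failure and recovery sequence on the two slides above can be read as a small per-node state machine. The following is a toy sketch only, not CSVFS code: the class name, method names and state names (`CsvVolumeModel`, `start_draining`, `pause`, `reopen_handles`, `resume`) are hypothetical, and IO is modeled as plain strings.

```python
from enum import Enum, auto


class VolumeState(Enum):
    ACTIVE = auto()
    DRAINING = auto()
    PAUSED = auto()


class CsvVolumeModel:
    """Toy model of one node's CSVFS volume during failure and recovery."""

    def __init__(self):
        self.state = VolumeState.ACTIVE
        self.pended = []        # IOs held back while draining or paused
        self.completed = []     # IOs that reached storage
        self.ntfs_handles_open = True

    def submit(self, io):
        # While ACTIVE the IO completes; otherwise it is pended, so the
        # application does not observe the failure.
        if self.state is VolumeState.ACTIVE:
            self.completed.append(io)
        else:
            self.pended.append(io)

    def start_draining(self, failed_ios=()):
        # Cluster request: pend new IO and any IO that has failed.
        self.state = VolumeState.DRAINING
        self.pended.extend(failed_ios)

    def pause(self):
        # Cluster request: once in-flight IO is done, close internal NTFS files.
        self.state = VolumeState.PAUSED
        self.ntfs_handles_open = False

    def reopen_handles(self):
        # Cluster request after NTFS is mounted on a node that still sees the disk.
        self.ntfs_handles_open = True

    def resume(self):
        # Cluster request: reissue everything that was pended and stop pending.
        if not self.ntfs_handles_open:
            raise RuntimeError("resume requested before handles were reopened")
        self.state = VolumeState.ACTIVE
        reissued, self.pended = self.pended, []
        self.completed.extend(reissued)
        return reissued
```

In this model, an IO that fails during the outage and an IO submitted while the volume is draining are both reissued on resume, which is how the application stays unaware of the failure.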
CSV Operation. File Open
[Sequence diagram "Create" across Node 1 and Node 2. Participants: Application, IO Manager (one per node), CsvFs, MUP/RDR/SMB, SRV/SMB, CsvFlt, NTFS]
1: CreateFile(FileName)
1.1: IRP_MJ_CREATE(FileName)
1.2: Create CLIUSR for the user token()
1.3: Serialize User Token to a BLOB()
1.4: Impersonate CLIUSR()
1.5: IoCreateFile(FileName, EA: BLOB)
1.6: SMB Create(FileName, EA: BLOB)
1.7: Authenticate And Impersonate CLIUSR()
1.8: Validate CLIUSR has access to the share()
1.9: IoCreateFile(FileName, EA: BLOB)
1.10: IRP_MJ_CREATE(FileName, EA: BLOB)
1.11: Validate BLOB's Signature()
1.12: Recreate Original Token From BLOB()
1.13: Impersonate the original user()
1.14: IRP_MJ_CREATE(FileName)
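Steps 1.3, 1.11 and 1.12 describe a signed token that travels with the create request and is validated on the other node. The slides do not say how the BLOB is encoded or signed, so the sketch below is an illustration under loud assumptions: it uses JSON for serialization and HMAC-SHA256 with a shared key, and the names `CLUSTER_KEY`, `serialize_token` and `recreate_token` are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; the real signing scheme is not described in the slides.
CLUSTER_KEY = b"example-cluster-key"
SIGNATURE_SIZE = hashlib.sha256().digest_size  # 32 bytes


def serialize_token(token: dict, key: bytes = CLUSTER_KEY) -> bytes:
    """Model of step 1.3: turn a user token into a signed BLOB."""
    payload = json.dumps(token, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return signature + payload


def recreate_token(blob: bytes, key: bytes = CLUSTER_KEY) -> dict:
    """Model of steps 1.11 and 1.12: validate the signature, then rebuild the token."""
    signature, payload = blob[:SIGNATURE_SIZE], blob[SIGNATURE_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise PermissionError("BLOB signature is not valid")
    return json.loads(payload.decode("utf-8"))
```

The point of the pattern is that the SMB hop runs as CLIUSR, while the original caller's identity is carried separately and only trusted after the signature check.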
CSV Operations
CSV Uses Oplocks for Cache Coherency
Oplocks have existed in Windows for a long time and are used by SMB clients for cache coherency
Like the SMB client, CSV uses oplocks for cache coherency
CSV also uses oplocks to decide when it is safe to perform Direct IO.
  We can perform Direct IO on read if we have a read-containing oplock level (RWH, RW, RH or R)
  We can perform Direct IO on write if we have a write-containing oplock level (RWH or RW)
A write-containing oplock is granted only if the file is opened from a single node
  When another node opens the file, the open operation causes the write oplock level to be revoked. The open is held until the client that lost the oplock acknowledges the oplock break
On files opened from multiple nodes CSV can get a read-containing oplock
  Operations that modify data cause a read oplock break
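The oplock rules above can be sketched as a small model. This is an illustration, not the NTFS oplock implementation: the `FileOplocks` class is hypothetical, and two details are assumptions made for the sketch, namely that a sole opener receives RWH, and that a broken write oplock is downgraded to RH.

```python
READ_LEVELS = {"RWH", "RW", "RH", "R"}
WRITE_LEVELS = {"RWH", "RW"}


def can_direct_read(level: str) -> bool:
    return level in READ_LEVELS


def can_direct_write(level: str) -> bool:
    return level in WRITE_LEVELS


class FileOplocks:
    """Tracks the oplock level each node holds on one file."""

    def __init__(self):
        self.levels = {}  # node name -> oplock level ("" means none)

    def open(self, node: str) -> str:
        others = [n for n in self.levels if n != node]
        for other in others:
            # A second opener revokes any write-containing level.
            if self.levels[other] in WRITE_LEVELS:
                self.levels[other] = "RH"  # assumption: downgrade to RH
        # Write-containing level only when this node is the sole opener.
        self.levels[node] = "RH" if others else "RWH"
        return self.levels[node]

    def modify(self, node: str) -> None:
        # On a file opened from multiple nodes, a data-modifying
        # operation breaks the read oplocks.
        if len(self.levels) > 1:
            for n in self.levels:
                self.levels[n] = ""
```

With one opener, both Direct IO reads and writes are allowed; after a second node opens the file, only Direct IO reads remain until some node modifies the data.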
CSV Scenarios. When is Direct IO Possible?
We understand the on-disk file format
There are no File System filters that might modify the file layout
There are no File System filters that object to Direct IO on the stream
We were able to make sure NTFS will not change the location of the file data on the volume
No applications need to make sure IO is observed by the NTFS stack
We have oplocks (cross-node cache coherency):
  RWH, RH, RW or R for reads
  RWH or RW for writes
We were able to purge the cache on NTFS, so there is no stale cache on NTFS.
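Taken together, the conditions above behave like a single predicate that must hold before an IO is sent directly to the disk. A minimal sketch follows, assuming each condition can be reduced to a boolean flag; the `StreamState` fields and the `direct_io_possible` function are hypothetical names, not CSVFS structures.

```python
from dataclasses import dataclass

READ_OPLOCKS = {"RWH", "RW", "RH", "R"}
WRITE_OPLOCKS = {"RWH", "RW"}


@dataclass
class StreamState:
    on_disk_format_understood: bool = True
    filter_may_modify_layout: bool = False
    filter_objects_to_direct_io: bool = False
    data_location_pinned: bool = True      # NTFS will not move the file data
    app_needs_ntfs_to_observe_io: bool = False
    ntfs_cache_purged: bool = True
    oplock: str = "RWH"


def direct_io_possible(stream: StreamState, is_write: bool) -> bool:
    """Return True only when every condition from the list holds."""
    required = WRITE_OPLOCKS if is_write else READ_OPLOCKS
    return (
        stream.on_disk_format_understood
        and not stream.filter_may_modify_layout
        and not stream.filter_objects_to_direct_io
        and stream.data_location_pinned
        and not stream.app_needs_ntfs_to_observe_io
        and stream.ntfs_cache_purged
        and stream.oplock in required
    )
```

When the predicate is false, the IO would be redirected to NTFS on the coordinator node instead of going directly to the disk.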
CSV Operations
Unbuffered Write
Data are written to the disk using Direct IO
If the write extends ValidDataLength (VDL), we also update VDL on NTFS
[Sequence diagram "DirectIO. No Buffering." Participants: Application, CsvFs, NTFS, CC, Storage (labeled MDS and DS)]
1: Write()
1.1: Write Data()
1.2: Set Vdl()
1.3: Update Vdl in Meta-Data Stream()
1.4: Flush Meta-Data Stream()
1.5: Write(FUA bit set)
1.6: Write Data(FUA bit set)
CC – Windows Cache Manager
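The unbuffered write sequence above has one conditional step: the VDL update on NTFS happens only when the write extends the valid data length. A rough sketch of that logic, using a `bytearray` as the disk and a hypothetical `NtfsModel` class in which the metadata flush is reduced to a counter:

```python
class NtfsModel:
    """Stand-in for NTFS on the coordinator; tracks VDL only."""

    def __init__(self):
        self.vdl = 0
        self.metadata_flushes = 0

    def set_vdl(self, new_vdl: int) -> None:
        # Models steps 1.2 to 1.6: update VDL in the metadata stream and
        # flush it to storage.
        self.vdl = new_vdl
        self.metadata_flushes += 1


def unbuffered_write(disk: bytearray, ntfs: NtfsModel, offset: int, data: bytes) -> int:
    end = offset + len(data)
    if offset < 0 or end > len(disk):
        raise ValueError("write is outside the preallocated disk")
    # Step 1.1: data goes straight to the disk (Direct IO).
    disk[offset:end] = data
    # VDL is touched only when the write extends it.
    if end > ntfs.vdl:
        ntfs.set_vdl(end)
    return end
```

Writes that stay inside the current VDL never involve NTFS in this model, which matches the goal of keeping the common read/write path on Direct IO.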
CSV Operations
Buffered Write
Data are written to the disk using Direct IO
If the write extends ValidDataLength (VDL), we also update VDL on NTFS
[Sequence diagram "Shadow FO: no write-through. Down-level FO has write-through." Participants: Application, CsvFs, CC (on CsvFs), NTFS, CC (on NTFS), Storage (labeled MDS and DS)]
1: Top-Level-Write()
1.1: CCWrite()
2: Write Behind()
2.1: Write Data()
2.2: Move VDL To Disk()
2.3: Set Vdl()
2.4: Update Vdl in Meta-Data Stream()
2.5: Flush Meta-Data Stream()
2.6: Write(FUA bit set)
2.7: Write Data(FUA bit set)
CC – Windows Cache Manager
CSV – Scale Out File Server
What is it?
  Store Hyper-V files in shares over the SMB 3.0 protocol (including VM configuration, VHD files, snapshots)
Highlights
  Increases flexibility
  Eases provisioning, management and migration
  Leverages converged network
  Reduces capital and operational expenses
Supporting features
  SMB Transparent Failover: continuous availability
  SMB Scale-Out: Active/Active file server clusters
  SMB Direct (SMB over RDMA): low latency, low CPU use
  SMB Multichannel: network throughput and failover
  SMB Encryption: security
  VSS for SMB File Shares: backup and restore
  SMB PowerShell and VMM Support: manageability
[Figure: SQL Server, IIS and VDI Desktop workloads accessing two File Server nodes backed by Shared Storage]
SMB Transparent Failover
Failover is transparent to the server application
  Zero downtime: small IO delay during failover
Supports planned and unplanned failovers
  Hardware/software maintenance
  Hardware/software failures
  Load rebalancing
Resilient for both file and directory operations
Requires:
  File servers configured as a Windows Failover Cluster
  Windows Server 2012 on both the servers running the application and the file server cluster nodes
  Shares enabled for "continuous availability" (default configuration for clustered file shares)
Works for both classic file server clusters (cluster disks) and scale-out file server clusters (CSV)
[Figure: Hyper-V host connected to \\fs\share on a File Server Cluster, in three numbered steps. 1: Normal operation. 2: Failover of the share; connections and handles are lost, with a temporary stall of IO. 3: Connections and handles are auto-recovered; application IO continues with no errors.]
SMB Scale-Out
Targeted for server app storage
  Example: Hyper-V and SQL Server
  Increase available bandwidth by adding nodes
  Leverages Cluster Shared Volumes (CSV)
Key capabilities:
  Active/Active file shares
  Fault tolerance with zero downtime
  Fast failure recovery
  CHKDSK with zero downtime
  Support for app-consistent snapshots
  Support for RDMA-enabled networks
  Optimization for server apps
  Simple management
[Figure: Hyper-V Cluster (up to 64 nodes) connected to a File Server Cluster (up to 8 nodes) over the Datacenter Network (Ethernet, InfiniBand or combination)]
CSV Component Overview
CSV Filter Driver (CSVFLT.sys)
  Mounted on the metadata coordinator node
  Blocks access to the NTFS file system
  Coordinates metadata operations over SMB
  Filter altitude: 404800
CSV Proxy File System (CSVFS.sys)
  Proxy file system on top of an underlying NTFS file system
  Mounted on every node, including the coordinator
  Performs Direct I/O to the physical disk.
CSV Volume Manager (CSVvBUS.sys)
  Responsible for CSV pseudo/virtual volumes
  Block-level IO redirector
[Figure: component stack on Node 1 and Node 2 (Coordinator). Labels: VM, SMB, CSVFS.sys, CSVvBUS.sys, CSVFLT.sys, NTFS, Volume Manager, Disk.sys, Direct I/O, Shared Storage. Legend: CSV components, Windows components.]
CSV – Filtering Options
Different options available to filter drivers
File system filters:
  Attach on top of CSVFS.sys
  Attach to NTFS
  Attach to SMB
Volume filters:
  Filters attach to CSVvBus.sys
  Filters attach to Volmgr.sys
Attaching legacy filters to the NTFS stack
  CSV safeguards with Redirect IO mode
If attaching to MUP, ignore CSV traffic to the coordinator node
  Extended create parameters
[Figure: the component stack on Node 1 and Node 2 (Coordinator) annotated with third-party filter attach points. Labels: File System Filters, CSV Volume Filters, MUP File System Filters, Volume Filters, 3rd Party. Legend: CSV components, Windows components.]
CSV – Filtering Options
Different options available to filter drivers
[Figure: attach points numbered 1 to 5 in the stack. Labels: NTFS Aware File System Filters, CSVFS Aware FS Filters, CSVFS Aware Volume Filter, MUP Aware Volume Filter, Traditional Volume Filters.]
Considerations for Filter Drivers Attached to NTFS
NTFS is hidden and locked for user-mode access
  Mount Manager will not assign a volume GUID
  Use the Disk Number and Partition Number instead to identify the instance of your filter on a stack
If the filter is an anti-virus, avoid attaching to NTFS to prevent double scanning
  Use IOCTL_DISK_GET_CLUSTER_INFO to detect the CSV metadata stack
  Avoid attaching to the NTFS stack if the output has the IsCsv bit set
Filters should not take a long time to process IO
  IO will be subject to a 60-second timeout from RDR
  IO will be subject to a 2-minute timeout from CsvFs on volume state transitions
  IO cannot be pended indefinitely by the filter on NTFS
  Violation results in:
    CSV volume in extra pause/resume transitions
    Volume invalidation
    Failure of workloads
Considerations for Filter Drivers Attached to NTFS
Do not modify file sizes while attached to NTFS (allocation size, file size or valid data length while below NTFS).
  Violation: volume corruption and file invalidation by CsvFs.
Avoid causing files to get unpinned or causing allocated blocks to get moved.
  Violation: volume corruption and file invalidation by CsvFs.
Avoid building memory sections.
  Violation: stale cache and stream corruption.
Considerations for Filter Drivers
Unsupported CSV Functionalities
Enabling compression on a directory
Enabling NTFS file-level encryption
  BitLocker on CSVFS is supported
Transactions on CSVFS
Name-grafting reparse points on CSVFS volumes (mountpoints and symbolic links)
Defragmenting files while in CSV Direct IO mode
Direct IO on a sparse or compressed file; Redirected I/O is used instead
DASD IO in File System Redirect-IO mode