
Page 1: MSCS Clustering Implementation

MSCS Clustering Implementation

Mylex eXtremeRAID 1100

PCI-to-Ultra2 SCSI RAID Controllers

Page 2: MSCS Clustering Implementation

Clustering: Basics

Page 3: MSCS Clustering Implementation

What Are Clusters?

• Group of independent systems that

– Function as a single system

– Appear to users as a single system

– And are managed as a single system

• Clusters are “virtual servers”

Page 4: MSCS Clustering Implementation

Why Clusters?

• Clusters Improve System Availability

– This is the primary value in Wolfpack-I clusters

• Clusters Enable Application Scaling

• Clusters Simplify System Management

• Clusters (with Intel servers) Are Cheap

Page 5: MSCS Clustering Implementation

System Availability

• Clusters Improve System Availability

– When a networked server fails, the service it provided is down

– When a clustered server fails, the service it provided "fails over" and downtime is avoided

[Diagram: networked servers (separate Mail and Internet servers) vs. clustered servers (Mail & Internet fail over between nodes)]

Page 6: MSCS Clustering Implementation

Application Scaling

• Clusters Enable Application Scaling

– With networked SMP servers, application scaling is limited to a single server

– With clusters, applications scale across multiple SMP servers (typically up to 16 servers)

Page 7: MSCS Clustering Implementation

Simple Systems Management

• Clusters Simplify System Management

– Clusters present a Single System Image; the cluster looks like a single server to management applications

– Hence, clusters reduce system management costs

[Diagram: three management domains vs. one management domain]

Page 8: MSCS Clustering Implementation

Inexpensive

• Clusters (with Intel servers) Are Cheap

– Essentially no additional hardware cost -- clusters use readily available, high-volume server hardware

– Microsoft charges an extra $3K per node

• Windows NT Server: $1,000
• Windows NT Server, Enterprise Edition: $4,000

Note: Proprietary Unix cluster software costs $10K to $25K per node.

Page 9: MSCS Clustering Implementation

An Analogy to RAID

• RAID Makes Disks Fault Tolerant

– Clusters make servers fault tolerant

• RAID Increases I/O Performance

– Clusters increase compute performance

• RAID Makes Disks Easier to Manage

– Clusters make servers easier to manage

Page 10: MSCS Clustering Implementation

Two Flavors of Clusters

• High Availability Clusters

• Microsoft’s Wolfpack 1
• Compaq’s Recovery Server

• Load Balancing Clusters (a.k.a. Parallel Application Clusters)

• Microsoft’s Wolfpack 2
• Digital’s VAXClusters

Note: Load balancing clusters are a superset of high availability clusters.

Page 11: MSCS Clustering Implementation

High Availability Clusters

• Two node clusters (node = server)

• During normal operations, both servers do useful work

• Failover

– When a node fails, applications fail over to the surviving node, which assumes the workload of both nodes

[Diagram: Mail on one node and Web on the other; after failover, one node runs Mail & Web]

Page 12: MSCS Clustering Implementation

High Availability Clusters (Contd.)

• Failback

– When the failed node is returned to service, the applications fail back

[Diagram: after failback, Mail and Web run on their original nodes]

Page 13: MSCS Clustering Implementation

Load Balancing Clusters

• Multi-node clusters (two or more nodes)

• Load balancing clusters typically run a single application, e.g. database, distributed across all nodes

• Cluster capacity is increased by adding nodes (but like SMP servers, scaling is less than linear)

[Diagram: adding a node increases cluster throughput from 3,000 TPM to 3,600 TPM]

Page 14: MSCS Clustering Implementation

Load Balancing Clusters (Contd.)

• Cluster rebalances the workload when a node dies

• If different apps are running on each server, they fail over to the least busy server or as directed by predefined failover policies

Page 15: MSCS Clustering Implementation

Two Cluster Models

• “Shared Nothing” Model

– Microsoft’s Wolfpack Cluster

• “Shared Disk” Model

– VAXClusters

Page 16: MSCS Clustering Implementation

“Shared Nothing” Model

• At any moment in time, each disk is owned and addressable by only one server

• “Shared nothing” terminology is confusing

• Access to disks is shared -- on the same bus
• But at any moment in time, disks are not shared

Page 17: MSCS Clustering Implementation

“Shared Nothing” Model (Contd.)

• When a server fails, the disks that it owns "fail over" to the surviving server, transparently to the clients

Page 18: MSCS Clustering Implementation

“Shared Disk” Model

• Disks are not owned by servers but shared by all servers

• At any moment in time, any server can access any disk

• Distributed Lock Manager arbitrates disk access so apps on different servers don’t step on one another (corrupt data)

Page 19: MSCS Clustering Implementation

Cluster Interconnect

• This is about how servers are tied together and how disks are physically connected to the cluster

• Clustered servers always have a client network interconnect, typically Ethernet, to talk to users

• And at least one cluster interconnect to talk to other nodes and to disks

[Diagram: two servers with HBAs share a RAID array over the cluster interconnect and serve clients over the client network]

Page 20: MSCS Clustering Implementation

Cluster Interconnect (Contd.)

• Or They Can Have Two Cluster Interconnects

– One for nodes to talk to each other -- “Heartbeat Interconnect”

• Typically Ethernet

– And one for nodes to talk to disks -- “Shared Disk Interconnect”

• Typically SCSI or Fibre Channel

[Diagram: NICs connect the nodes over the cluster (heartbeat) interconnect; HBAs connect them to the RAID array over the shared disk interconnect]

Page 21: MSCS Clustering Implementation

Microsoft Cluster Server (MSCS)

Wolfpack

Page 22: MSCS Clustering Implementation

Clusters Are Not New

• Clusters Have been Around Since 1985

• Most UNIX Systems are Clustered

• What’s New is Microsoft Clusters

– Code named “Wolfpack”

– Named Microsoft Cluster Server (MSCS)

• Software that provides clustering

– MSCS is part of Windows NT Server, Enterprise Edition 4.0

Page 23: MSCS Clustering Implementation

Microsoft Cluster Rollout

• Wolfpack-I

– In Windows NT Server, Enterprise Edition 4.0 (NT/E 4.0) [also includes Transaction Server and reliable message queuing]

– Two node “failover cluster”

– Shipped October, 1997

• Wolfpack-II

– In (or after) Windows 2000 Advanced Server

– Borrows components from the more robust Tandem and Digital cluster technology (Compaq technology sharing)

– “N” node (probably up to 16) “load balancing cluster”

– Beta in 1998, shipping in 1999?

Page 24: MSCS Clustering Implementation

MSCS (NT/E, 4.0) Overview

• Two Node “Failover” Cluster

• “Shared Nothing” Model

– At any moment in time, each disk is owned and addressable by only one server

• Two Cluster Interconnects

– “Heartbeat” cluster interconnect

• Ethernet

– Shared disk interconnect

• SCSI (any flavor)
• Fibre Channel (SCSI protocol over Fibre Channel)

• Each Node Has a “Private System Disk”

– Boot disk

Page 25: MSCS Clustering Implementation

MSCS (NT/E, 4.0) Topologies

• Host-based (PCI) RAID Arrays

• External RAID Arrays

Page 26: MSCS Clustering Implementation

NT Cluster With Host-Based RAID Array

• Each node has:

– Ethernet NIC -- Heartbeat

– Private system disk (generally on an HBA)

– PCI-based RAID controller -- SCSI or Fibre

• Nodes share access to data disks but do not share data

[Diagram: two nodes with NICs on the "heartbeat" interconnect and PCI RAID controllers plus HBAs on the shared disk interconnect to the RAID array]

Page 27: MSCS Clustering Implementation

NT Cluster With External RAID Array

• Each node has:

– Ethernet NIC -- Heartbeat

– Multi-channel HBAs connect the boot disk and the external array

• Shared external RAID controller on the SCSI or FC Bus -- Mylex’s DAC-SX, DAC-FL, DAC-FF products

[Diagram: two nodes with NICs on the "heartbeat" interconnect and HBAs on the shared disk interconnect to the external RAID array]

Page 28: MSCS Clustering Implementation

Cluster Interconnect and Heartbeats

• Cluster Interconnect
– Private Ethernet between the nodes

– Used to transmit "I'm alive" heartbeat messages

• Heartbeat Messages
– When a node stops getting heartbeats, it assumes the other node has died and initiates failover

– In some failure modes both nodes stop getting heartbeats (a NIC dies or someone trips over the cluster cable)

• Both nodes are still alive
• But each thinks the other is dead
• Split-brain syndrome
• Both nodes initiate failover
• Who wins?
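
To make the heartbeat mechanism concrete, below is a minimal sketch (in C) of how a node might decide its partner has died: it counts consecutive missed heartbeat intervals and triggers failover after a threshold. The interval, threshold, and function names are illustrative assumptions, not MSCS internals.

#include <stdbool.h>
#include <stdio.h>

#define HEARTBEAT_INTERVAL_MS 1000   /* assumed: partner sends "I'm alive" once per second */
#define MISSED_LIMIT          3      /* assumed: declare the partner dead after 3 silent intervals */

/* Called once per heartbeat interval with whether a heartbeat arrived.
 * Returns true when failover should be initiated. */
static bool partner_presumed_dead(bool heartbeat_received)
{
    static int missed = 0;

    if (heartbeat_received) {
        missed = 0;                  /* partner is alive; reset the counter */
        return false;
    }
    if (++missed >= MISSED_LIMIT)    /* too many silent intervals */
        return true;                 /* assume the partner died; start failover */
    return false;
}

int main(void)
{
    /* Simulated run: two good beats, then silence (cable pulled or NIC died). */
    bool beats[] = { true, true, false, false, false };
    for (int i = 0; i < 5; i++) {
        if (partner_presumed_dead(beats[i])) {
            printf("interval %d: no heartbeat -- initiating failover\n", i);
            break;
        }
    }
    return 0;
}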

Page 29: MSCS Clustering Implementation

Quorum Disk

• Special cluster resource that stores the cluster log

• When a node joins a cluster, it attempts to reserve the quorum disk

– If the quorum disk does not have an owner, the node takes ownership and forms a cluster

– If the quorum disk has an owner, the node joins the cluster

[Diagram: two nodes with HBAs on the disk interconnect and the cluster "heartbeat" interconnect; the quorum disk sits in the shared RAID array]

Page 30: MSCS Clustering Implementation

Quorum Disk (Contd.)

• If Nodes Cannot Communicate (no heartbeats)

– Then only one is allowed to continue operating

– They use the quorum disk to decide which one lives

– Each node waits, then tries to reserve the quorum disk

– The last owner waits the shortest time; if it is still alive, it takes ownership of the quorum disk

– When the other node attempts to reserve the quorum disk, it finds that it is already owned

– The node that does not own the quorum disk then fails over

– This is called the Challenge / Defense Protocol
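
The Challenge / Defense Protocol can be pictured with the toy simulation below: both nodes lose heartbeats, the previous quorum owner challenges first (it waits the shortest time), and the node that finds the disk already reserved fails over. The try_reserve() helper and the data structures are illustrative stand-ins for the SCSI reservation on the quorum disk.

#include <stdbool.h>
#include <stdio.h>

/* Toy quorum disk: -1 means unreserved, otherwise the owning node's id. */
static int quorum_owner = -1;

/* Stand-in for a SCSI reserve of the quorum disk; succeeds only if it is free. */
static bool try_reserve(int node_id)
{
    if (quorum_owner == -1) {
        quorum_owner = node_id;
        return true;
    }
    return false;
}

int main(void)
{
    int last_owner = 0;              /* node 0 owned the quorum disk before the split */
    int challenger = 1 - last_owner;

    quorum_owner = -1;               /* heartbeats lost; ownership is re-arbitrated */

    /* The last owner waits the shortest time, so it challenges first. */
    int order[2] = { last_owner, challenger };
    for (int i = 0; i < 2; i++) {
        if (try_reserve(order[i]))
            printf("node %d reserved the quorum disk and keeps running\n", order[i]);
        else
            printf("node %d found the quorum disk owned and fails over\n", order[i]);
    }
    return 0;
}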

Page 31: MSCS Clustering Implementation

Microsoft Cluster Server (MSCS)

• MSCS Objects
– There are many MSCS objects, but only two we care about here

• Resources and Groups

• Resources
– Applications, data files, disks, IP addresses, ...

• Groups
– An application and its related resources, such as data on disks
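
As a rough data-model sketch of these two objects (and of the failover behavior on the following slides), the C fragment below defines hypothetical Group and Resource structures and a fail_over() routine that moves every group owned by a failed node to the survivor. None of these names come from the actual MSCS API.

#include <stdio.h>

#define MAX_RESOURCES 8

/* A resource: an application, disk, IP address, data file, ... */
typedef struct {
    const char *name;
} Resource;

/* A group: an application plus its related resources, owned by one node. */
typedef struct {
    const char *name;
    Resource    resources[MAX_RESOURCES];
    int         resource_count;
    int         owner_node;          /* which cluster node currently hosts it */
} Group;

/* Failover: move every group owned by the failed node to the survivor,
 * taking its resources (including its disks) along with it. */
static void fail_over(Group *groups, int count, int failed_node, int survivor)
{
    for (int i = 0; i < count; i++) {
        if (groups[i].owner_node == failed_node) {
            groups[i].owner_node = survivor;
            printf("group %s failed over to node %d\n", groups[i].name, survivor);
        }
    }
}

int main(void)
{
    Group groups[] = {
        { "Mail", { {"mail app"}, {"mail disk"}, {"mail IP"} }, 3, 0 },
        { "Web",  { {"web app"},  {"web disk"},  {"web IP"}  }, 3, 1 },
    };
    fail_over(groups, 2, 0, 1);      /* node 0 dies; Mail moves to node 1 */
    return 0;
}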

Page 32: MSCS Clustering Implementation

Microsoft Cluster Server (MSCS)

• When a server dies, its groups fail over

• When a server is repaired and returned to service, the groups fail back

• Since data on disks is included in groups, disks fail over and fail back as well

[Diagram: the Mail and Web groups, each containing its resources, distributed across the two cluster nodes]

Page 33: MSCS Clustering Implementation

Groups Failover

• Groups are the entities that fail over

• And they take their disks with them

[Diagram: when a node fails, its groups (e.g., Mail) and their resources fail over to the surviving node]

Page 34: MSCS Clustering Implementation

Microsoft Cluster Certification

• Two Levels of Certification

– Cluster Component Certification
• HBAs and RAID controllers must be certified
• When they pass:
- They’re listed on the Microsoft web site: www.microsoft.com/hwtest/hcl/
- They’re eligible for inclusion in cluster system certification

– Cluster System Certification
• A complete two-node cluster is certified
• When they pass:
- They’re listed on the Microsoft web site
- They’ll be supported by Microsoft

• Each Certification Takes 30 - 60 Days

Page 35: MSCS Clustering Implementation

Mylex’s Clustering Implementation

eXtremeRAID 1100 PCI-to-Ultra2 SCSI RAID

Page 36: MSCS Clustering Implementation

NT Cluster With Host-Based RAID Array

– Nodes have:
• Ethernet NIC -- Heartbeat
• Private system disks (HBA)
• PCI-based RAID controller

– Nodes share access to data disks but do not share data

[Diagram: two nodes, each with an eXtremeRAID controller, NIC, and HBA, joined by the "heartbeat" interconnect and three shared Ultra2 interconnects]

Page 37: MSCS Clustering Implementation

MSCS Requirement for Shared Storage Bus

• A local drive is needed for the boot OS and file system

• At any time, only one node has sole ownership of a shared drive

• MSCS supports only the SCSI protocol for the shared bus

• Certain SCSI commands are required for clustered shared devices
– Reserve, Release, Test Unit Ready, Inquiry

– Support for DPO (Disable Page Out) and FUA (Force Unit Access) in read/write commands

• Support for multiple initiators, and the ability to handle SCSI Bus Reset and Bus Device Reset

• The controller must handle cluster partner node shutdown or removal -- SCSI bus transitions, reset, and termination control

• Operating System Control Access
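
To make the required command set concrete, here is a minimal sketch that builds the 6-byte CDBs for the SCSI-2 commands listed above (Test Unit Ready 0x00, Inquiry 0x12, Reserve 0x16, Release 0x17) and a WRITE(10) with the FUA bit set. The send_cdb() transport is a placeholder of my own; a real miniport driver would hand the CDB to its SCSI port layer.

#include <stdint.h>
#include <stdio.h>

/* SCSI-2 opcodes used on the shared bus. */
enum {
    OP_TEST_UNIT_READY = 0x00,
    OP_INQUIRY         = 0x12,
    OP_RESERVE6        = 0x16,
    OP_RELEASE6        = 0x17,
    OP_WRITE10         = 0x2A,
};

/* Placeholder transport: just prints the CDB bytes. */
static void send_cdb(const uint8_t *cdb, size_t len)
{
    printf("CDB:");
    for (size_t i = 0; i < len; i++)
        printf(" %02X", (unsigned)cdb[i]);
    printf("\n");
}

/* Reserve or release a target so only one initiator owns it at a time. */
static void reserve_or_release(int reserve)
{
    uint8_t cdb[6] = {0};
    cdb[0] = reserve ? OP_RESERVE6 : OP_RELEASE6;
    send_cdb(cdb, sizeof cdb);
}

/* WRITE(10) with the FUA bit set so the data bypasses the drive cache. */
static void write10_fua(uint32_t lba, uint16_t blocks)
{
    uint8_t cdb[10] = {0};
    cdb[0] = OP_WRITE10;
    cdb[1] = 0x08;                              /* FUA bit; DPO would be 0x10 */
    cdb[2] = (uint8_t)(lba >> 24);
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(blocks >> 8);
    cdb[8] = (uint8_t)blocks;
    send_cdb(cdb, sizeof cdb);
}

int main(void)
{
    reserve_or_release(1);                      /* claim the shared disk */
    write10_fua(0x1000, 8);                     /* forced-unit-access write */
    reserve_or_release(0);                      /* release it */
    return 0;
}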

Page 38: MSCS Clustering Implementation

Mylex RAID Products for MSCS Clustering

• Controllers supported -- LVD-based
– eXtremeRAID (DAC1164P)

• LVD mode is recommended for long cabling distances (up to 12 m); single-ended mode is limited to 3 m and requires a SCSI bus extender for longer distances

[Diagram: two nodes, each with an eXtremeRAID controller, NIC, and HBA, joined by the "heartbeat" interconnect and the shared disk interconnect]

Page 39: MSCS Clustering Implementation

eXtremeRAID 1100: Technology

[Board diagram: eXtremeRAID 1100 -- CPU, three SCSI channels (Ch 0 bottom, Ch 1, Ch 2 top), SCSI chips, PCI bridge (FootBridge), BASS, LEDs, serial port, DAC memory module with BBU, and NVRAM]

Page 40: MSCS Clustering Implementation

eXtremeRAID 1100: Architecture

[Architecture diagram: RISC CPU with CPU bridge, flash, NVRAM, and SDRAM; three SCSI ASICs, each driving a 16-bit LVD SCSI channel at 80 MB/s, on a 32-bit 33 MHz secondary PCI bus; host PCI-to-PCI bridge to a 64-bit 33 MHz host PCI bus]

Page 41: MSCS Clustering Implementation

Mylex PCI RAID’s Two-node Cluster

• Emulates the SCSI shared-bus requirements through the NT miniport driver and RAID firmware
– Treats RAID logical volumes as physical disk drives

– Supports Reserve/Release and other cluster-related SCSI commands in firmware through a volume reservation table

– Honors DPO, FUA, and flush operations in firmware

• RAID configuration, fault management, enclosure management, and volume Reserve/Release are administered by a master/slave mechanism

• Communication between the RAID controllers in the two nodes is established over the back-end SCSI bus -- heartbeats, cluster commands, RAID configuration, and fault management
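
A rough sketch of how such a volume reservation table might look in firmware: one owner slot per logical volume, with Reserve, Release, and Test Unit Ready answered from the table (returning RESERVATION CONFLICT, 0x18, to the non-owning initiator). The names and structure are assumptions for illustration, not Mylex firmware internals.

#include <stdio.h>

#define MAX_VOLUMES 8
#define NO_OWNER    (-1)

/* One entry per RAID logical volume: which initiator ID holds the reservation. */
static int reservation_table[MAX_VOLUMES] = {
    NO_OWNER, NO_OWNER, NO_OWNER, NO_OWNER,
    NO_OWNER, NO_OWNER, NO_OWNER, NO_OWNER,
};

/* Simplified SCSI status codes. */
enum { STATUS_GOOD = 0x00, STATUS_RESERVATION_CONFLICT = 0x18 };

static int volume_reserve(int vol, int initiator)
{
    if (reservation_table[vol] != NO_OWNER && reservation_table[vol] != initiator)
        return STATUS_RESERVATION_CONFLICT;   /* the other node owns it */
    reservation_table[vol] = initiator;
    return STATUS_GOOD;
}

static int volume_release(int vol, int initiator)
{
    if (reservation_table[vol] == initiator)
        reservation_table[vol] = NO_OWNER;
    return STATUS_GOOD;                       /* releasing an unowned volume is a no-op */
}

static int volume_test_unit_ready(int vol, int initiator)
{
    if (reservation_table[vol] != NO_OWNER && reservation_table[vol] != initiator)
        return STATUS_RESERVATION_CONFLICT;
    return STATUS_GOOD;
}

int main(void)
{
    volume_reserve(0, 6);                                             /* node A (initiator 6) claims volume 0 */
    printf("node B TUR     -> 0x%02X\n", volume_test_unit_ready(0, 7)); /* conflict */
    volume_release(0, 6);
    printf("node B reserve -> 0x%02X\n", volume_reserve(0, 7));        /* now succeeds */
    return 0;
}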

Page 42: MSCS Clustering Implementation

Master-Slave Concept

• Master/slave is a controller concept and is transparent to the host system and OS

• Master/slave status is independent of the server cluster-node status

• The first controller to establish itself acts as master; the later one acts as slave

• If one node fails or goes offline, the surviving node becomes master

• Node discovery is initiated by a SCSI Bus Reset and kept alive by heartbeat communication over the back-end shared SCSI bus

[Diagram: Node A (master) and Node B (slave) eXtremeRAID controllers exchange RAID heartbeat and communication over the back-end SCSI buses]
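
The master/slave rule above reduces to a small state machine, sketched below: the first controller to join the back-end bus becomes master, a later arrival becomes slave, and when heartbeats from the partner stop, the survivor promotes itself. The types and function names are illustrative only.

#include <stdio.h>

typedef enum { ROLE_NONE, ROLE_MASTER, ROLE_SLAVE } Role;

typedef struct {
    const char *name;
    Role        role;
} Controller;

static const char *role_name(Role r)
{
    return r == ROLE_MASTER ? "master" : r == ROLE_SLAVE ? "slave" : "none";
}

/* A controller joining the back-end bus: the first one in becomes master,
 * a later arrival becomes slave. */
static void controller_join(Controller *self, const Controller *partner)
{
    self->role = (partner->role == ROLE_MASTER) ? ROLE_SLAVE : ROLE_MASTER;
    printf("%s joins as %s\n", self->name, role_name(self->role));
}

/* The partner stopped answering heartbeats on the back-end SCSI bus:
 * the survivor becomes (or stays) master. */
static void partner_lost(Controller *self)
{
    self->role = ROLE_MASTER;
    printf("%s promoted to %s\n", self->name, role_name(self->role));
}

int main(void)
{
    Controller a = { "controller A", ROLE_NONE };
    Controller b = { "controller B", ROLE_NONE };

    controller_join(&a, &b);   /* A is first: master */
    controller_join(&b, &a);   /* B is later: slave  */
    partner_lost(&b);          /* node A fails: B becomes master */
    return 0;
}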

Page 43: MSCS Clustering Implementation

Master/Slave Perspective

• Only the master manages RAID configuration changes and the fault/rebuild process
– RAID configuration and fault management can be initiated from either node or invoked from DACCF/GAM

– COD updates are done by the master, which tells the slave to update its NVRAM information

– The master manages the rebuild process and can delegate tasks to the slave

• Enclosure management (SAF-TE) is administered by the master

• Logical volume Reserve/Release operations are communicated between master and slave over the back-end shared SCSI bus

Page 44: MSCS Clustering Implementation

Termination Control and Bus Isolation

• In a cluster setup, a server node may be powered on, shut down, or removed for upgrade or maintenance

• Mylex-supplied Terminator Switch Box
– Contains an LVD/SE terminator and fast silicon switches
– When server node power is on
• The terminator is off and the SCSI signal passes through
– When server node power is off or the node is removed
• The terminator is on and the SCSI signal is isolated from the server node

[Diagram: Server Node A and Server Node B, each with a DAC1164P, connect to the disk box through terminator switch boxes]
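
The switch box behavior amounts to one rule, modeled below purely for illustration: node powered means terminator off and signals pass through; node off or removed means terminator on and the node is isolated.

#include <stdbool.h>
#include <stdio.h>

/* Behavioural model of one terminator switch box port. */
typedef struct {
    bool node_powered;     /* is the attached server node powered on? */
} SwitchPort;

static bool terminator_enabled(const SwitchPort *p)
{
    /* Node off or removed: terminate the bus and isolate the node.
     * Node on: terminator off, SCSI signals pass straight through. */
    return !p->node_powered;
}

int main(void)
{
    SwitchPort port = { .node_powered = true };
    printf("node on : terminator %s\n", terminator_enabled(&port) ? "on" : "off");

    port.node_powered = false;          /* node shut down for maintenance */
    printf("node off: terminator %s\n", terminator_enabled(&port) ? "on" : "off");
    return 0;
}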

Page 45: MSCS Clustering Implementation

Mylex’s Clustering Support Elements

• Two-node NT 4.0 clustering only (MSCS)
– FW 5.07C for eXtremeRAID

– BIOS support for the cluster nexus establishment message

– DACCF/BCU modifications for initiator ID and clustering support

– NT miniport driver modifications to support cluster-related SCSI commands

– GAM driver, server, and clients: no changes

[Diagram: software stack -- FW, BIOS, BCU, DACCF, miniport driver, GAM driver, GAM server, and GAM clients over TCP/IP]

Page 46: MSCS Clustering Implementation

Global Array Management (GAM)

• GAM: client/server RAID management tool using the TCP/IP protocol

– Uses a virtual IP to present a single RAID subsystem image (physical IPs can be used to view the two physical nodes if needed)

– Either the master or the slave is viewed, depending on the current cluster group; GAM task requests are communicated over the back-end SCSI bus and administered by the master controller

[Diagram: GAM clients connect over TCP/IP to a virtual IP presenting a single system image; a GAM server on each node talks to the eXtremeRAID controllers on the shared disk interconnect]
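
As a loose illustration of the client/server model, the POSIX C sketch below simply opens a TCP connection to the cluster's virtual IP; whichever node currently holds that address answers, and the request is then handled by the master controller. The address and port number are placeholders, and the real GAM wire protocol is not shown.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder virtual IP and port for the cluster's single system image. */
    const char *virtual_ip = "192.168.1.100";
    const int   gam_port   = 1157;            /* hypothetical port number */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(gam_port);
    inet_pton(AF_INET, virtual_ip, &addr.sin_addr);

    /* Whichever node currently holds the virtual IP answers; the management
     * request is then handled by the master controller over the back-end bus. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0)
        printf("connected to GAM server at %s\n", virtual_ip);
    else
        perror("connect");

    close(fd);
    return 0;
}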

Page 47: MSCS Clustering Implementation

Mylex Clustering Approach

• Same FW, BIOS, driver, and utilities for clustering and non-clustering support

• Support for full-featured Mylex RAID controller functions
– Full RAID configuration through DACCF and GAM

– Hot Swap, Hot spare, RAID Rebuild

– Background consistency check

– Background Initialization

– SAF-TE enclosure management

• MORE -- Mylex Online Capacity Expansion and RAID migration are not supported in a cluster configuration

• Maintains TPC-C world-record performance
– Minimal performance impact from master/slave heartbeat monitoring

– Write-back caching is disabled for cluster data availability and integrity

Page 48: MSCS Clustering Implementation

WHQL Clustering Certification

• Passed Microsoft SDG 1.0 (Server Design Guide); submitted to the WHQL certification queue

• Passed MSCS HCT 8.0 and Clustering Certification Pre-submission test

– MSCS System Validation -- Phases 1-3 tested
• Tested on Intel Madrona, Nightshade, and Sitka based systems
• Test logs to be submitted to Microsoft in early December 1998

[Diagram: two-node test cluster -- eXtremeRAID controllers, HBAs, and NICs on the shared disk and "heartbeat" interconnects, with Cluster Administrator and clients attached]

Page 49: MSCS Clustering Implementation

Mylex Clustering Restrictions

• Only two-node MSCS clustering is supported

• The boot and file system must be on a local drive, separate from the shared bus -- per MSCS requirements

• The shared bus includes all SCSI channels on both controllers; all shared devices must be on the same channel on the two clustered controllers

• Only SCSI hard disks and SAF-TE devices are allowed on the shared bus.

• Write-back caching is disabled

• MORE is not supported

• SCSI devices must support multiple initiators, SCSI bus reset, and bus device reset

Page 50: MSCS Clustering Implementation

Mylex: Recommended Installation

• Set up the controller initiator ID and enable cluster support for each node through DACCF while the two nodes are still separate

• Disable the RAID controller BIOS on both nodes, since the RAID controller is not controlling the boot device

• Run the RAID configuration, using DACCF, on one node

• Connect the two nodes together using the Mylex terminator switch box and cabling

• Ready to go -- just follow the Microsoft Cluster Server Administrator's Guide for the clustering installation

Page 51: MSCS Clustering Implementation

Mylex’s Installation Tips

• Disable termination on all of the drives and the drive box.

• Be sure there are no SCSI ID conflicts with the drives and SAF-TE processors.

• Use LVD (Low Voltage Differential) over SE (Single Ended) drives and enclosures because of SE cable length restrictions.

• If using SE, repeaters are suggested.

• For optimum performance, create two packs -- one pack per controller.

• Do not create multiple partitions on a shared drive; MSCS can only fail over a physical drive.

• MSCS only supports NTFS partitions.

• Failback needs to be set manually within MSCS; otherwise, the server that loads the MSCS services first will get all of the resources.