
Page 1: MSCS Clustering Implementation

MSCS Clustering Implementation

Mylex eXtremeRAID 1100

PCI-to-Ultra2 SCSI RAID Controllers

Page 2: MSCS Clustering Implementation

Clustering: Basics

Page 3: MSCS Clustering Implementation

What Are Clusters?

• Group of independent systems that

– Function as a single system

– Appear to users as a single system

– And are managed as a single system

• Clusters are “virtual servers”

Page 4: MSCS Clustering Implementation

Why Clusters?

• Clusters Improve System Availability

– This is the primary value in Wolfpack-I clusters

• Clusters Enable Application Scaling

• Clusters Simplify System Management

• Clusters (with Intel servers) Are Cheap

Page 5: MSCS Clustering Implementation

System Availability

• Clusters Improve System Availability

– When a networked server fails, the service it provided is down

– When a clustered server fails, the service it provided "fails over" and downtime is avoided

[Diagram: networked servers (separate Mail and Internet servers) vs. clustered servers (Mail & Internet fail over between nodes)]

Page 6: MSCS Clustering Implementation

Application Scaling

• Clusters Enable Application Scaling

– With networked SMP servers, application scaling is limited to a single server

– With clusters, applications scale across multiple SMP servers (typically up to 16 servers)

Page 7: MSCS Clustering Implementation

Simple Systems Management

• Clusters Simplify System Management

– Clusters present a Single System Image; the cluster looks like a single server to management applications

– Hence, clusters reduce system management costs

[Diagram: three management domains vs. one management domain]

Page 8: MSCS Clustering Implementation

Inexpensive

• Clusters (with Intel servers) Are Cheap

– Essentially no additional hardware cost -- clusters use readily available, high-volume server hardware

– Microsoft charges an extra $3K per node

• Windows NT Server: $1,000
• Windows NT Server, Enterprise Edition: $4,000

Note: Proprietary Unix cluster software costs $10K to $25K per node.

Page 9: MSCS Clustering Implementation

An Analogy to RAID

• RAID Makes Disks Fault Tolerant

– Clusters make servers fault tolerant

• RAID Increases I/O Performance

– Clusters increase compute performance

• RAID Makes Disks Easier to Manage

– Clusters make servers easier to manage

Page 10: MSCS Clustering Implementation

Two Flavors of Clusters

• High Availability Clusters

• Microsoft’s Wolfpack 1
• Compaq’s Recovery Server

• Load Balancing Clusters (a.k.a. Parallel Application Clusters)

• Microsoft’s Wolfpack 2
• Digital’s VAXClusters

Note: Load balancing clusters are a superset of high availability clusters.

Page 11: MSCS Clustering Implementation

High Availability Clusters

• Two node clusters (node = server)

• During normal operations, both servers do useful work

• Failover

– When a node fails, applications fail over to the surviving node, which assumes the workload of both nodes

[Diagram: Mail on one node and Web on the other; after failover, one node runs Mail & Web]

Page 12: MSCS Clustering Implementation

High Availability Clusters (Contd.)

• Failback

– When the failed node is returned to service, the applications fail back

[Diagram: after failback, Mail and Web run on their original nodes]

Page 13: MSCS Clustering Implementation

Load Balancing Clusters

• Multi-node clusters (two or more nodes)

• Load balancing clusters typically run a single application, e.g. database, distributed across all nodes

• Cluster capacity is increased by adding nodes (but like SMP servers, scaling is less than linear)

[Diagram: adding a node increases cluster throughput from 3,000 TPM to 3,600 TPM]

Page 14: MSCS Clustering Implementation

Load Balancing Clusters (Contd.)

• Cluster rebalances the workload when a node dies

• If different apps are running on each server, they fail over to the least busy server or as directed by predefined failover policies

Page 15: MSCS Clustering Implementation

Two Cluster Models

• “Shared Nothing” Model

– Microsoft’s Wolfpack Cluster

• “Shared Disk” Model

– VAXClusters

Page 16: MSCS Clustering Implementation

“Shared Nothing” Model

• At any moment in time, each disk is owned and addressable by only one server

• “Shared nothing” terminology is confusing

• Access to disks is shared -- on the same bus
• But at any moment in time, disks are not shared

Page 17: MSCS Clustering Implementation

“Shared Nothing” Model (Contd.)

• When a server fails, the disks that it owns "fail over" to the surviving server, transparently to the clients

Page 18: MSCS Clustering Implementation

“Shared Disk” Model

• Disks are not owned by servers but shared by all servers

• At any moment in time, any server can access any disk

• Distributed Lock Manager arbitrates disk access so apps on different servers don’t step on one another (corrupt data)

Page 19: MSCS Clustering Implementation

Cluster Interconnect

• This is about how servers are tied together and how disks are physically connected to the cluster

• Clustered servers always have a client network interconnect, typically Ethernet, to talk to users

• And at least one cluster interconnect to talk to other nodes and to disks

[Diagram: two servers with HBAs share a RAID array over the cluster interconnect and serve clients over the client network]

Page 20: MSCS Clustering Implementation

Cluster Interconnect (Contd.)

• Or They Can Have Two Cluster Interconnects

– One for nodes to talk to each other -- “Heartbeat Interconnect”

• Typically Ethernet

– And one for nodes to talk to disks -- “Shared Disk Interconnect”

• Typically SCSI or Fibre Channel

[Diagram: NICs connect the nodes over the cluster (heartbeat) interconnect; HBAs connect them to the RAID array over the shared disk interconnect]

Page 21: MSCS Clustering Implementation

Microsoft Cluster Server (MSCS)

Wolfpack

Page 22: MSCS Clustering Implementation

Clusters Are Not New

• Clusters Have been Around Since 1985

• Most UNIX Systems are Clustered

• What’s New is Microsoft Clusters

– Code named “Wolfpack”

– Named Microsoft Cluster Server (MSCS)

• Software that provides clustering

– MSCS is part of Windows NT Server, Enterprise Edition 4.0

Page 23: MSCS Clustering Implementation

Microsoft Cluster Rollout

• Wolfpack-I

– In Windows NT Server, Enterprise Edition 4.0 (NT/E 4.0) [also includes Transaction Server and reliable message queuing]

– Two node “failover cluster”

– Shipped October, 1997

• Wolfpack-II

– In (or after) Windows 2000 Advanced Server

– Borrows components from the more robust Tandem and Digital cluster technology (Compaq technology sharing)

– “N” node (probably up to 16) “load balancing cluster”

– Beta in 1998, shipping in 1999?

Page 24: MSCS Clustering Implementation

MSCS (NT/E, 4.0) Overview

• Two Node “Failover” Cluster

• “Shared Nothing” Model

– At any moment in time, each disk is owned and addressable by only one server

• Two Cluster Interconnects

– “Heartbeat” cluster interconnect

• Ethernet

– Shared disk interconnect

• SCSI (any flavor)
• Fibre Channel (SCSI protocol over Fibre Channel)

• Each Node Has a “Private System Disk”

– Boot disk

Page 25: MSCS Clustering Implementation

MSCS (NT/E, 4.0) Topologies

• Host-based (PCI) RAID Arrays

• External RAID Arrays

Page 26: MSCS Clustering Implementation

NT Cluster With Host-Based RAID Array

• Each node has:

– Ethernet NIC -- Heartbeat

– Private system disk (generally on an HBA)

– PCI-based RAID controller -- SCSI or Fibre

• Nodes share access to data disks but do not share data

[Diagram: two nodes with NICs on the "heartbeat" interconnect and PCI RAID controllers plus HBAs on the shared disk interconnect to the RAID array]

Page 27: MSCS Clustering Implementation

NT Cluster With External RAID Array

• Each node has:

– Ethernet NIC -- Heartbeat

– Multi-channel HBAs connect the boot disk and the external array

• Shared external RAID controller on the SCSI or FC Bus -- Mylex’s DAC-SX, DAC-FL, DAC-FF products

[Diagram: two nodes with NICs on the "heartbeat" interconnect and HBAs on the shared disk interconnect to the external RAID array]

Page 28: MSCS Clustering Implementation

Cluster Interconnect and Heartbeats

• Cluster Interconnect
– Private Ethernet between the nodes

– Used to transmit "I'm alive" heartbeat messages

• Heartbeat Messages
– When a node stops getting heartbeats, it assumes the other node has died and initiates failover

– In some failure modes both nodes stop getting heartbeats (a NIC dies or someone trips over the cluster cable)

• Both nodes are still alive
• But each thinks the other is dead
• Split-brain syndrome
• Both nodes initiate failover
• Who wins?
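
To make the heartbeat mechanism concrete, below is a minimal sketch (in C) of how a node might decide its partner has died: it counts consecutive missed heartbeat intervals and triggers failover after a threshold. The interval, threshold, and function names are illustrative assumptions, not MSCS internals.

#include <stdbool.h>
#include <stdio.h>

#define HEARTBEAT_INTERVAL_MS 1000   /* assumed: partner sends "I'm alive" once per second */
#define MISSED_LIMIT          3      /* assumed: declare the partner dead after 3 silent intervals */

/* Called once per heartbeat interval with whether a heartbeat arrived.
 * Returns true when failover should be initiated. */
static bool partner_presumed_dead(bool heartbeat_received)
{
    static int missed = 0;

    if (heartbeat_received) {
        missed = 0;                  /* partner is alive; reset the counter */
        return false;
    }
    if (++missed >= MISSED_LIMIT)    /* too many silent intervals */
        return true;                 /* assume the partner died; start failover */
    return false;
}

int main(void)
{
    /* Simulated run: two good beats, then silence (cable pulled or NIC died). */
    bool beats[] = { true, true, false, false, false };
    for (int i = 0; i < 5; i++) {
        if (partner_presumed_dead(beats[i])) {
            printf("interval %d: no heartbeat -- initiating failover\n", i);
            break;
        }
    }
    return 0;
}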

Page 29: MSCS Clustering Implementation

Quorum Disk

• Special cluster resource that stores the cluster log

• When a node joins a cluster, it attempts to reserve the quorum disk

– If the quorum disk does not have an owner, the node takes ownership and forms a cluster

– If the quorum disk has an owner, the node joins the cluster

[Diagram: two nodes with HBAs on the disk interconnect and the cluster "heartbeat" interconnect; the quorum disk sits in the shared RAID array]

Page 30: MSCS Clustering Implementation

Quorum Disk (Contd.)

• If Nodes Cannot Communicate (no heartbeats)

– Then only one is allowed to continue operating

– They use the quorum disk to decide which one lives

– Each node waits, then tries to reserve the quorum disk

– The last owner waits the shortest time; if it is still alive, it takes ownership of the quorum disk

– When the other node attempts to reserve the quorum disk, it finds that it is already owned

– The node that does not own the quorum disk then fails over

– This is called the Challenge / Defense Protocol
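
The Challenge / Defense Protocol can be pictured with the toy simulation below: both nodes lose heartbeats, the previous quorum owner challenges first (it waits the shortest time), and the node that finds the disk already reserved fails over. The try_reserve() helper and the data structures are illustrative stand-ins for the SCSI reservation on the quorum disk.

#include <stdbool.h>
#include <stdio.h>

/* Toy quorum disk: -1 means unreserved, otherwise the owning node's id. */
static int quorum_owner = -1;

/* Stand-in for a SCSI reserve of the quorum disk; succeeds only if it is free. */
static bool try_reserve(int node_id)
{
    if (quorum_owner == -1) {
        quorum_owner = node_id;
        return true;
    }
    return false;
}

int main(void)
{
    int last_owner = 0;              /* node 0 owned the quorum disk before the split */
    int challenger = 1 - last_owner;

    quorum_owner = -1;               /* heartbeats lost; ownership is re-arbitrated */

    /* The last owner waits the shortest time, so it challenges first. */
    int order[2] = { last_owner, challenger };
    for (int i = 0; i < 2; i++) {
        if (try_reserve(order[i]))
            printf("node %d reserved the quorum disk and keeps running\n", order[i]);
        else
            printf("node %d found the quorum disk owned and fails over\n", order[i]);
    }
    return 0;
}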

Page 31: MSCS Clustering Implementation

Microsoft Cluster Server (MSCS)

• MSCS Objects
– There are many MSCS objects, but only two we care about here

• Resources and Groups

• Resources
– Applications, data files, disks, IP addresses, ...

• Groups
– An application and its related resources, such as data on disks
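
As a rough data-model sketch of these two objects (and of the failover behavior on the following slides), the C fragment below defines hypothetical Group and Resource structures and a fail_over() routine that moves every group owned by a failed node to the survivor. None of these names come from the actual MSCS API.

#include <stdio.h>

#define MAX_RESOURCES 8

/* A resource: an application, disk, IP address, data file, ... */
typedef struct {
    const char *name;
} Resource;

/* A group: an application plus its related resources, owned by one node. */
typedef struct {
    const char *name;
    Resource    resources[MAX_RESOURCES];
    int         resource_count;
    int         owner_node;          /* which cluster node currently hosts it */
} Group;

/* Failover: move every group owned by the failed node to the survivor,
 * taking its resources (including its disks) along with it. */
static void fail_over(Group *groups, int count, int failed_node, int survivor)
{
    for (int i = 0; i < count; i++) {
        if (groups[i].owner_node == failed_node) {
            groups[i].owner_node = survivor;
            printf("group %s failed over to node %d\n", groups[i].name, survivor);
        }
    }
}

int main(void)
{
    Group groups[] = {
        { "Mail", { {"mail app"}, {"mail disk"}, {"mail IP"} }, 3, 0 },
        { "Web",  { {"web app"},  {"web disk"},  {"web IP"}  }, 3, 1 },
    };
    fail_over(groups, 2, 0, 1);      /* node 0 dies; Mail moves to node 1 */
    return 0;
}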

Page 32: MSCS Clustering Implementation

Microsoft Cluster Server (MSCS)

• When a server dies, its groups fail over

• When a server is repaired and returned to service, the groups fail back

• Since data on disks is included in groups, disks fail over and fail back as well

[Diagram: the Mail and Web groups, each containing its resources, distributed across the two cluster nodes]

Page 33: MSCS Clustering Implementation

Groups Failover

• Groups are the entities that fail over

• And they take their disks with them

[Diagram: when a node fails, its groups (e.g., Mail) and their resources fail over to the surviving node]

Page 34: MSCS Clustering Implementation

Microsoft Cluster Certification

• Two Levels of Certification

– Cluster Component Certification
• HBAs and RAID controllers must be certified
• When they pass:
- They’re listed on the Microsoft web site: www.microsoft.com/hwtest/hcl/
- They’re eligible for inclusion in cluster system certification

– Cluster System Certification
• A complete two-node cluster is certified
• When they pass:
- They’re listed on the Microsoft web site
- They’ll be supported by Microsoft

• Each Certification Takes 30 - 60 Days

Page 35: MSCS Clustering Implementation

Mylex’s Clustering Implementation

eXtremeRAID 1100 PCI-to-Ultra2 SCSI RAID

Page 36: MSCS Clustering Implementation

NT Cluster With Host-Based RAID Array

– Nodes have:
• Ethernet NIC -- Heartbeat
• Private system disks (HBA)
• PCI-based RAID controller

– Nodes share access to data disks but do not share data

[Diagram: two nodes, each with an eXtremeRAID controller, NIC, and HBA, joined by the "heartbeat" interconnect and three shared Ultra2 interconnects]

Page 37: MSCS Clustering Implementation

MSCS Requirement for Shared Storage Bus

• A local drive is needed for the boot OS and file system

• At any time, only one node has sole ownership of a shared drive

• MSCS supports only the SCSI protocol for the shared bus

• Certain SCSI commands are required for clustered shared devices
– Reserve, Release, Test Unit Ready, Inquiry

– Support for DPO (Disable Page Out) and FUA (Force Unit Access) in read/write commands

• Support for multiple initiators, and the ability to handle SCSI Bus Reset and Bus Device Reset

• The controller must handle cluster partner node shutdown or removal -- SCSI bus transitions, reset, and termination control

• Operating System Control Access
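
To make the required command set concrete, here is a minimal sketch that builds the 6-byte CDBs for the SCSI-2 commands listed above (Test Unit Ready 0x00, Inquiry 0x12, Reserve 0x16, Release 0x17) and a WRITE(10) with the FUA bit set. The send_cdb() transport is a placeholder of my own; a real miniport driver would hand the CDB to its SCSI port layer.

#include <stdint.h>
#include <stdio.h>

/* SCSI-2 opcodes used on the shared bus. */
enum {
    OP_TEST_UNIT_READY = 0x00,
    OP_INQUIRY         = 0x12,
    OP_RESERVE6        = 0x16,
    OP_RELEASE6        = 0x17,
    OP_WRITE10         = 0x2A,
};

/* Placeholder transport: just prints the CDB bytes. */
static void send_cdb(const uint8_t *cdb, size_t len)
{
    printf("CDB:");
    for (size_t i = 0; i < len; i++)
        printf(" %02X", (unsigned)cdb[i]);
    printf("\n");
}

/* Reserve or release a target so only one initiator owns it at a time. */
static void reserve_or_release(int reserve)
{
    uint8_t cdb[6] = {0};
    cdb[0] = reserve ? OP_RESERVE6 : OP_RELEASE6;
    send_cdb(cdb, sizeof cdb);
}

/* WRITE(10) with the FUA bit set so the data bypasses the drive cache. */
static void write10_fua(uint32_t lba, uint16_t blocks)
{
    uint8_t cdb[10] = {0};
    cdb[0] = OP_WRITE10;
    cdb[1] = 0x08;                              /* FUA bit; DPO would be 0x10 */
    cdb[2] = (uint8_t)(lba >> 24);
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(blocks >> 8);
    cdb[8] = (uint8_t)blocks;
    send_cdb(cdb, sizeof cdb);
}

int main(void)
{
    reserve_or_release(1);                      /* claim the shared disk */
    write10_fua(0x1000, 8);                     /* forced-unit-access write */
    reserve_or_release(0);                      /* release it */
    return 0;
}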

Page 38: MSCS Clustering Implementation

Mylex RAID Products for MSCS Clustering

• Controllers supported -- LVD-based
– eXtremeRAID (DAC1164P)

• LVD mode is recommended for long cabling distances (up to 12 m); single-ended mode is limited to 3 m and requires a SCSI bus extender for longer distances

[Diagram: two nodes, each with an eXtremeRAID controller, NIC, and HBA, joined by the "heartbeat" interconnect and the shared disk interconnect]

Page 39: MSCS Clustering Implementation

eXtremeRAID 1100: Technology

[Board diagram: eXtremeRAID 1100 -- CPU, three SCSI channels (Ch 0 bottom, Ch 1, Ch 2 top), SCSI chips, PCI bridge (FootBridge), BASS, LEDs, serial port, DAC memory module with BBU, and NVRAM]

Page 40: MSCS Clustering Implementation

eXtremeRAID 1100: Architecture

[Architecture diagram: RISC CPU with CPU bridge, flash, NVRAM, and SDRAM; three SCSI ASICs, each driving a 16-bit LVD SCSI channel at 80 MB/s, on a 32-bit 33 MHz secondary PCI bus; host PCI-to-PCI bridge to a 64-bit 33 MHz host PCI bus]

Page 41: MSCS Clustering Implementation

Mylex PCI RAID’s Two-node Cluster

• Emulates the SCSI shared-bus requirements through the NT miniport driver and RAID firmware
– Treats RAID logical volumes as physical disk drives

– Supports Reserve/Release and other cluster-related SCSI commands in firmware through a volume reservation table

– Honors DPO, FUA, and flush operations in firmware

• RAID configuration, fault management, enclosure management, and volume Reserve/Release are administered by a master/slave mechanism

• Communication between the RAID controllers in the two nodes is established over the back-end SCSI bus -- heartbeats, cluster commands, RAID configuration, and fault management
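
A rough sketch of how such a volume reservation table might look in firmware: one owner slot per logical volume, with Reserve, Release, and Test Unit Ready answered from the table (returning RESERVATION CONFLICT, 0x18, to the non-owning initiator). The names and structure are assumptions for illustration, not Mylex firmware internals.

#include <stdio.h>

#define MAX_VOLUMES 8
#define NO_OWNER    (-1)

/* One entry per RAID logical volume: which initiator ID holds the reservation. */
static int reservation_table[MAX_VOLUMES] = {
    NO_OWNER, NO_OWNER, NO_OWNER, NO_OWNER,
    NO_OWNER, NO_OWNER, NO_OWNER, NO_OWNER,
};

/* Simplified SCSI status codes. */
enum { STATUS_GOOD = 0x00, STATUS_RESERVATION_CONFLICT = 0x18 };

static int volume_reserve(int vol, int initiator)
{
    if (reservation_table[vol] != NO_OWNER && reservation_table[vol] != initiator)
        return STATUS_RESERVATION_CONFLICT;   /* the other node owns it */
    reservation_table[vol] = initiator;
    return STATUS_GOOD;
}

static int volume_release(int vol, int initiator)
{
    if (reservation_table[vol] == initiator)
        reservation_table[vol] = NO_OWNER;
    return STATUS_GOOD;                       /* releasing an unowned volume is a no-op */
}

static int volume_test_unit_ready(int vol, int initiator)
{
    if (reservation_table[vol] != NO_OWNER && reservation_table[vol] != initiator)
        return STATUS_RESERVATION_CONFLICT;
    return STATUS_GOOD;
}

int main(void)
{
    volume_reserve(0, 6);                                             /* node A (initiator 6) claims volume 0 */
    printf("node B TUR     -> 0x%02X\n", volume_test_unit_ready(0, 7)); /* conflict */
    volume_release(0, 6);
    printf("node B reserve -> 0x%02X\n", volume_reserve(0, 7));        /* now succeeds */
    return 0;
}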

Page 42: MSCS Clustering Implementation

Master-Slave Concept

• Master/slave is a controller concept and is transparent to the host system and OS

• Master/slave status is independent of the server cluster-node status

• The first controller to establish itself acts as master; the later one acts as slave

• If one node fails or goes offline, the surviving node becomes master

• Node discovery is initiated by a SCSI Bus Reset and kept alive by heartbeat communication over the back-end shared SCSI bus

[Diagram: Node A (master) and Node B (slave) eXtremeRAID controllers exchange RAID heartbeat and communication over the back-end SCSI buses]
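
The master/slave rule above reduces to a small state machine, sketched below: the first controller to join the back-end bus becomes master, a later arrival becomes slave, and when heartbeats from the partner stop, the survivor promotes itself. The types and function names are illustrative only.

#include <stdio.h>

typedef enum { ROLE_NONE, ROLE_MASTER, ROLE_SLAVE } Role;

typedef struct {
    const char *name;
    Role        role;
} Controller;

static const char *role_name(Role r)
{
    return r == ROLE_MASTER ? "master" : r == ROLE_SLAVE ? "slave" : "none";
}

/* A controller joining the back-end bus: the first one in becomes master,
 * a later arrival becomes slave. */
static void controller_join(Controller *self, const Controller *partner)
{
    self->role = (partner->role == ROLE_MASTER) ? ROLE_SLAVE : ROLE_MASTER;
    printf("%s joins as %s\n", self->name, role_name(self->role));
}

/* The partner stopped answering heartbeats on the back-end SCSI bus:
 * the survivor becomes (or stays) master. */
static void partner_lost(Controller *self)
{
    self->role = ROLE_MASTER;
    printf("%s promoted to %s\n", self->name, role_name(self->role));
}

int main(void)
{
    Controller a = { "controller A", ROLE_NONE };
    Controller b = { "controller B", ROLE_NONE };

    controller_join(&a, &b);   /* A is first: master */
    controller_join(&b, &a);   /* B is later: slave  */
    partner_lost(&b);          /* node A fails: B becomes master */
    return 0;
}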

Page 43: MSCS Clustering Implementation

Master/Slave Perspective

• Only the master manages RAID configuration changes and the fault/rebuild process
– RAID configuration and fault management can be initiated from either node or invoked from DACCF/GAM

– COD updates are done by the master, which tells the slave to update its NVRAM information

– The master manages the rebuild process and can delegate tasks to the slave

• Enclosure management (SAF-TE) is administered by the master

• Logical volume Reserve/Release operations are communicated between master and slave over the back-end shared SCSI bus

Page 44: MSCS Clustering Implementation

Termination Control and Bus Isolation

• In a cluster setup, a server node may be powered on, shut down, or removed for upgrade or maintenance

• Mylex-supplied Terminator Switch Box
– Contains an LVD/SE terminator and fast silicon switches
– When server node power is on
• The terminator is off and the SCSI signal passes through
– When server node power is off or the node is removed
• The terminator is on and the SCSI signal is isolated from the server node

[Diagram: Server Node A and Server Node B, each with a DAC1164P, connect to the disk box through terminator switch boxes]
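
The switch box behavior amounts to one rule, modeled below purely for illustration: node powered means terminator off and signals pass through; node off or removed means terminator on and the node is isolated.

#include <stdbool.h>
#include <stdio.h>

/* Behavioural model of one terminator switch box port. */
typedef struct {
    bool node_powered;     /* is the attached server node powered on? */
} SwitchPort;

static bool terminator_enabled(const SwitchPort *p)
{
    /* Node off or removed: terminate the bus and isolate the node.
     * Node on: terminator off, SCSI signals pass straight through. */
    return !p->node_powered;
}

int main(void)
{
    SwitchPort port = { .node_powered = true };
    printf("node on : terminator %s\n", terminator_enabled(&port) ? "on" : "off");

    port.node_powered = false;          /* node shut down for maintenance */
    printf("node off: terminator %s\n", terminator_enabled(&port) ? "on" : "off");
    return 0;
}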

Page 45: MSCS Clustering Implementation

Mylex’s Clustering Support Elements

• Two-node NT 4.0 clustering only (MSCS)
– FW 5.07C for eXtremeRAID

– BIOS support for the cluster nexus establishment message

– DACCF/BCU modifications for initiator ID and clustering support

– NT miniport driver modifications to support cluster-related SCSI commands

– GAM driver, server, and clients: no changes

[Diagram: software stack -- FW, BIOS, BCU, DACCF, miniport driver, GAM driver, GAM server, and GAM clients over TCP/IP]

Page 46: MSCS Clustering Implementation

Global Array Management (GAM)

• GAM: client/server RAID management tool using the TCP/IP protocol

– Uses a virtual IP to present a single RAID subsystem image (physical IPs can be used to view the two physical nodes if needed)

– Either the master or the slave is viewed, depending on the current cluster group; GAM task requests are communicated over the back-end SCSI bus and administered by the master controller

[Diagram: GAM clients connect over TCP/IP to a virtual IP presenting a single system image; a GAM server on each node talks to the eXtremeRAID controllers on the shared disk interconnect]
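
As a loose illustration of the client/server model, the POSIX C sketch below simply opens a TCP connection to the cluster's virtual IP; whichever node currently holds that address answers, and the request is then handled by the master controller. The address and port number are placeholders, and the real GAM wire protocol is not shown.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder virtual IP and port for the cluster's single system image. */
    const char *virtual_ip = "192.168.1.100";
    const int   gam_port   = 1157;            /* hypothetical port number */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(gam_port);
    inet_pton(AF_INET, virtual_ip, &addr.sin_addr);

    /* Whichever node currently holds the virtual IP answers; the management
     * request is then handled by the master controller over the back-end bus. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0)
        printf("connected to GAM server at %s\n", virtual_ip);
    else
        perror("connect");

    close(fd);
    return 0;
}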

Page 47: MSCS Clustering Implementation

Mylex Clustering Approach

• Same FW, BIOS, driver, and utilities for clustering and non-clustering support

• Support for full-featured Mylex RAID controller functions
– Full RAID configuration through DACCF and GAM

– Hot Swap, Hot spare, RAID Rebuild

– Background consistency check

– Background Initialization

– SAF-TE enclosure management

• MORE -- Mylex Online Capacity Expansion and RAID migration are not supported in a cluster configuration

• Maintains TPC-C world-record performance
– Minimal performance impact from master/slave heartbeat monitoring

– Write-back caching is disabled for cluster data availability and integrity

Page 48: MSCS Clustering Implementation

WHQL Clustering Certification

• Passed Microsoft SDG 1.0 (Server Design Guide); submitted to the WHQL certification queue

• Passed MSCS HCT 8.0 and Clustering Certification Pre-submission test

– MSCS System Validation -- Phases 1-3 tested
• Tested on Intel Madrona, Nightshade, and Sitka based systems
• Test logs to be submitted to Microsoft in early December 1998

[Diagram: two-node test cluster -- eXtremeRAID controllers, HBAs, and NICs on the shared disk and "heartbeat" interconnects, with Cluster Administrator and clients attached]

Page 49: MSCS Clustering Implementation

Mylex Clustering Restrictions

• Only two-node MSCS clustering is supported

• The boot and file system must be on a local drive, separate from the shared bus -- per MSCS requirements

• The shared bus includes all SCSI channels on both controllers; all shared devices must be on the same channel on the two clustered controllers

• Only SCSI hard disks and SAF-TE devices are allowed on the shared bus.

• Write-back caching is disabled

• MORE is not supported

• SCSI devices must support multiple initiators, SCSI bus reset, and bus device reset

Page 50: MSCS Clustering Implementation

Mylex: Recommended Installation

• Set up the controller initiator ID and enable cluster support for each node through DACCF while the two nodes are still separate

• Disable the RAID controller BIOS on both nodes, since the RAID controller is not controlling the boot device

• Run the RAID configuration, using DACCF, on one node

• Connect the two nodes together using the Mylex terminator switch box and cabling

• Ready to go -- just follow the Microsoft Cluster Server Administrator's Guide for the clustering installation

Page 51: MSCS Clustering Implementation

Mylex’s Installation Tips

• Disable termination on all of the drives and the drive box.

• Be sure there are no SCSI ID conflicts with the drives and SAF-TE processors.

• Use LVD (Low Voltage Differential) over SE (Single Ended) drives and enclosures because of SE cable length restrictions.

• If using SE, repeaters are suggested.

• For optimum performance, create two packs -- one pack per controller.

• Do not create multiple partitions on a shared drive; MSCS can only fail over a physical drive.

• MSCS only supports NTFS partitions.

• Failback needs to be set manually within MSCS; otherwise, the server that loads the MSCS services first will get all of the resources.