Scalable Clusters
Jed Liu
11 April 2002
Overview
  Microsoft Cluster Service
    Built on Windows NT
    Provides high-availability services
    Presents itself to clients as a single system
  Frangipani
    A scalable distributed file system
Microsoft Cluster Service
Design goals:
  Cluster composed of COTS components
  Scalability – able to add components without interrupting services
  Transparency – clients see the cluster as a single machine
  Reliability – when a node fails, its services can be restarted on a different node
Cluster Abstractions
  Nodes
  Resources
    e.g., logical disk volumes, NetBIOS names, SMB shares, mail service, SQL service
  Quorum resource
    Implements persistent storage for the cluster configuration database and change log
  Resource dependencies
    Tracks dependencies between resources
Cluster Abstractions (cont’d)
  Resource groups
    The unit of migration: resources in the same group are hosted on the same node
  Cluster database
    Configuration data for starting the cluster is kept in a database, accessed through the Windows registry
    The database is replicated at each node in the cluster
Node Failure
  Active members broadcast periodic heartbeat messages
  Failure suspicion occurs when a node misses two successive heartbeat messages from some other node (sketched below)
  The regroup algorithm is then initiated to determine new membership information
  Resources that were online at a failed member are brought online at active nodes
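
A minimal sketch of this failure-detection rule in Python – the two-missed-heartbeats threshold is from the slide; the heartbeat period and data structures are illustrative assumptions:

    import time

    HEARTBEAT_PERIOD = 1.0   # seconds between heartbeats (illustrative value)
    MISS_THRESHOLD = 2       # suspect failure after two missed heartbeats

    last_heard = {}          # node id -> time of last heartbeat received

    def on_heartbeat(node_id):
        """Record a heartbeat broadcast by another active member."""
        last_heard[node_id] = time.time()

    def suspected_failures(now=None):
        """Return nodes that have missed two successive heartbeats."""
        now = time.time() if now is None else now
        return [n for n, t in last_heard.items()
                if now - t > MISS_THRESHOLD * HEARTBEAT_PERIOD]

    # On each clock tick a member would call suspected_failures() and start
    # the regroup algorithm if the returned list is non-empty.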
Member Regroup Algorithm
  Lockstep algorithm
    Activate: each node waits for a clock tick, then starts sending and collecting status messages
    Closing: determine whether partitions exist and whether the current node is in a partition that should survive
    Pruning: prune the surviving group so that all nodes are fully connected
Regroup Algorithm (cont’d)
  Cleanup: surviving nodes update local membership information as appropriate
  Stabilized: done
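
The lockstep structure can be pictured as a small state machine that narrows the membership as it advances; the Python sketch below shows only the phase sequence (the actual message exchanges and the survival rule are omitted):

    # Phases are entered in lockstep by every node; each advance keeps only
    # the members that survive that phase. This is only a skeleton.
    PHASES = ["activate", "closing", "pruning", "cleanup", "stabilized"]

    class RegroupState:
        def __init__(self, members):
            self.members = set(members)
            self.phase = PHASES[0]

        def advance(self, surviving_members):
            """Enter the next phase, keeping only the surviving members."""
            self.members &= set(surviving_members)
            self.phase = PHASES[PHASES.index(self.phase) + 1]

    state = RegroupState({"node1", "node2", "node3"})
    state.advance({"node1", "node2", "node3"})  # closing: pick the surviving partition
    state.advance({"node1", "node2"})           # pruning: keep fully connected nodes
    state.advance({"node1", "node2"})           # cleanup: install new membership
    state.advance({"node1", "node2"})           # stabilized: done
    print(state.phase, state.members)           # stabilized {'node1', 'node2'}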
Joining a Cluster
  Sponsor authenticates the joining node
    Denies access if the applicant isn’t authorized to join
  Sponsor sends version info for the configuration database
    Also sends updates as needed, if changes were made while the applicant was offline
  Sponsor atomically broadcasts information about the applicant to all other members
  Active members update local membership information
Forming a Cluster
  Use the local registry to find the address of the quorum resource
  Acquire ownership of the quorum resource
    An arbitration protocol ensures that at most one node owns the quorum resource (see the sketch below)
  Synchronize the local cluster database with the master copy
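
A toy sketch of the forming sequence; the arbitration here is simulated with a single in-memory owner field, whereas a real cluster arbitrates on the disk itself, and all names are illustrative:

    class QuorumResource:
        """Toy stand-in for the quorum disk: at most one owner at a time."""
        def __init__(self, master_database):
            self.owner = None
            self.master_database = master_database

        def arbitrate(self, node):
            # A real cluster runs a disk arbitration protocol here; this toy
            # version just grants ownership to the first claimant.
            if self.owner is None:
                self.owner = node
            return self.owner == node

    def form_cluster(node, local_registry, quorum_resources):
        # 1. The local registry says where the quorum resource lives.
        quorum = quorum_resources[local_registry["quorum_address"]]
        # 2. Arbitrate for ownership; only one node can win.
        if not quorum.arbitrate(node):
            return None            # lost arbitration: join the existing cluster instead
        # 3. The winner synchronizes its cluster database with the master copy.
        return dict(quorum.master_database)

    disks = {"disk0": QuorumResource({"config_version": 7})}
    print(form_cluster("nodeA", {"quorum_address": "disk0"}, disks))  # wins, gets the DB
    print(form_cluster("nodeB", {"quorum_address": "disk0"}, disks))  # loses -> None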
Leaving a Cluster
  Member sends an exit message to all other cluster members and shuts down immediately
  Active members gossip about the exiting member and update their cluster databases
Node States
  Inactive nodes are offline
  Active members are either online or paused
  All active nodes participate in cluster database updates, vote in the quorum algorithm, and maintain heartbeats
  Only online nodes can take ownership of resource groups
Resource Management
  Achieved by invoking calls through a resource control library (implemented as a DLL)
  Through this library, MSCS can monitor the state of the resource
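
The monitoring idea can be sketched as below; the entry points (online, offline, looks_alive, is_alive) imitate the style of a resource control DLL but are illustrative assumptions, not the actual MSCS Resource API:

    class ResourceLibrary:
        """Illustrative stand-in for a resource control DLL."""
        def __init__(self):
            self.state = "offline"

        def online(self):
            self.state = "online"

        def offline(self):
            self.state = "offline"

        def looks_alive(self):
            # Cheap, frequent liveness check.
            return self.state == "online"

        def is_alive(self):
            # Thorough, less frequent liveness check.
            return self.state == "online"

    def poll_resource(lib):
        """What MSCS does through the library: watch the resource's state."""
        if not (lib.looks_alive() and lib.is_alive()):
            return "failed"        # would trigger restart or migration of the group
        return "healthy"

    r = ResourceLibrary()
    r.online()
    print(poll_resource(r))        # healthy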
Resource Migration
  Reasons for migration:
    Node failure
    Resource failure
    Resource group prefers to execute at a different node
    Operator-requested migration
  In the first case, the resource group is pulled to the new node
  In all other cases, the resource group is pushed
Pushing a Resource Group
  All resources in the old node are brought offline
  The old host node chooses a new host
  The local copy of MSCS at the new host brings up the resource group
Pulling a Resource Group
  Active nodes capable of hosting the group determine amongst themselves the new host for the group
    The new host is chosen based on attributes that are stored in the cluster database
    Since the database is replicated at all nodes, the decision can be made without any communication! (sketched below)
  The new host brings the resource group online
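
The communication-free decision works because every active node evaluates the same deterministic rule over the same replicated database; a sketch with a made-up preference-list attribute:

    def choose_new_host(group, candidates, cluster_db):
        """Every active node runs this on its own replica of the cluster
        database and arrives at the same answer, so no messages are needed.
        The preferred_hosts attribute is illustrative."""
        for host in cluster_db["groups"][group]["preferred_hosts"]:
            if host in candidates:
                return host
        return min(candidates)      # deterministic tie-breaker

    db = {"groups": {"sql": {"preferred_hosts": ["node2", "node1", "node3"]}}}
    # node2 (the failed host) is gone, so the capable candidates are node1 and node3:
    print(choose_new_host("sql", {"node1", "node3"}, db))   # node1, on every node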
Client Access to Resources
  Normally, clients access SMB resources using names of the form \\node\service
    This presents a problem – as resources migrate between nodes, the resource name would change
  With MSCS, whenever a resource migrates, the resource’s network name also migrates as part of its resource group
  Clients see only services and their network names – the cluster becomes a single virtual node
Membership Manager
  Maintains consensus among active nodes about which nodes are active and which are merely defined (configured as members but not running)
  A join mechanism admits new members into the cluster
  A regroup mechanism determines current membership on startup or suspected failure
Global Update Manager
  Used to implement atomic broadcast
  A single node in the cluster is always designated as the locker
  The locker node takes over the atomic broadcast in case the original sender fails mid-broadcast
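
A sketch of the locker idea: the sender installs the update at the locker first, so if the sender dies mid-broadcast the locker can finish delivery; the structure and names are illustrative:

    class Member:
        def __init__(self, name):
            self.name, self.log = name, []
        def apply(self, update):
            self.log.append(update)

    class Locker:
        """Designated node that completes a broadcast if the sender fails."""
        def __init__(self, members):
            self.members, self.pending = members, None
        def begin(self, update):
            self.pending = update                 # locker sees the update first
        def finish(self, delivered_to):
            for m in self.members - delivered_to: # re-deliver to anyone missed
                m.apply(self.pending)
            self.pending = None

    nodes = {Member("a"), Member("b"), Member("c")}
    locker = Locker(nodes)

    def global_update(update, sender_fails_after=None):
        locker.begin(update)                      # step 1: tell the locker
        delivered = set()
        for m in nodes:                           # step 2: deliver to the others
            if sender_fails_after is not None and len(delivered) == sender_fails_after:
                break                             # simulate a sender crash mid-broadcast
            m.apply(update)
            delivered.add(m)
        locker.finish(delivered)                  # step 3: locker completes delivery

    global_update("config-change", sender_fails_after=1)
    print([m.log for m in nodes])                 # each member applied it exactly once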
Frangipani
Design goals:
  Provide users with coherent, shared access to files
  Arbitrarily scalable to provide more storage and higher performance
  Highly available in spite of component failures
  Minimal human administration
  Full and consistent backups can be made of the entire file system without bringing it down
  Complexity of administration stays constant despite the addition of components
Server Layering
  [Layering diagram: user programs run on top of Frangipani file servers, which sit on the Petal distributed virtual disk service and the distributed lock service, which in turn use the physical disks]
Assumptions
  Frangipani servers trust:
    One another
    The Petal servers
    The lock service
  Meant to run in a cluster of machines that are under a common administration and can communicate securely
System Structure
  Frangipani is implemented as a file system option in the OS kernel
  All file servers read and write the same file system data structures on the shared Petal disk
  Each file server keeps a redo log in Petal so that when it fails, another server can access the log and recover
  [Structure diagram: on each client machine, user programs go through the file system switch to the Frangipani file server module, which uses the Petal device driver; over the network, Petal servers and lock servers together provide the shared Petal virtual disk]
Security Considerations
  Any Frangipani machine can access and modify any block of the Petal virtual disk
    Frangipani must run only on machines with trusted OSes
  Petal servers and lock servers should also run on trusted OSes
  All three types of components should authenticate one another
  Network security is also important: eavesdropping should be prevented
Disk Layout
  2^64 bytes of addressable disk space, partitioned into regions (sketched below):
    Shared configuration parameters
    Logs – each server owns a part of this region to hold its private log
    Allocation bitmaps – each server owns parts of this region for its exclusive use
    Inodes, small data blocks, large data blocks
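
The partitioning can be pictured as an ordered list of regions within the 2^64-byte address space; the region sizes below are placeholders chosen for the sketch, not the actual offsets used by Frangipani:

    TB = 2 ** 40

    # (region name, size) in layout order from address 0; sizes are made up.
    REGIONS = [
        ("config_params",   1 * TB),
        ("per_server_logs", 1 * TB),            # each server owns a slice
        ("alloc_bitmaps",   3 * TB),            # each server owns slices for its own use
        ("inodes",          1 * TB),
        ("small_blocks",   16 * TB),
        ("large_blocks",    2 ** 64 - 22 * TB), # remainder of the address space
    ]

    def region_bounds():
        """Yield (name, start, end) addresses computed from the ordered sizes."""
        start = 0
        for name, size in REGIONS:
            yield name, start, start + size
            start += size

    for name, lo, hi in region_bounds():
        print(f"{name:16s} {lo:#018x} .. {hi:#018x}")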
Logging and Recovery
  Only changes to metadata are logged – user data is not logged
  Uses write-ahead redo logging
  The log is implemented as a circular buffer
    When the log fills, the oldest ¼ of the buffer is reclaimed
  Need to be able to find the end of the log
    A monotonically increasing sequence number is added to each block of the log (see the sketch below)
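
A sketch of how the sequence numbers locate the end of the circular log: scan for the point where the numbers stop increasing (log blocks are simplified here to (sequence, payload) pairs):

    def log_end(blocks):
        """blocks: the circular log as a list of (sequence_number, payload).
        The most recently written block is the one just before the position
        where the sequence numbers stop increasing."""
        for i in range(len(blocks)):
            here = blocks[i][0]
            after = blocks[(i + 1) % len(blocks)][0]
            if after < here:            # sequence drops -> wrap point found
                return i
        return len(blocks) - 1          # log never wrapped: last block is the end

    # Blocks 5..10 were written in order, wrapping over the two oldest slots:
    circular = [(9, "i"), (10, "j"), (5, "e"), (6, "f"), (7, "g"), (8, "h")]
    print(log_end(circular))            # 1 -> the block with sequence 10 is the end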
Concurrency Considerations
  Need to ensure logging and recovery work in the presence of multiple logs
  Updates requested to the same data by different servers are serialized
  Recovery applies a change only if it was logged under a lock that was active at the time of failure
  To ensure this, never replay an update that has already been completed
    Keep a version number on each metadata block (sketched below)
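
A sketch of the never-replay-a-completed-update rule: each log record carries the version it produces, and recovery applies it only if the on-disk metadata block is still older (the record and block formats are simplified):

    def recover(log_records, metadata_blocks):
        """log_records: list of {"block", "version", "new_value"} dicts.
        metadata_blocks: dict mapping block name -> {"version", "value"}."""
        for rec in log_records:
            blk = metadata_blocks[rec["block"]]
            # Replay only if the block has not already reached this version,
            # i.e. the logged update was not completed before the crash.
            if blk["version"] < rec["version"]:
                blk["value"] = rec["new_value"]
                blk["version"] = rec["version"]

    blocks = {"inode7": {"version": 3, "value": "old"}}
    recover([{"block": "inode7", "version": 3, "new_value": "duplicate"},  # already done
             {"block": "inode7", "version": 4, "new_value": "new"}],       # still pending
            blocks)
    print(blocks)   # {'inode7': {'version': 4, 'value': 'new'}}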
Concurrency Considerations (cont’d)
  Ensure that only one recovery daemon is replaying the log of a given server
    Done by taking an exclusive lock on the log
Cache Coherence
  When the lock service detects conflicting lock requests, the current lock holder is asked to release or downgrade its lock
  The lock service uses read locks and write locks
    When a read lock is released, the corresponding cache entry must be invalidated
    When a write lock is downgraded, dirty data must be written to disk
    Releasing a write lock = downgrading to a read lock, then releasing it (sketched below)
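
A sketch of what a Frangipani server might do when the lock service asks it to give up a lock; the cache and disk here are plain dictionaries standing in for the real structures:

    def on_lock_callback(lock, action, cache, disk):
        """cache: dict name -> {"data": ..., "dirty": bool}; disk: dict name -> data."""
        entry = cache.get(lock)
        if action == "downgrade":             # write lock -> read lock
            if entry and entry["dirty"]:
                disk[lock] = entry["data"]    # flush dirty data, keep a clean cached copy
                entry["dirty"] = False
        elif action == "release":             # giving up a read lock
            cache.pop(lock, None)             # the cached copy may go stale: invalidate
        elif action == "release_write":       # = downgrade (flush), then release (invalidate)
            on_lock_callback(lock, "downgrade", cache, disk)
            on_lock_callback(lock, "release", cache, disk)

    cache = {"inode7": {"data": "new contents", "dirty": True}}
    disk = {}
    on_lock_callback("inode7", "release_write", cache, disk)
    print(disk, cache)    # {'inode7': 'new contents'} {}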
Synchronization
  The division of on-disk data structures into lockable segments is designed to avoid lock contention
    Each log is lockable
    The bitmap space is divided into lockable units
    An unallocated inode or data block is protected by the lock on the corresponding piece of the bitmap space
    A single lock protects an inode and any file data that it points to
Locking Service
  Locks are sticky – they’re retained until someone else needs them
  Client failure is dealt with by using leases (sketched below)
  Network failures can prevent a Frangipani server from renewing its lease
    The server then discards all locks and all cached data
    If there was dirty data in the cache, Frangipani throws errors until the file system is unmounted
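
A sketch of the lease discipline on the Frangipani side: renewal pushes the expiry time forward, and once the lease has lapsed the server drops its locks and cache (the lease term and structures are illustrative):

    import time

    LEASE_TERM = 30.0            # seconds; illustrative value

    class LockClient:
        def __init__(self):
            self.lease_expires = time.time() + LEASE_TERM
            self.locks, self.cache = set(), {}

        def renew(self, network_ok):
            """Called periodically; a network failure means no renewal."""
            if network_ok:
                self.lease_expires = time.time() + LEASE_TERM

        def check(self):
            if time.time() > self.lease_expires:
                had_dirty = any(e.get("dirty") for e in self.cache.values())
                self.locks.clear()       # all locks are considered lost
                self.cache.clear()       # all cached data is discarded
                if had_dirty:
                    raise IOError("dirty data lost; errors until the file system is unmounted")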
Locking Service Hole
  If a Frangipani server’s lease expires due to a temporary network outage, it might still try to access Petal
  The problem is basically caused by a lack of clock synchronization
  Can be fixed without synchronized clocks by including a lease identifier with every Petal request (sketched below)
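
A sketch of the fix: the Frangipani server tags every Petal request with its current lease identifier, and Petal rejects requests that carry a stale one, so no synchronized clocks are needed (the request format here is an assumption):

    class PetalServer:
        def __init__(self):
            self.current_lease = {}     # Frangipani server -> currently valid lease id
            self.blocks = {}

        def grant_lease(self, server, lease_id):
            self.current_lease[server] = lease_id   # any older lease id becomes stale

        def write(self, server, lease_id, block, data):
            # Reject requests made under an expired (superseded) lease.
            if self.current_lease.get(server) != lease_id:
                raise PermissionError("stale lease: request rejected")
            self.blocks[block] = data

    petal = PetalServer()
    petal.grant_lease("fs1", lease_id=41)
    petal.grant_lease("fs1", lease_id=42)             # lease 41 is now stale
    petal.write("fs1", 42, "block9", b"ok")           # accepted
    # petal.write("fs1", 41, "block9", b"??")         # would raise PermissionError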
Adding and Removing Servers
  Adding a server is easy!
    Just point it at a Petal virtual disk and a lock service, and it automagically gets integrated
  Removing a server is even easier!
    Just take a sledgehammer to it
    Alternatively, if you want to be nicer, you can flush dirty data before using the sledgehammer
Backups
  Just use the snapshot feature built into Petal to do backups
  The resulting snapshot is crash-consistent: it reflects a state reachable if all Frangipani servers were to crash
  This is good enough – if you restore the backup, the recovery mechanism can handle the rest
Summary
  Microsoft Cluster Service
    Aims to provide reliable services running on a cluster
    Presents itself as a virtual node to its clients
  Frangipani
    Aims to provide a reliable distributed file system
    Uses metadata logging to recover from crashes
    Clients see it as a regular shared disk
    Adding and removing nodes is really easy