Scalable Clusters
Jed Liu
11 April 2002
Overview
  Microsoft Cluster Service
    Built on Windows NT
    Provides high-availability services
    Presents itself to clients as a single system
  Frangipani
    A scalable distributed file system
Microsoft Cluster Service
Design goals:
  Cluster composed of COTS components
  Scalability – able to add components without interrupting services
  Transparency – clients see the cluster as a single machine
  Reliability – when a node fails, its services can be restarted on a different node
Cluster Abstractions
  Nodes
  Resources
    e.g., logical disk volumes, NetBIOS names, SMB shares, mail service, SQL service
  Quorum resource
    Implements persistent storage for the cluster configuration database and change log
  Resource dependencies
    Tracks dependencies between resources
Cluster Abstractions (cont’d)
  Resource groups
    The unit of migration: resources in the same group are hosted on the same node
  Cluster database
    Configuration data for starting the cluster is kept in a database, accessed through the Windows registry
    The database is replicated at each node in the cluster
Node Failure
  Active members broadcast periodic heartbeat messages
  Failure suspicion occurs when a node misses two successive heartbeat messages from some other node (sketched below)
  The regroup algorithm is then initiated to determine new membership information
  Resources that were online at a failed member are brought online at active nodes
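
A minimal sketch of this failure-detection rule in Python – the two-missed-heartbeats threshold is from the slide; the heartbeat period and data structures are illustrative assumptions:

    import time

    HEARTBEAT_PERIOD = 1.0   # seconds between heartbeats (illustrative value)
    MISS_THRESHOLD = 2       # suspect failure after two missed heartbeats

    last_heard = {}          # node id -> time of last heartbeat received

    def on_heartbeat(node_id):
        """Record a heartbeat broadcast by another active member."""
        last_heard[node_id] = time.time()

    def suspected_failures(now=None):
        """Return nodes that have missed two successive heartbeats."""
        now = time.time() if now is None else now
        return [n for n, t in last_heard.items()
                if now - t > MISS_THRESHOLD * HEARTBEAT_PERIOD]

    # On each clock tick a member would call suspected_failures() and start
    # the regroup algorithm if the returned list is non-empty.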
Member Regroup Algorithm
  Lockstep algorithm
    Activate: each node waits for a clock tick, then starts sending and collecting status messages
    Closing: determine whether partitions exist and whether the current node is in a partition that should survive
    Pruning: prune the surviving group so that all nodes are fully connected
Regroup Algorithm (cont’d)
  Cleanup: surviving nodes update local membership information as appropriate
  Stabilized: done
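
The lockstep structure can be pictured as a small state machine that narrows the membership as it advances; the Python sketch below shows only the phase sequence (the actual message exchanges and the survival rule are omitted):

    # Phases are entered in lockstep by every node; each advance keeps only
    # the members that survive that phase. This is only a skeleton.
    PHASES = ["activate", "closing", "pruning", "cleanup", "stabilized"]

    class RegroupState:
        def __init__(self, members):
            self.members = set(members)
            self.phase = PHASES[0]

        def advance(self, surviving_members):
            """Enter the next phase, keeping only the surviving members."""
            self.members &= set(surviving_members)
            self.phase = PHASES[PHASES.index(self.phase) + 1]

    state = RegroupState({"node1", "node2", "node3"})
    state.advance({"node1", "node2", "node3"})  # closing: pick the surviving partition
    state.advance({"node1", "node2"})           # pruning: keep fully connected nodes
    state.advance({"node1", "node2"})           # cleanup: install new membership
    state.advance({"node1", "node2"})           # stabilized: done
    print(state.phase, state.members)           # stabilized {'node1', 'node2'}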
Joining a Cluster
  Sponsor authenticates the joining node
    Denies access if the applicant isn’t authorized to join
  Sponsor sends version info for the configuration database
    Also sends updates as needed, if changes were made while the applicant was offline
  Sponsor atomically broadcasts information about the applicant to all other members
  Active members update local membership information
Forming a Cluster
  Use the local registry to find the address of the quorum resource
  Acquire ownership of the quorum resource
    An arbitration protocol ensures that at most one node owns the quorum resource (see the sketch below)
  Synchronize the local cluster database with the master copy
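
A toy sketch of the forming sequence; the arbitration here is simulated with a single in-memory owner field, whereas a real cluster arbitrates on the disk itself, and all names are illustrative:

    class QuorumResource:
        """Toy stand-in for the quorum disk: at most one owner at a time."""
        def __init__(self, master_database):
            self.owner = None
            self.master_database = master_database

        def arbitrate(self, node):
            # A real cluster runs a disk arbitration protocol here; this toy
            # version just grants ownership to the first claimant.
            if self.owner is None:
                self.owner = node
            return self.owner == node

    def form_cluster(node, local_registry, quorum_resources):
        # 1. The local registry says where the quorum resource lives.
        quorum = quorum_resources[local_registry["quorum_address"]]
        # 2. Arbitrate for ownership; only one node can win.
        if not quorum.arbitrate(node):
            return None            # lost arbitration: join the existing cluster instead
        # 3. The winner synchronizes its cluster database with the master copy.
        return dict(quorum.master_database)

    disks = {"disk0": QuorumResource({"config_version": 7})}
    print(form_cluster("nodeA", {"quorum_address": "disk0"}, disks))  # wins, gets the DB
    print(form_cluster("nodeB", {"quorum_address": "disk0"}, disks))  # loses -> None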
Leaving a Cluster
  Member sends an exit message to all other cluster members and shuts down immediately
  Active members gossip about the exiting member and update their cluster databases
Node States
  Inactive nodes are offline
  Active members are either online or paused
  All active nodes participate in cluster database updates, vote in the quorum algorithm, and maintain heartbeats
  Only online nodes can take ownership of resource groups
Resource Management
  Achieved by invoking calls through a resource control library (implemented as a DLL)
  Through this library, MSCS can monitor the state of the resource
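
The monitoring idea can be sketched as below; the entry points (online, offline, looks_alive, is_alive) imitate the style of a resource control DLL but are illustrative assumptions, not the actual MSCS Resource API:

    class ResourceLibrary:
        """Illustrative stand-in for a resource control DLL."""
        def __init__(self):
            self.state = "offline"

        def online(self):
            self.state = "online"

        def offline(self):
            self.state = "offline"

        def looks_alive(self):
            # Cheap, frequent liveness check.
            return self.state == "online"

        def is_alive(self):
            # Thorough, less frequent liveness check.
            return self.state == "online"

    def poll_resource(lib):
        """What MSCS does through the library: watch the resource's state."""
        if not (lib.looks_alive() and lib.is_alive()):
            return "failed"        # would trigger restart or migration of the group
        return "healthy"

    r = ResourceLibrary()
    r.online()
    print(poll_resource(r))        # healthy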
Resource Migration
  Reasons for migration:
    Node failure
    Resource failure
    Resource group prefers to execute at a different node
    Operator-requested migration
  In the first case, the resource group is pulled to the new node
  In all other cases, the resource group is pushed
Pushing a Resource Group
  All resources in the old node are brought offline
  The old host node chooses a new host
  The local copy of MSCS at the new host brings up the resource group
Pulling a Resource Group
  Active nodes capable of hosting the group determine amongst themselves the new host for the group
    The new host is chosen based on attributes that are stored in the cluster database
    Since the database is replicated at all nodes, the decision can be made without any communication! (sketched below)
  The new host brings the resource group online
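
The communication-free decision works because every active node evaluates the same deterministic rule over the same replicated database; a sketch with a made-up preference-list attribute:

    def choose_new_host(group, candidates, cluster_db):
        """Every active node runs this on its own replica of the cluster
        database and arrives at the same answer, so no messages are needed.
        The preferred_hosts attribute is illustrative."""
        for host in cluster_db["groups"][group]["preferred_hosts"]:
            if host in candidates:
                return host
        return min(candidates)      # deterministic tie-breaker

    db = {"groups": {"sql": {"preferred_hosts": ["node2", "node1", "node3"]}}}
    # node2 (the failed host) is gone, so the capable candidates are node1 and node3:
    print(choose_new_host("sql", {"node1", "node3"}, db))   # node1, on every node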
Client Access to Resources
  Normally, clients access SMB resources using names of the form \\node\service
    This presents a problem – as resources migrate between nodes, the resource name would change
  With MSCS, whenever a resource migrates, the resource’s network name also migrates as part of its resource group
  Clients see only services and their network names – the cluster becomes a single virtual node
Membership Manager
  Maintains consensus among active nodes about which nodes are active and which are merely defined (configured as members but not running)
  A join mechanism admits new members into the cluster
  A regroup mechanism determines current membership on startup or suspected failure
Global Update Manager
  Used to implement atomic broadcast
  A single node in the cluster is always designated as the locker
  The locker node takes over the atomic broadcast in case the original sender fails mid-broadcast
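
A sketch of the locker idea: the sender installs the update at the locker first, so if the sender dies mid-broadcast the locker can finish delivery; the structure and names are illustrative:

    class Member:
        def __init__(self, name):
            self.name, self.log = name, []
        def apply(self, update):
            self.log.append(update)

    class Locker:
        """Designated node that completes a broadcast if the sender fails."""
        def __init__(self, members):
            self.members, self.pending = members, None
        def begin(self, update):
            self.pending = update                 # locker sees the update first
        def finish(self, delivered_to):
            for m in self.members - delivered_to: # re-deliver to anyone missed
                m.apply(self.pending)
            self.pending = None

    nodes = {Member("a"), Member("b"), Member("c")}
    locker = Locker(nodes)

    def global_update(update, sender_fails_after=None):
        locker.begin(update)                      # step 1: tell the locker
        delivered = set()
        for m in nodes:                           # step 2: deliver to the others
            if sender_fails_after is not None and len(delivered) == sender_fails_after:
                break                             # simulate a sender crash mid-broadcast
            m.apply(update)
            delivered.add(m)
        locker.finish(delivered)                  # step 3: locker completes delivery

    global_update("config-change", sender_fails_after=1)
    print([m.log for m in nodes])                 # each member applied it exactly once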
Frangipani
Design goals:
  Provide users with coherent, shared access to files
  Arbitrarily scalable to provide more storage and higher performance
  Highly available in spite of component failures
  Minimal human administration
  Full and consistent backups can be made of the entire file system without bringing it down
  Complexity of administration stays constant despite the addition of components
Server Layering
  [Layering diagram: user programs run on top of Frangipani file servers, which sit on the Petal distributed virtual disk service and the distributed lock service, which in turn use the physical disks]
Assumptions
  Frangipani servers trust:
    One another
    The Petal servers
    The lock service
  Meant to run in a cluster of machines that are under a common administration and can communicate securely
System Structure
  Frangipani is implemented as a file system option in the OS kernel
  All file servers read and write the same file system data structures on the shared Petal disk
  Each file server keeps a redo log in Petal so that when it fails, another server can access the log and recover
  [Structure diagram: on each client machine, user programs go through the file system switch to the Frangipani file server module, which uses the Petal device driver; over the network, Petal servers and lock servers together provide the shared Petal virtual disk]
Security Considerations
  Any Frangipani machine can access and modify any block of the Petal virtual disk
    Frangipani must run only on machines with trusted OSes
  Petal servers and lock servers should also run on trusted OSes
  All three types of components should authenticate one another
  Network security is also important: eavesdropping should be prevented
Disk Layout
  2^64 bytes of addressable disk space, partitioned into regions (sketched below):
    Shared configuration parameters
    Logs – each server owns a part of this region to hold its private log
    Allocation bitmaps – each server owns parts of this region for its exclusive use
    Inodes, small data blocks, large data blocks
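
The partitioning can be pictured as an ordered list of regions within the 2^64-byte address space; the region sizes below are placeholders chosen for the sketch, not the actual offsets used by Frangipani:

    TB = 2 ** 40

    # (region name, size) in layout order from address 0; sizes are made up.
    REGIONS = [
        ("config_params",   1 * TB),
        ("per_server_logs", 1 * TB),            # each server owns a slice
        ("alloc_bitmaps",   3 * TB),            # each server owns slices for its own use
        ("inodes",          1 * TB),
        ("small_blocks",   16 * TB),
        ("large_blocks",    2 ** 64 - 22 * TB), # remainder of the address space
    ]

    def region_bounds():
        """Yield (name, start, end) addresses computed from the ordered sizes."""
        start = 0
        for name, size in REGIONS:
            yield name, start, start + size
            start += size

    for name, lo, hi in region_bounds():
        print(f"{name:16s} {lo:#018x} .. {hi:#018x}")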
Logging and Recovery
  Only changes to metadata are logged – user data is not logged
  Uses write-ahead redo logging
  The log is implemented as a circular buffer
    When the log fills, the oldest ¼ of the buffer is reclaimed
  Need to be able to find the end of the log
    A monotonically increasing sequence number is added to each block of the log (see the sketch below)
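
A sketch of how the sequence numbers locate the end of the circular log: scan for the point where the numbers stop increasing (log blocks are simplified here to (sequence, payload) pairs):

    def log_end(blocks):
        """blocks: the circular log as a list of (sequence_number, payload).
        The most recently written block is the one just before the position
        where the sequence numbers stop increasing."""
        for i in range(len(blocks)):
            here = blocks[i][0]
            after = blocks[(i + 1) % len(blocks)][0]
            if after < here:            # sequence drops -> wrap point found
                return i
        return len(blocks) - 1          # log never wrapped: last block is the end

    # Blocks 5..10 were written in order, wrapping over the two oldest slots:
    circular = [(9, "i"), (10, "j"), (5, "e"), (6, "f"), (7, "g"), (8, "h")]
    print(log_end(circular))            # 1 -> the block with sequence 10 is the end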
Concurrency Considerations
  Need to ensure logging and recovery work in the presence of multiple logs
  Updates requested to the same data by different servers are serialized
  Recovery applies a change only if it was logged under a lock that was active at the time of failure
  To ensure this, never replay an update that has already been completed
    Keep a version number on each metadata block (sketched below)
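
A sketch of the never-replay-a-completed-update rule: each log record carries the version it produces, and recovery applies it only if the on-disk metadata block is still older (the record and block formats are simplified):

    def recover(log_records, metadata_blocks):
        """log_records: list of {"block", "version", "new_value"} dicts.
        metadata_blocks: dict mapping block name -> {"version", "value"}."""
        for rec in log_records:
            blk = metadata_blocks[rec["block"]]
            # Replay only if the block has not already reached this version,
            # i.e. the logged update was not completed before the crash.
            if blk["version"] < rec["version"]:
                blk["value"] = rec["new_value"]
                blk["version"] = rec["version"]

    blocks = {"inode7": {"version": 3, "value": "old"}}
    recover([{"block": "inode7", "version": 3, "new_value": "duplicate"},  # already done
             {"block": "inode7", "version": 4, "new_value": "new"}],       # still pending
            blocks)
    print(blocks)   # {'inode7': {'version': 4, 'value': 'new'}}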
Concurrency Considerations (cont’d)
  Ensure that only one recovery daemon is replaying the log of a given server
    Done by taking an exclusive lock on the log
Cache Coherence
  When the lock service detects conflicting lock requests, the current lock holder is asked to release or downgrade its lock
  The lock service uses read locks and write locks
    When a read lock is released, the corresponding cache entry must be invalidated
    When a write lock is downgraded, dirty data must be written to disk
    Releasing a write lock = downgrading to a read lock, then releasing it (sketched below)
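
A sketch of what a Frangipani server might do when the lock service asks it to give up a lock; the cache and disk here are plain dictionaries standing in for the real structures:

    def on_lock_callback(lock, action, cache, disk):
        """cache: dict name -> {"data": ..., "dirty": bool}; disk: dict name -> data."""
        entry = cache.get(lock)
        if action == "downgrade":             # write lock -> read lock
            if entry and entry["dirty"]:
                disk[lock] = entry["data"]    # flush dirty data, keep a clean cached copy
                entry["dirty"] = False
        elif action == "release":             # giving up a read lock
            cache.pop(lock, None)             # the cached copy may go stale: invalidate
        elif action == "release_write":       # = downgrade (flush), then release (invalidate)
            on_lock_callback(lock, "downgrade", cache, disk)
            on_lock_callback(lock, "release", cache, disk)

    cache = {"inode7": {"data": "new contents", "dirty": True}}
    disk = {}
    on_lock_callback("inode7", "release_write", cache, disk)
    print(disk, cache)    # {'inode7': 'new contents'} {}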
Synchronization
  The division of on-disk data structures into lockable segments is designed to avoid lock contention
    Each log is lockable
    The bitmap space is divided into lockable units
    An unallocated inode or data block is protected by the lock on the corresponding piece of the bitmap space
    A single lock protects an inode and any file data that it points to
Locking Service
  Locks are sticky – they’re retained until someone else needs them
  Client failure is dealt with by using leases (sketched below)
  Network failures can prevent a Frangipani server from renewing its lease
    The server then discards all locks and all cached data
    If there was dirty data in the cache, Frangipani throws errors until the file system is unmounted
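
A sketch of the lease discipline on the Frangipani side: renewal pushes the expiry time forward, and once the lease has lapsed the server drops its locks and cache (the lease term and structures are illustrative):

    import time

    LEASE_TERM = 30.0            # seconds; illustrative value

    class LockClient:
        def __init__(self):
            self.lease_expires = time.time() + LEASE_TERM
            self.locks, self.cache = set(), {}

        def renew(self, network_ok):
            """Called periodically; a network failure means no renewal."""
            if network_ok:
                self.lease_expires = time.time() + LEASE_TERM

        def check(self):
            if time.time() > self.lease_expires:
                had_dirty = any(e.get("dirty") for e in self.cache.values())
                self.locks.clear()       # all locks are considered lost
                self.cache.clear()       # all cached data is discarded
                if had_dirty:
                    raise IOError("dirty data lost; errors until the file system is unmounted")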
Locking Service Hole
  If a Frangipani server’s lease expires due to a temporary network outage, it might still try to access Petal
  The problem is basically caused by a lack of clock synchronization
  Can be fixed without synchronized clocks by including a lease identifier with every Petal request (sketched below)
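
A sketch of the fix: the Frangipani server tags every Petal request with its current lease identifier, and Petal rejects requests that carry a stale one, so no synchronized clocks are needed (the request format here is an assumption):

    class PetalServer:
        def __init__(self):
            self.current_lease = {}     # Frangipani server -> currently valid lease id
            self.blocks = {}

        def grant_lease(self, server, lease_id):
            self.current_lease[server] = lease_id   # any older lease id becomes stale

        def write(self, server, lease_id, block, data):
            # Reject requests made under an expired (superseded) lease.
            if self.current_lease.get(server) != lease_id:
                raise PermissionError("stale lease: request rejected")
            self.blocks[block] = data

    petal = PetalServer()
    petal.grant_lease("fs1", lease_id=41)
    petal.grant_lease("fs1", lease_id=42)             # lease 41 is now stale
    petal.write("fs1", 42, "block9", b"ok")           # accepted
    # petal.write("fs1", 41, "block9", b"??")         # would raise PermissionError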
Adding and Removing Servers
  Adding a server is easy!
    Just point it at a Petal virtual disk and a lock service, and it automagically gets integrated
  Removing a server is even easier!
    Just take a sledgehammer to it
    Alternatively, if you want to be nicer, you can flush dirty data before using the sledgehammer
Backups
  Just use the snapshot feature built into Petal to do backups
  The resulting snapshot is crash-consistent: it reflects a state reachable if all Frangipani servers were to crash
  This is good enough – if you restore the backup, the recovery mechanism can handle the rest
Summary
  Microsoft Cluster Service
    Aims to provide reliable services running on a cluster
    Presents itself as a virtual node to its clients
  Frangipani
    Aims to provide a reliable distributed file system
    Uses metadata logging to recover from crashes
    Clients see it as a regular shared disk
    Adding and removing nodes is really easy