
What is Split Brain


What is split brain?

A split brain occurs when two independent systems configured in a cluster assume they have exclusive access to resources.  In SFW HA (VERITAS Cluster Server) this scenario can be caused when all cluster heartbeat links are simultaneously lost.  Each cluster node will then mark the other cluster node as FAULTED.  This is known as a "network partition".  

This is represented in the figure below:

[Figure: a four-node cluster in which simultaneous loss of both LLT heartbeat links creates a network partition, leaving Nodes 0, 1, and 2 on one side and Node 3 on the other.]

This scenario is possible when both of the LLT links (Low Latency Transport cluster communication links) to Node 3 run over the same IP network, for example through the same network switch. This common-switch configuration needs careful consideration in a Replicated Data Cluster (RDC), where Node 3 may be located in another data centre and extra care is required to run the LLT links over separate network infrastructure.

What happens in a split brain?

Under cluster logic, each side of the partition will attempt to online the service groups that were running on the nodes it now considers FAULTED. Those service groups, however, are still online on the cluster node(s) on the other side, which have formed a new cluster of their own. This may lead to disk resources and volumes being off-lined as each partition attempts to online the "failed" service groups.

How to tell if you have been a victim of a split brain?

Symptoms of a split brain appear when a service group is brought online on a cluster node on the other "side" of the network partition while it is still online elsewhere. Initial errors will involve the original node recording disk access errors and loss of reservation of the disk group.

Using the above diagram as an example, after a simultaneous LLT link failure creates a network partition:

- Partition A, containing Nodes 0, 1, and 2
- Partition B, containing Node 3

a) In the system event log, LLT will log Event ID 10033 for links expired in the other partition, so Node 3 will log messages such as:

ERROR   10033(0xc0072731) LLT <server> Link expired (tag=Adapter1, link=1, node=1)
ERROR   10033(0xc0072731) LLT <server> Link expired (tag=Adapter0, link=0, node=1)

for node=0, node=1, and node=2. Cluster nodes in Partition A will log LLT link expired messages for node=3.

b) In the application event log, the High Availability Daemon (HAD) will log that cluster nodes in the other partition have changed state to FAULTED, so Node 3 will log:

ERROR   10322(0x05dd2852) Had    <server>    VCS ERROR V-16-1-10322 System <server> (Node '0') changed state from RUNNING to FAULTED

for Node '0', Node '1', and Node '2'. Cluster nodes in Partition A will log these messages against Node '3'.

How to minimize the chances of a split brain?

VCS uses heartbeats to determine the "health" of its peers. These can be private network heartbeats and/or public (low-priority) heartbeats. Regardless of the heartbeat configuration, VCS determines that a system has faulted when all heartbeats fail simultaneously. To prevent a split brain, the following measures should be taken into consideration:

- Private Heartbeat - Ensure at least two private heartbeats are configured and that they are completely isolated from each other, so the failure of one heartbeat link cannot affect the other. Configurations such as running two shared heartbeats to the same hub or switch, or using a single virtual local area network (VLAN) to trunk between two switches, introduce a single point of failure in the heartbeat architecture.

Refer to the reference Technote in the Related Document section for additional recommendations on the private heartbeat configurations for SFW HA.

- Low-Priority Heartbeat - A heartbeat over the public network carries minimal traffic until only one normal heartbeat link remains; it then becomes a fully functional heartbeat.


In a Replicated Data Cluster, minimize the effects of split brain by ensuring the cluster heartbeat links pass through the same physical infrastructure as the replication links, so that if one breaks, so does the other.
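
Once the links are configured, it is worth confirming that every peer is visible over both private links. A sketch of such a check on a UNIX/Linux cluster node (hostnames, link names, and MAC addresses here are illustrative, and the exact columns vary by VCS version):

# lltstat -nvv
LLT node information:
    Node                 State    Link   Status   Address
   * 0 systemA           OPEN
                                  link1  UP       08:00:20:93:0E:34
                                  link2  UP       08:00:20:93:0E:38
     1 systemB           OPEN
                                  link1  UP       08:00:20:8F:D1:F2
                                  link2  UP       08:00:20:8F:D1:F3

Both links should show UP for every node; a link stuck DOWN on one side is exactly the single point of failure described above.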

VCS Communications: GAB and LLT

Communications within a VCS environment are conducted by the Group Atomic Broadcast mechanism (GAB) and the Low Latency Transport mechanism (LLT). These kernel components are used only by VCS, and replace the functions of TCP/IP for VCS private network communications.

How GAB Operates

GAB performs three major functions:

- Manages cluster memberships.
- Monitors heartbeat communication on disk or Ethernet.
- Distributes information throughout the cluster.

Managing Cluster Memberships

Because GAB is a global mechanism, all systems within the cluster are immediately notified of changes in resource status, cluster membership, and configuration. GAB is also atomic, meaning that it continuously maintains a synchronized state in the cluster membership and configuration files of all cluster systems. If a failover occurs while transmitting status changes, GAB's atomicity ensures that, upon recovery, all systems will have the same information regarding the status of any monitored resource in the cluster.
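
As an illustration (generation numbers and membership bitmaps will differ on a real cluster), the current GAB memberships can be displayed with gabconfig, where port a is the GAB membership itself and port h is the VCS engine:

# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen a36e0003 membership 01
Port h gen fd570002 membership 01

Here "membership 01" indicates that nodes 0 and 1 are both members of the cluster.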

Monitoring Heartbeats

GAB also monitors heartbeat communication between systems. Heartbeats are signals that are sent periodically from one system to another to verify that the systems are active. You may manually configure the heartbeat interval and specify the number of consecutive heartbeats that a system can miss before it determines that another system has failed.

When a system suspects that another system has failed, the system in question is probed by other systems in the cluster to verify the failure. If the system remains unresponsive, it is marked DOWN and excluded from the cluster. Its applications are then migrated to the other systems. GAB ensures that when this process begins, all remaining systems in the cluster have the same information regarding the status of the failed system and the migration of the applications. Note that GAB may kill the VCS engine when the engine is unresponsive or when previously disconnected systems are reconnected.
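
The interval and miss allowance are LLT tunables set in /etc/llttab. A minimal sketch using the standard LLT timer names and their usual defaults (values are in hundredths of a second; verify the names and defaults against your version's documentation before changing them):

# LLT timers, in 1/100ths of a second
set-timer heartbeat:50
set-timer peertrouble:200
set-timer peerinact:1600

Here heartbeat controls how often LLT sends heartbeats, peertrouble is the point at which a quiet link is flagged as in trouble, and peerinact is the point at which LLT declares the peer inactive on that link.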

Distributing Information

GAB distributes information to all systems throughout the cluster regarding system loads, agent reports, and administrative commands. GAB can also be configured to track and distribute additional information.

For a listing of GAB commands, please see VERITAS TechDoc 232090.

How LLT Operates

LLT provides kernel-to-kernel communications and monitors network communications. LLT can be configured to:

- Set system IDs within a cluster.
- Set cluster IDs for multiple clusters.
- Tune network parameters such as heartbeat frequency.

LLT runs directly on top of the Data Link Protocol Interface (DLPI) layer on UNIX.


Split brain: Split brain occurs when two or more systems within the cluster think they have exclusive access to a shared resource at the same time. This can be very damaging because data corruption is common in this situation.

Jeopardy: A system is in jeopardy when only one of its heartbeat connections is still functioning. If that remaining heartbeat network is also lost, VCS cannot tell whether the host has crashed or only its last heartbeat network has been disabled.
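
A node in jeopardy is visible in the GAB port memberships as an extra "jeopardy" line. A sketch of the typical output (generation numbers are illustrative):

# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen a36e0003 membership 01
Port a gen a36e0003 jeopardy    ;1
Port h gen fd570002 membership 01
Port h gen fd570002 jeopardy    ;1

In this example node 1 is down to its last heartbeat link.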

VCS Communication

Heartbeat communication takes place with Group Atomic Broadcast (GAB) and Low Latency Transport (LLT). These are additional software packages and kernel modules included with VCS. GAB runs over LLT and is analogous to UDP running over IP. LLT links are customarily run over private networks, either via Ethernet crossover cables or separate network switches. LLT also has the concept of a low-priority link.

This link is a backup to the normal communication channels and is not fully utilized unless the other connections are disabled. Typically this link is run over a normal Ethernet network and is not segregated the way the primary links are. If you have a backup or administrative network, that would be a good choice for your low-priority network. VCS will require at least two separate heartbeat communication channels unless overridden. Both LLT and GAB need to be configured and running on all cluster systems before VCS can be started on the cluster.

Communication between the various components of VCS is managed by the high-availability daemon, also known as "had." "Had" exchanges information between the userspace components (e.g., resource agents and CLI tools) and the kernel space components (LLT and GAB). Working alongside "had" is a process called "hashadow", whose job it is to monitor the "had" process and restart it if necessary.
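
Both daemons are easy to spot in the process list on a UNIX cluster node. A sketch, assuming the default /opt/VRTSvcs installation path:

# ps -ef | egrep 'had|hashadow' | grep -v egrep
    root   612     1  0   Jan 01 ?   0:42 /opt/VRTSvcs/bin/had
    root   614     1  0   Jan 01 ?   0:00 /opt/VRTSvcs/bin/hashadow

Killing the had process should result in hashadow starting a replacement within a few seconds, which is one of the tests suggested at the end of this article.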

VCS keeps numerous log files for debugging and monitoring in /var/VRTSvcs/log. The primary log is from "had", called the engine log. Additionally, each resource agent maintains its own log. The "halog" utility can be used to display information and contents for the engine log. To send alert messages via email or SNMP, VCS includes a notifier component that interfaces with "had". Along with the notifier, VCS can take a defined action in response to particular events. These are called "event triggers" and act similarly to database triggers.
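
For example, the engine log can be followed directly while testing (a sketch; engine_A.log is the usual current log name, and each agent writes its own log, such as Mount_A.log, in the same directory):

# tail -f /var/VRTSvcs/log/engine_A.log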

Configuration

There are several tools to view and modify VCS. These include several command-line utilities and a GUI tool called Cluster Manager, which comes with a Java console and a Web-based front-end. As with most things on Unix, it is best to understand how to use all the command-line utilities and not rely only on the GUI tools.


There are two ways to access the Veritas tools to view or modify the cluster. You can utilize a user account defined in VCS or have root access on one of the cluster systems. Access to VCS is restricted based on several user categories within VCS. They are cluster administrator, cluster operator, group administrator, group operator, and cluster guest. Each category has all the privileges of the lower categories. For example, group administrator can do all the functions of a group operator and cluster guest. Users with root access can bypass VCS authorization and run any of the command-line utilities with cluster administrator privileges. New users by default are in the cluster guest category until explicitly put into one or more of the other categories. Broadly speaking, guest users can only view the state of things; operators can view and change the state of things but not modify the configuration; administrators can do anything.

Configuration for VCS is stored in two files with similar formats -- main.cf and types.cf -- both located in /etc/VRTSvcs/conf/config. The types.cf file holds information about each resource type. The main.cf holds information specific to the cluster -- users, resources, service groups, and dependencies. Changes made to the main.cf are performed in memory and not saved to disk until a configuration dump is performed.
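
For orientation, a heavily trimmed main.cf sketch using the system, user, and group names from the sample cluster built later in this article (the cluster name my_cluster is arbitrary, the password value is stored by VCS in encrypted form, and resource definitions plus their "requires" dependency lines appear under the group as they are added):

include "types.cf"

cluster my_cluster (
        UserNames = { clusadmin = encrypted_password }
        Administrators = { clusadmin }
        )

system systemA (
        )

system systemB (
        )

group my_grp (
        SystemList = { systemA = 1, systemB = 2 }
        )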

Configuration for LLT and GAB is held in the /etc/llttab and /etc/gabtab files. Detailed sample files are supplied by Veritas in their respective installation directories. There are many configuration options for LLT, but the minimum needed to operate are the node ID, the cluster number, and the network links to be used for communication. Only nodes with the same cluster number will be able to communicate with each other, and each node in the cluster must have a unique node ID. For GAB, the only required configuration option is the number of nodes in the cluster.

Building a Sample Cluster

Putting all this together, let's see a small, two-node cluster in action. The sample hardware will be a pair of Sun v240 servers attached to EMC storage running a custom widget application. Our sample cluster will be fully redundant to the host level and there should be no single point of failure (SPOF). This design follows Veritas best practices for building a cluster. The v240 servers have four network ports built-in (bge ports) and three PCI card slots. We will install two PCI Fibre cards for redundant connection to the storage. The last PCI slot is for a quad Ethernet card (qfe ports) to back up the four internal Ethernet ports. There will be one failover service group for our application. Some common applications used in this type of cluster environment are Oracle, IBM MQSeries, and NFS servers.

There will be two heartbeat communication links in addition to a low-priority link and two paths to the public network. Each host will have mirrored root disks and redundant power supplies. The first steps are to install all the hardware, set up Solaris, mirror the root disks, and configure the storage to be visible to both servers. I will assume you have created just one Veritas diskgroup and volume for this cluster. The diskgroup should be deported before starting to configure the cluster.
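
Deporting uses the standard Volume Manager command; a sketch with the placeholder diskgroup name used throughout this example:

# vxdg deport veritas_diskgroup_name
# vxdg list

The second command simply confirms the diskgroup no longer shows as imported on either host.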

The network should be run from Ethernet port bge0 and the backup cable to the quad card port qfe0. Run crossover Ethernet cables for the heartbeats using ports bge1 and qfe1 on each server.


Use the next bge port for an administrative network connection. Install VCS via the install script supplied by Veritas. It will ask for your license keys; otherwise, you would have to install them manually with the halic command. Set up your gabtab and llttab files. Your gabtab file should look like:

/sbin/gabconfig -c -n2

In your llttab file, look for the "set-node", "set-cluster", and "link" lines. For our cluster, your file should look like:

set-node 0        (the other node would have a 1)
set-cluster 1
link bge0 /dev/bge:0 - ether - -
link qfe0 /dev/qfe:0 - ether - -
link-lowpri bge2 /dev/bge:2 - ether - -

Once the files are complete, start LLT and then GAB via their init.d startup scripts. You can confirm the hosts see each other with lltstat and gabconfig. Here is sample lltstat -n output from a working cluster:

LLT node information:
    Node          State    Links
  * 0 systemA     OPEN     3
    1 systemB     OPEN     3

Once they are working correctly, you can start VCS from its init.d script. We can then use hastatus -summary to confirm that VCS is running on all systems. We are now ready to configure VCS. Start by making the cluster configuration file writeable. Then we can add a user who will be an administrator for the entire cluster. This user can be used to access the Cluster Manager GUI:

# haconf -makerw
# hauser -add clusadmin        (this will ask you to set a password for the account)
# haclus -modify Administrators -add clusadmin

Once that is done, you can add the system names to the cluster. Then add the service group and define the systems on which it can be run. The numbers you see are the priority for that system:

# hasys -add systemA
# hasys -add systemB
# hagrp -add my_grp
# hagrp -modify my_grp SystemList -add systemA 1
# hagrp -modify my_grp SystemList -add systemB 2

At this point, we will create our resources. Most resources have various attributes that can be set. For this sample, we will only change the required attributes, but you should examine the bundled agents' reference guide to see all configurable settings. Create the diskgroup, volume, and mount resources and modify their attributes, then link them so the dependencies come online in this order: diskgroup, volume, then mount point:

# hares -add my_diskgroup DiskGroup my_grp
# hares -modify my_diskgroup DiskGroup veritas_diskgroup_name
# hares -add my_volume Volume my_grp
# hares -modify my_volume Volume veritas_volume_name
# hares -modify my_volume DiskGroup veritas_diskgroup_name
# hares -add my_mount Mount my_grp
# hares -modify my_mount MountPoint /clustermount
# hares -modify my_mount FSType vxfs


# hares -modify my_mount FsckOpt %-y
# hares -modify my_mount BlockDevice /dev/vx/dsk/veritas_diskgroup_name/veritas_volume_name
# hares -link my_volume my_diskgroup
# hares -link my_mount my_volume

Now we can add the network resources. The MultiNICA resource will control network failover between the bge0 and qfe0 ports on our sample servers. Here we specify the local bge0 and qfe0 ports and which IP addresses to assign them. This does not take the place of Solaris assigning IP addresses at boot time; VCS is merely monitoring the status of the network links. The IPMultiNIC resource is the virtual IP for the cluster with which clients will communicate:

# hares -add my_multinic MultiNICA my_grp
# hares -local my_multinic Device
# hares -modify my_multinic Device bge0 192.168.0.1 -sys systemA
# hares -modify my_multinic Device qfe0 192.168.0.1 -sys systemA
# hares -modify my_multinic Device bge0 192.168.0.2 -sys systemB
# hares -modify my_multinic Device qfe0 192.168.0.2 -sys systemB
# hares -add my_ipaddress IPMultiNIC my_grp
# hares -modify my_ipaddress Address 192.168.0.3
# hares -modify my_ipaddress MultiNICResName my_multinic

The final part of the process is the most important. Here we will add the application resource. Typically, the application requires the mount point and the IP address to be online before it can start, so we will make those dependencies. Figure 1 shows the dependency layout in the Cluster Manager GUI. The application agent can monitor your application in several different ways; here we simply point it at the application's start script, stop script, and PID file. We then enable all the resources in the service group, because VCS will not attempt to online, offline, or monitor any resources unless they are enabled. Finally, we will dump the running configuration from memory to disk and make the file read-only:

# hares -add my_application Application my_grp
# hares -modify my_application PidFiles /path/to/pidfile
# hares -modify my_application StartProgram /path/to/startup/script
# hares -modify my_application StopProgram /path/to/shutdown/script
# hares -link my_application my_ipaddress
# hares -link my_application my_mount
# hagrp -enableresources my_grp
# haconf -dump -makero

Now we are ready to test everything. A simple online should see the service group start on one of the systems. If that works, you can try to switch the group over to the other node:

# hagrp -online my_grp -sys systemA
# hagrp -switch my_grp -to systemB

Additional valuable tests include:

Kill the "had" process and check that "hashadow" restarts it. Panic an active system so the service group will fail over to another node. This also will

test whether the applications can survive such an event. Pull heartbeat and network cables and make sure everything reacts as expected.
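
After each of these tests, the cluster should settle back into a healthy state. A sketch of what hastatus -summary might report for this sample cluster once my_grp has been switched to systemB (spacing abbreviated):

# hastatus -summary
-- SYSTEM STATE
-- System        State          Frozen
A  systemA       RUNNING        0
A  systemB       RUNNING        0

-- GROUP STATE
-- Group         System         Probed   AutoDisabled   State
B  my_grp        systemA        Y        N              OFFLINE
B  my_grp        systemB        Y        N              ONLINE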
