
PowerHA SystemMirror 7.2

Split Brain Handling through SCSI PR Disk Fencing

Authors: Abhimanyu Kumar, Prabhanjan Gururaj, Rajeev S Nimmagada, Ravi Shankar

Table of Contents

1 Introduction
2 Cluster Split Brain Condition
3 Disk Fencing
    3.1 Disk Fencing Prerequisites
        3.1.1 SCSI-3 Persistent Reserve capabilities support
            3.1.1.1 EMC Storage and SCSI-3
            3.1.1.2 Hitachi Storage & SCSI-3
    3.2 Setup
        3.2.1 Prerequisites
        3.2.2 SMIT Panel
        3.2.3 clmgr Command
    3.3 Runtime Considerations
4 References

1 Introduction

This blog introduces a new feature, the “Disk Fencing Quarantine Policy”, in the PowerHA SystemMirror 7.2 release. This feature protects against rare cluster split-brain conditions in a PowerHA SystemMirror cluster.

2 Cluster Split Brain Condition

Cluster-based High Availability solutions depend on redundant communication channels between the nodes in a cluster to monitor the health of those nodes. This communication is a critical function: it enables the cluster to start the workload on an alternate node when the production node goes down.

Figure 1 below shows a typical cluster deployment used to provide a High Availability (HA) topology. This cluster has 3 network connections for redundancy and also a disk that is used for heartbeat purposes. This example configuration provides 4 redundant communication channels between the nodes in the cluster. Note that in this case the application is operating from an active LPAR on System 1, and the System 2 LPAR is in passive/standby mode, ready to take over the workload if the active (primary) LPAR fails. Each node tracks the health of its partner by monitoring the heartbeats and other communications. For the heartbeats to be exchanged, there has to be a working communication channel between the nodes. Hence it is essential to have as many redundant communication channels as possible between the nodes to avoid false failure detection.

Fig 1: Cluster High Availability: Redundant communication channels


However, there could be scenarios where all the communication channels between the nodes are broken. Examples of such scenarios include:

1. On extremely rare occasions, hardware errors of the "sick but not dead" type (for example, all the I/O fabrics freezing for one node) can cause the primary LPAR to freeze for long periods, so that no I/O communication leaves the LPAR. The standby LPAR then receives no heartbeats from the active LPAR for a predetermined duration (the Node Failure Detection Time) and declares the active LPAR dead. After that declaration, however, the active LPAR may unfreeze and continue its I/O. The cluster is then split for a period of time, which can have a major impact on data integrity, as explained later.

2. Some I/O failures can result in an extended I/O blackout window. For example, if the PCI bus master fails, it can freeze the entire PCI infrastructure for a period during which no I/O activity occurs. If this blackout lasts longer than the Node Failure Detection Time threshold, the standby LPAR declares the primary to have failed (a false failure), and the cluster temporarily enters a split state during the blackout.
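Both scenarios come down to the same timing rule: the standby declares a failure once the silence exceeds the Node Failure Detection Time, whether or not the partner is really dead. The following is a minimal Python sketch of that rule only. The function name, the one-second heartbeat interval, and the 30-second threshold are illustrative assumptions, not PowerHA SystemMirror internals or defaults.

```python
def standby_declares_failure(heartbeat_times, detection_time, now):
    """Return True if the standby has seen no heartbeat for more than
    detection_time seconds as of time `now`."""
    seen = [t for t in heartbeat_times if t <= now]
    if not seen:
        return True
    return (now - max(seen)) > detection_time


# The active LPAR sends a heartbeat every second, freezes from t=10 to t=50
# (an I/O blackout), then resumes. All values are illustrative.
heartbeats = list(range(0, 11)) + list(range(50, 61))
DETECTION_TIME = 30  # hypothetical Node Failure Detection Time, in seconds

# 25 seconds of silence: still within the threshold.
short_gap = standby_declares_failure(heartbeats, DETECTION_TIME, now=35)
# 35 seconds of silence: the (frozen, not dead) active LPAR is declared failed.
long_gap = standby_declares_failure(heartbeats, DETECTION_TIME, now=45)

print(short_gap, long_gap)
```

In this sketch the active LPAR resumes at t=50, after the standby has already declared it dead at some point past t=40. That window is the split described above.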

A cluster split results in two sides. In a 2-node cluster, each side consists of one node. If the cluster has more nodes, a side can contain more than one node. For example, in a 4-node cluster the split could produce sides of:

• 1 node and 3 nodes

• 2 nodes and 2 nodes

These sides are also called partitions or islands (of nodes).
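To make the possible islands concrete, the short Python sketch below enumerates every way a set of nodes can fall into two non-empty sides. The node names are hypothetical, and the enumeration is purely combinatorial; it says nothing about which splits are likely in a given network topology.

```python
from itertools import combinations


def two_way_splits(nodes):
    """List every unordered split of `nodes` into two non-empty islands."""
    nodes = sorted(nodes)
    first, rest = nodes[0], nodes[1:]
    splits = []
    # Pin the first node to side A so that each split is listed only once,
    # and stop before side A swallows every node (side B must be non-empty).
    for extra_count in range(0, len(rest)):
        for extra in combinations(rest, extra_count):
            side_a = (first,) + extra
            side_b = tuple(n for n in nodes if n not in side_a)
            splits.append((side_a, side_b))
    return splits


# Hypothetical 4-node cluster.
for side_a, side_b in two_way_splits(["node1", "node2", "node3", "node4"]):
    print(side_a, "|", side_b)
```

For four nodes the island sizes are always either 1 and 3 or 2 and 2, matching the two cases listed above.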

A Cluster split could result in Application being started on the Standby LPAR incorrectly due to the false node failure detection. This is shown in the figure below:


Fig 2: Cluster Split Condition: Incorrect and Duplicate application start.

As can be seen, the standby incorrectly declares the primary LPAR to have failed and starts the application, so the application is active on both LPARs simultaneously. This has the disastrous result of both application instances writing to the shared disks, corrupting data. Cluster splits are rare but unavoidable in extreme cases, such as "sick but not dead" components in the environment.

It is critical to protect the cluster environment against these rare conditions. PowerHA SystemMirror v7 has supported many capabilities such as disk tie breaker to protect against the Cluster split conditions. PowerHA SystemMirror Version 7.2 introduces new capabilities to protect against the split brain condition. Two quarantine policies are introduced to handle split scenarios:

1. Active Node Halt Policy (ANHP)

2. Disk Fencing

In this blog we will review the Disk Fencing in some detail.


3 Disk Fencing

The Disk Fencing policy fences out the disks of all the volume groups added to any resource group, using SCSI Persistent Reserve protocols. This ensures that only one island has access to the disks, so the data remains protected and the application operates from only one node in the cluster at any time.

PowerHA SystemMirror registers with the disks of all the volume groups that are part of any resource group.

Fig 3: Cluster Split Condition: Disk Fencing flow

As shown in Figure 3, the standby LPAR reaches out to the storage and requests that the active LPAR's disk access be revoked (pre-empted). The storage then blocks any write access from the active LPAR, even if it returns from a sick to a healthy state. The standby LPAR brings up the application/RG only if it successfully fences out the active LPAR from the RG-related disks. If the standby encounters any error while fencing out the active LPAR, the workload is not brought up; the administrator must review and correct the environment and then, if necessary, bring up the RG manually.
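The flow above can be sketched as a toy model in Python. This is an illustration under stated assumptions, not the PowerHA SystemMirror implementation and not a real SCSI-3 interface: the `SharedDisk` class, its `fencing_works` flag, and the `take_over` helper are all hypothetical, and a disk simply accepts writes from whichever nodes are still registered with it.

```python
class SharedDisk:
    """Toy model of one shared disk that tracks registered nodes."""

    def __init__(self, name, fencing_works=True):
        self.name = name
        self.fencing_works = fencing_works  # hypothetical failure switch
        self.registrants = set()

    def register(self, node):
        self.registrants.add(node)

    def preempt(self, by, victim):
        """Remove the victim's registration; return True on success."""
        if not self.fencing_works or by not in self.registrants:
            return False
        self.registrants.discard(victim)
        return True

    def can_write(self, node):
        # Only nodes that are still registered may write.
        return node in self.registrants


def take_over(disks, standby, active):
    """Try to fence the active node out of every disk.

    Returns True only if every disk was fenced, which is the condition
    for bringing the RG online in this sketch.
    """
    results = [disk.preempt(by=standby, victim=active) for disk in disks]
    return all(results)


# Case 1: fencing succeeds on every disk.
disks = [SharedDisk("hdisk1"), SharedDisk("hdisk2")]
for disk in disks:
    disk.register("active")
    disk.register("standby")
rg_may_start = take_over(disks, standby="standby", active="active")
active_can_still_write = any(d.can_write("active") for d in disks)

# Case 2: one disk cannot be fenced, so the RG must stay offline.
faulty = [SharedDisk("hdisk3"), SharedDisk("hdisk4", fencing_works=False)]
for disk in faulty:
    disk.register("active")
    disk.register("standby")
rg_blocked = not take_over(faulty, standby="standby", active="active")

print(rg_may_start, active_can_still_write, rg_blocked)
```

The design point the sketch tries to capture is that the decision is all-or-nothing: a single disk that cannot be fenced is enough to keep the workload down until an administrator intervenes.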

Some of the key attributes of PowerHA SystemMirror Disk Fencing are:

1. Disk fencing applies to Active-Passive cluster deployments. It is not suitable for, and is not supported with, RGs deployed as online on all nodes.

2. Disk fencing is supported for the entire cluster, so it is enabled or disabled at the cluster level.

3. PowerHA SystemMirror does the key setup and management related to SCSI-3 reserves. The administrator is not expected to do any SCSI-3 key management.

4. All the disks that belong to the volume groups of the various Resource Groups (RGs) are managed for disk fencing.

5. Disk fencing can be used in mutual takeover configurations. If multiple RGs exist in the cluster, the administrator needs to choose one RG to be the most critical RG. This RG’s relative location at the time of the split decides which side wins: the side where the critical RG was not running before the split, which is considered the standby side for the critical RG at that time.
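The winner selection in item 5 can be expressed in a few lines of Python. This is a sketch of the stated rule only; the function name and node names are hypothetical, and it assumes exactly two islands with the critical RG hosted on one of them.

```python
def winning_island(islands, critical_rg_node):
    """Return the island that did NOT host the critical RG before the split."""
    candidates = [island for island in islands if critical_rg_node not in island]
    if len(candidates) != 1:
        raise ValueError(
            "expected exactly two islands, one of them hosting the critical RG"
        )
    return candidates[0]


# Two-node cluster: the critical RG was running on nodeA before the split.
winner_two_node = winning_island([{"nodeA"}, {"nodeB"}], critical_rg_node="nodeA")

# Four-node cluster split 1 + 3: the critical RG was running on node2.
winner_four_node = winning_island(
    [{"node1"}, {"node2", "node3", "node4"}], critical_rg_node="node2"
)

print(winner_two_node, winner_four_node)
```

Note that under this rule the smaller island can win: in the second example the single-node island survives because the critical RG was not running there.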

PowerHA SystemMirror 7.2 uses as much information as possible from the cluster to determine the health of the partner nodes. For example, if the active LPAR is about to crash, it tries to send a “Last Gasp” message to the standby LPAR before terminating. These notifications help ensure that the standby LPAR is certain of the death of the active LPAR and can therefore take ownership of the workload safely. However, there are cases where the standby LPAR knows that the active LPAR is not sending heartbeats but is not sure of its actual status. In these cases, the standby LPAR declares that the active LPAR has failed after waiting for the Node Failure Detection Time. At that point, since the standby partition is not sure of the health of the active LPAR, it fences out all the disks before bringing the resource groups online. If it fails to fence even a single disk of any volume group, the resource group is not brought online.

3.1 Disk Fencing Prerequisites

1. Storage systems should be enabled for SCSI-3 PR capabilities for all the disks managed as part of the disk fencing.

2. None of the disks to be managed for disk fencing should be in use when disk fencing is enabled (that is, all the VGs should be offline).

3. Disks should be free of any reserves before starting PowerHA SystemMirror configuration. Tools are provided to release any reserves.

3.1.1 SCSI-3 Persistent Reserve capabilities support

One of the key requirements for Disk Fencing is that all the disks being used support the SCSI-3 Persistent Reserve (PR) protocols. Note that some storage subsystems do not enable the PR capabilities by default. These storage subsystems provide commands or graphical interfaces to enable the PR capabilities. Some storage-specific guidelines are provided below. Note that these instructions might be out of date depending on the storage model. Refer to the storage vendor documentation for the exact methods to enable the SCSI-3 capabilities.

3.1.1.1 EMC Storage and SCSI-3

EMC disks do not support SCSI-3 capabilities by default. If you try to configure disk fencing without enabling the capability in EMC storage, you will get an error.

Enable SCSI-3 reservation capabilities in EMC storages (VMAX and DMX) by enabling SPC2 and SC3 capabilities for each disk assigned to the Volume groups to be managed by PowerHA SystemMirror.

Please refer to the EMC documentation for detailed instructions to enable the SCSI-3 capability.

Following is an example set of steps tested with EMC storage VMAX (note that many of these commands are part of the EMC software packages installed on PowerHA/AIX LPAR):

For each disk, do the following. While performing these operations the disks should not be in use: none of the VGs containing these disks should be varied on. Ideally, make sure that the disks are not in use on any node in the cluster.

1. Find the device/disk id in the EMC storage subsystem

2. Enable the SCSI-3 PR capabilities in the EMC storage subsystem

3. Rediscover the disk/s fresh in AIX:

a. Remove the device/disk

b. Run cfgmgr to discover the disk/s

4. Verify that the SCSI-3 capabilities are enabled for the disk/s

Once these steps have been completed, configure PowerHA SystemMirror disk fencing.

Here are the example commands:

Retrieve the disk identity from EMC storage, that is, the Symmetrix ID (sid) and the logical device (device ID):

    # powermt display dev=hdiskpowerX
    Pseudo name=hdiskpowerX
    Symmetrix ID=000194900568
    Logical device ID=0036
    Device WWN=6000097000019490056853303030xxxx
    state=alive; policy=SymmOpt; queued-IOs=0

Enable the SCSI-3 capability using the disk identity:

    # symconfigure -sid 000194900568 -cmd "set device 0036 attribute=SCSI3_persist_reserv;" commit -v -noprompt

Rediscover the disk in AIX:

    # rmdev -Rdl hdiskpowerX
    # cfgmgr

Verify that the SCSI-3 capability is enabled in storage:

    # /usr/symcli/bin/symdev -sid 000194900568 show 0036 | grep SCSI
    SCSI-3 Persistent Reserve: Enabled

3.1.1.2 Hitachi Storage & SCSI-3

Hitachi disables SCSI-3 capability by default. You need to manually enable these capabilities for the disk groups assigned to the PowerHA SystemMirror LPARs for shared VG management.

Enable the HMO 2 and HMO 72 options through the graphical management interface software that Hitachi provides for the storage. Note that the graphical interface pictured is from Hitachi; refer to the Hitachi documentation for more details.

3.2 Setup

This section explains how to set up PowerHA SystemMirror 7.2 to enable Disk Fencing.

Before enabling the disk fencing mechanism, ensure that all the shared disks used in the cluster support the SCSI-3 protocols. Details are provided in the prerequisites (Section 3.2.1).

Setup for this policy could be done in one of two ways:

1. Using SMIT panels (Section 3.2.2)

2. Using clmgr command line (section 3.2.3)


3.2.1 Prerequisites

• Currently, SCSI-3 protocols are not supported on iSCSI disks.

• For EMC disks, a minimum version of PowerPath v6.0.1 is needed.

◦ Set the flags SCSI3_persist_reserv, SPC2, SC3

• For Hitachi disks, HMO 2 and HMO 72 should be set.

◦ The minimum code level that supports HMO 72 is 70-04-31-00/00

A resource group must be defined as the critical RG in the cluster.

• The critical RG has to span all nodes in the cluster.

• The critical RG cannot have the startup policy Online on All Available Nodes.

• The critical RG cannot be a child in any relationship.

▪ Parent-child or Start-after

• The critical RG should have the higher priority in location dependencies.

• If the Active Node Halt policy is also enabled on the cluster, only one RG is chosen as the critical RG in the cluster for both policies.

• The critical RG can be a dummy RG without any resources.

To see if a physical volume is SCSI Persistent Reserve Type 7H capable

clmgr view pv <hdiskx>

NAME="hdisk10"

PVID="00f74e512845cbb7"

UUID="16e2d679-e986-38e3-49d2-2650f3089bad"

VOLUME_GROUP="None"

TYPE="mpioosdisk"

DESCRIPTION="MPIO 2810 XIV Disk"

SIZE="16411"

AVAILABLE="16411"

CONCURRENT="true"


ENHANCED_CONCURRENT_MODE="true"

STATUS="Available"

SCSIPR_CAPABLE="Yes"

Note: SCSIPR_CAPABLE="No" is reported if the physical volume is not SCSI Persistent Reserve Type 7H capable.

Instead of checking each disk individually, you can run the equivalent command for each VG.

To see if a volume group is SCSI Persistent Reserve Type 7H capable

clmgr view vg <vg_name>

NAME="vg1"

TYPE="SCALABLE"

NODES="powerha13,powerha14"

LOGICAL_VOLUMES=""

PHYSICAL_VOLUMES="hdisk77@powerha13@00f74e514901ede5"

MIRROR_POOLS=""

STRICT_MIRROR_POOLS="no"

RESOURCE_GROUP="crg"

AUTO_ACTIVATE="false"

QUORUM="true"

CONCURRENT_ACCESS="true"

CRITICAL="false"

ON_LOSS_OF_ACCESS=""

NOTIFYMETHOD=""

MIGRATE_FAILED_DISKS="false"

SYNCHRONIZE="false"

LOGICAL_TRACK_GROUP_SIZE="512"

MAX_PHYSICAL_PARTITIONS="32768"

PPART_SIZE="16"

MAX_LOGICAL_VOLUMES="256"

MAJOR_NUMBER="56"

IDENTIFIER="00f74e5100004c000000014eb564aee7"


TIMESTAMP="55af76ae130082e9"

SCSIPR_CAPABLE="Yes"

Note: SCSIPR_CAPABLE="No" is reported if the volume group is not SCSI Persistent Reserve Type 7H capable.
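When many disks or volume groups need checking, the KEY="value" output shown above is easy to scan with a script. The Python sketch below parses text in that format and reports the SCSIPR_CAPABLE value. The helper names are hypothetical and the sample is an abbreviated copy of the listing above; capturing the command output (for example with the subprocess module on a cluster node) is left out because it cannot be exercised outside a PowerHA environment.

```python
import re


def parse_clmgr_attributes(text):
    """Parse lines of the form KEY="value" into a dictionary."""
    attrs = {}
    for line in text.splitlines():
        match = re.match(r'^\s*([A-Z_]+)="(.*)"\s*$', line)
        if match:
            attrs[match.group(1)] = match.group(2)
    return attrs


def is_scsipr_capable(text):
    """Return True only if the output reports SCSIPR_CAPABLE="Yes"."""
    value = parse_clmgr_attributes(text).get("SCSIPR_CAPABLE", "")
    return value.lower() == "yes"


# Abbreviated sample taken from the volume group listing above.
sample = '''NAME="vg1"
TYPE="SCALABLE"
NODES="powerha13,powerha14"
RESOURCE_GROUP="crg"
SCSIPR_CAPABLE="Yes"
'''

attrs = parse_clmgr_attributes(sample)
print(attrs["NAME"], is_scsipr_capable(sample))
```

A missing SCSIPR_CAPABLE line is treated as "not capable" here, which is the conservative choice for a pre-flight check.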

Sometimes reserves placed on the disks prevent the reserve policy from being changed to PR_shared. PowerHA SystemMirror will try to reset the policy to PR_shared, but it is recommended that customers make sure no other application changes the reserve policy of the disks.

The reserve policy can be checked with the lsattr command.

# lsattr -El hdisk10

PCM PCM/friend/vscsi Path Control Module False

algorithm fail_over Algorithm True

…........................................................

pvid 00c6fa22d39a8ec90000000000000000 Physical volume identifier False

queue_depth 3 Queue DEPTH True

reserve_policy no_reserve Reserve Policy True+

PowerHA SystemMirror 7.2 also provides a new command, clrsrvmgr, to check the reserves on a disk.

# clrsrvmgr -r -l hdisk1 -v

Effective reserve policy on hdisk1 : no_reserve

Configured reserve policy on hdisk1 : no_reserve

Reservation status on /dev/hdisk1 : No Reservation

This command can also be run on the VG directly:

# clrsrvmgr -r -g pgvg -v


Configured reserve policy on hdisk1 : no_reserve

Reservation status on /dev/hdisk1 : No Reservation



Configured reserve policy on hdisk2 : single_path

Reservation status on /dev/hdisk2 : Single Path Reservation

Effective reserve policy on hdisk3 : PR_shared

Configured reserve policy on hdisk3 : PR_shared

Reservation status on /dev/hdisk3 : SCSI PR Reservation

This command can be run at any time on the cluster.

Before cluster services are started, run it to make sure the state is no_reserve (hdisk1 in the above example).

While cluster services are running, use it to confirm that the SCSI PR reservation is set on the disk (hdisk3 in the above example).

Shared disks in the cluster should not be in single_path mode (hdisk2 in the above example).
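These three checks can be automated by parsing the report text. The Python sketch below assumes the line format seen in the example output above ("... reserve policy on hdiskN : value" and "Reservation status on /dev/hdiskN : value"); the function names are hypothetical, and the real output format may differ between releases.

```python
import re

POLICY_RE = re.compile(
    r"^(Effective|Configured) reserve policy on (\S+)\s*:\s*(\S+)\s*$"
)
STATUS_RE = re.compile(r"^Reservation status on /dev/(\S+)\s*:\s*(.+?)\s*$")


def parse_reserve_report(text):
    """Collect per-disk 'effective', 'configured' and 'status' fields."""
    disks = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = POLICY_RE.match(line)
        if match:
            kind, disk, policy = match.groups()
            disks.setdefault(disk, {})[kind.lower()] = policy
            continue
        match = STATUS_RE.match(line)
        if match:
            disk, status = match.groups()
            disks.setdefault(disk, {})["status"] = status
    return disks


def single_path_disks(disks):
    """Return the disks whose configured policy is single_path."""
    return sorted(
        name for name, info in disks.items()
        if info.get("configured") == "single_path"
    )


# Sample lines copied from the volume group report above.
sample_report = """
Configured reserve policy on hdisk1 : no_reserve
Reservation status on /dev/hdisk1 : No Reservation
Configured reserve policy on hdisk2 : single_path
Reservation status on /dev/hdisk2 : Single Path Reservation
Effective reserve policy on hdisk3 : PR_shared
Configured reserve policy on hdisk3 : PR_shared
Reservation status on /dev/hdisk3 : SCSI PR Reservation
"""

report = parse_reserve_report(sample_report)
print(single_path_disks(report))
```

Any disk returned by `single_path_disks` would need its reserve policy corrected before it is used as a shared disk in the cluster.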


3.2.2 SMIT Panel

PowerHA SystemMirror does not enable the Disk Fencing quarantine policy by default. The administrator has to enable it before starting the cluster services. Note that this policy cannot be enabled while cluster services are active on one or more nodes of the cluster, and it cannot be changed on a VG that is already online. The policy can be combined with the Tie Breaker split handling policies (Tie Breaker decisions are made first, and then the quarantine policy takes effect). The SMIT path for configuring Disk Fencing is given below.

The menus can be accessed from smit hacmp → Custom Cluster Configuration → Cluster Nodes and Networks → Initial Cluster Setup (Custom) → Configure Cluster Split and Merge Policy → Quarantine Policy → Disk Fencing

or use the fastpath, smit cm_cluster_quarintine_disk_dialog.


3.2.3 clmgr Command

The clmgr options provided to configure the Quarantine Policy are shown below:

clmgr modify cluster \
    [ SPLIT_POLICY={none|tiebreaker|manual} ] \
    [ TIEBREAKER=<disk> ] \
    [ MERGE_POLICY={majority|tiebreaker|priority|manual} ] \
    [ NOTIFY_METHOD=<method> ] \
    [ NOTIFY_INTERVAL=### ] \
    [ MAXIMUM_NOTIFICATIONS=### ] \
    [ DEFAULT_SURVIVING_SITE=<site> ] \
    [ APPLY_TO_PPRC_TAKEOVER={yes|no} ] \
    [ ACTION_PLAN=reboot ] \
    [ QUARANTINE_POLICY=<node_halt|fencing|halt_with_fencing> ] \
    [ CRITICAL_RG=<rg_value> ]
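As an illustration of how the two quarantine-related attributes fit together, the Python sketch below assembles (but does not run) a clmgr argument list. Only the attribute names and policy values shown in the syntax above are used. The helper name is hypothetical, and the RG name "crg" is borrowed from the sample volume group output earlier in this document; substitute the critical RG of your own cluster.

```python
QUARANTINE_POLICIES = ("node_halt", "fencing", "halt_with_fencing")


def build_quarantine_command(policy, critical_rg):
    """Return the argument list for setting the quarantine policy.

    The command is only assembled here; running it requires a PowerHA
    SystemMirror node with cluster services stopped.
    """
    if policy not in QUARANTINE_POLICIES:
        raise ValueError(f"unknown quarantine policy: {policy!r}")
    if not critical_rg:
        raise ValueError("a critical RG name is required")
    return [
        "clmgr", "modify", "cluster",
        f"QUARANTINE_POLICY={policy}",
        f"CRITICAL_RG={critical_rg}",
    ]


command = build_quarantine_command("fencing", "crg")
print(" ".join(command))
```

Validating the policy name before building the command catches typos early, since a mistyped value would otherwise only surface when the command is run on the cluster.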

The clmgr options for managing Disk Fencing related operations are listed below.

• To check whether an hdisk supports SCSI-3 PR Type 7H:

clmgr query pv <hdiskXX>

• To check whether all the disks of a volume group support SCSI-3 PR Type 7H:

clmgr query vg <vg_name>

• To clear the SCSI-3 reservation from a disk:

clmgr modify pv <hdiskXX> SCSIPR_ACTION=clear

• To clear the SCSI-3 reservation from all the disks of a volume group:

clmgr modify vg <vg_name> SCSIPR_ACTION=clear

3.3 Runtime Considerations

On a cluster where SCSI PR disk fencing has been enabled, the reservation state of a disk or VG can be checked using the clrsrvmgr command.

If cluster services are running and disk fencing is enabled, a disk should report the following reserve policy:

Configured Reserve Policy : PR_shared

Effective Reserve Policy : PR_shared

Reservation Status : SCSI PR reservation – Write_Exclusive_All_Registrants

Here, the configured reserve policy is the ODM reserve policy, the effective reserve policy is the kernel reserve policy, and the reservation status is the device reservation state (Persistent Reserve type).

The output should be the same on all nodes in the cluster where cluster services are active.
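A consistency check across nodes can be sketched as follows. The Python snippet assumes that the three values have already been collected from each node into a dictionary; how they are collected is outside the sketch. The function name is hypothetical, and the node names are borrowed from the sample output earlier in this document.

```python
def nodes_with_unexpected_state(per_node):
    """Return the nodes whose disk state differs from the expected
    PR_shared / PR_shared / SCSI PR reservation combination."""
    unexpected = []
    for node, state in sorted(per_node.items()):
        healthy = (
            state.get("configured") == "PR_shared"
            and state.get("effective") == "PR_shared"
            and state.get("status", "").lower().startswith("scsi pr")
        )
        if not healthy:
            unexpected.append(node)
    return unexpected


# Hypothetical values gathered for one shared disk on two nodes.
observed = {
    "powerha13": {
        "configured": "PR_shared",
        "effective": "PR_shared",
        "status": "SCSI PR reservation",
    },
    "powerha14": {
        "configured": "PR_shared",
        "effective": "no_reserve",
        "status": "No Reservation",
    },
}

print(nodes_with_unexpected_state(observed))
```

An empty result means every node reports the expected state for that disk; any node listed needs a closer look before relying on fencing.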

If a split has happened and a node has been preempted from the disks, any I/O that node issues to the disks is no longer permitted. LVM_IO_FAIL errors appear on the node, and the resource groups eventually go into the ERROR state. This happens on all the nodes of the losing island.

The resource groups cannot be recovered on these nodes without stopping the cluster services, even after the split has healed. This restriction is in place so that a resource cannot be started on such a node again and thereby cause data corruption.

Once the cluster has healed, the nodes on the losing side should stop cluster services and start them again to rejoin the cluster.

On a Linked or Stretched cluster with certain split and merge policies configured, the nodes on the losing island will reboot, so cluster services need to be started after the split has healed.

If, because of timing problems, the active node ever ends up unable to bring all the VGs online due to disk fencing issues, the following steps can be used to recover the reservations on that node.

smit hacmp → Problem Determination Tools

Then select the RG in Error state on the next screen.

This clears any RGs in Error state, and the RG can be brought online again.


4 References

• scdisk SCSI Device Driver: https://ibm.biz/BdE88s

• T10 Documents

◦ a. SPC-4: https://ibm.biz/BdECDk

◦ b. IBMer PPT: http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-024r2.pdf

• Understanding Persistent Reserves: http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0/com.ibm.cluster.gpfs.v4r1.gpfs500.doc/bl1pdg_understandpr.htm

• SCSI reservation methodologies: http://www.aixmind.com/?p=757




Disclaimers:

This article provides an overview of Disk Fencing management and is not expected to be complete. Refer to the PowerHA SystemMirror documentation for the most recent and correct information about SCSI-3 Disk Fencing. All the information in this article is the opinion of the authors and in no way represents the position of the PowerHA SystemMirror product or IBM.
