VERITAS Cluster Server for Solaris
Configuring VCS Response to Resource Faults
VCS_3.5_Solaris_R3.5_20020915
10-2
How VCS Responds to Faults
When a resource faults, VCS responds as follows:
• Is a critical resource online in the path of the faulted resource?
– No: keep the group partially online.
– Yes: offline all resources in the path, call the resfault trigger (if present), and offline the entire service group.
• Is another system available in the SystemList?
– Yes: start the service group elsewhere.
– No: keep the service group offline and run the NoFailover trigger.
Failover Policies
• The AutoFailOver attribute indicates whether automatic failover is enabled for the service group. The default value is 1 (enabled).
• The FailOverPolicy attribute specifies how a target system is selected:
– Priority—System with the lowest priority number in the list is selected (default).
– RoundRobin—System with the least number of active service groups is selected.
– Load—System with greatest available capacity is selected.
• Example configuration:
hagrp -modify group AutoFailOver 0
hagrp -modify group FailOverPolicy Load
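The selection logic of the three policies can be sketched in a few lines. This is an illustrative model, not VCS source; the function and parameter names are hypothetical:

```python
# Illustrative model of FailOverPolicy target selection (not VCS source;
# the function and parameter names are hypothetical).
def select_target(policy, system_list, active_groups=None,
                  available_capacity=None):
    """system_list maps system name -> priority (lower is preferred)."""
    systems = list(system_list)
    if policy == "Priority":
        # Default policy: lowest priority number in SystemList wins.
        return min(systems, key=lambda s: system_list[s])
    if policy == "RoundRobin":
        # System running the fewest active service groups wins.
        return min(systems, key=lambda s: active_groups[s])
    if policy == "Load":
        # System with the greatest AvailableCapacity wins.
        return max(systems, key=lambda s: available_capacity[s])
    raise ValueError("unknown FailOverPolicy: " + policy)

print(select_target("Priority", {"Svr1": 0, "Svr2": 1}))  # Svr1
print(select_target("Load", {"Svr1": 0, "Svr2": 1},
                    available_capacity={"Svr1": 75, "Svr2": 200}))  # Svr2
```

The slides that follow walk through each of these three policies in turn.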
Priority Failover Policy
Example: three service groups across three servers (Svr1, Svr2, Svr3):
AP1: SystemList = {Svr1 = 0, Svr2 = 1}
DB: SystemList = {Svr2 = 0, Svr1 = 1}
AP2: SystemList = {Svr3 = 0, Svr1 = 1, Svr2 = 2}
The available system with the lowest priority number in SystemList is selected.
Round Robin Failover Policy
Example: four servers (Svr1 through Svr4). The system running the fewest active service groups is selected.
Load Failover Policy
1. Define system Capacity based on server capability.
AvailableCapacity = Capacity - Load
Example: three servers are each assigned a Capacity of 300; a fourth, smaller server is assigned a Capacity of 150.
Determining Load
1. Define system Capacity based on server capability.
2. Define group Load based on application requirements:
• iPlanet requires 100 units of Load
• Sybase requires 125
• Oracle 8i requires 150
• NFS shares require 75 each
AvailableCapacity = Capacity - Load:
• Capacity 300, running Oracle 8i (150) and NFS1 (75): AvailableCapacity = 75
• Capacity 300, running iPlanet (100): AvailableCapacity = 200
• Capacity 300, running Sybase (125): AvailableCapacity = 175
• Capacity 150, running NFS2 (75) and NFS3 (75): AvailableCapacity = 0
Determining the Failover Target
Continuing the example, Oracle 8i FAILS.
1. VCS brings Oracle 8i online on the server with 200 AvailableCapacity (the server running iPlanet).
2. VCS recalculates AvailableCapacity based on the new Load: 300 - (100 + 150) = 50.
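The worked example on these slides can be reproduced with a short sketch. The names and values come from the slides; the helper functions are hypothetical, not VCS source:

```python
# Illustrative model of the worked example (names/values from the slides;
# the helper functions are hypothetical, not VCS source).
capacity = {"Svr1": 300, "Svr2": 300, "Svr3": 300, "Svr4": 150}
load = {"Oracle8i": 150, "NFS1": 75, "iPlanet": 100,
        "Sybase": 125, "NFS2": 75, "NFS3": 75}
online = {"Svr1": ["Oracle8i", "NFS1"], "Svr2": ["iPlanet"],
          "Svr3": ["Sybase"], "Svr4": ["NFS2", "NFS3"]}

def available_capacity(system):
    # AvailableCapacity = Capacity - sum of Load of the groups online there
    return capacity[system] - sum(load[g] for g in online[system])

def fail_over(group, failed_system):
    # Move the group to the remaining system with the most AvailableCapacity.
    online[failed_system].remove(group)
    target = max((s for s in online if s != failed_system),
                 key=available_capacity)
    online[target].append(group)
    return target

print(fail_over("Oracle8i", "Svr1"))  # Svr2, which had 200 available
print(available_capacity("Svr2"))     # 50 = 300 - (100 + 150)
```

The server failure traced on the next slides follows from the same model by failing over NFS2 and NFS3 in turn.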
Tracing a Server Failure
After the Oracle 8i failover, the server that had been running Oracle 8i and NFS1 now carries only NFS1 and has 225 AvailableCapacity (300 - 75). The server hosting NFS2 and NFS3 (Capacity 150) FAILS.
1. VCS brings NFS2 online on the server with 225 AvailableCapacity.
2. VCS recalculates AvailableCapacity based on the new Load: 300 - (75 + 75) = 150.
Completing Fail Over
AvailableCapacity = Capacity - Load
NFS3 still requires a failover target.
1. VCS brings NFS3 online on the server with 175 AvailableCapacity (the server running Sybase).
2. VCS recalculates AvailableCapacity based on the new Load: 300 - (125 + 75) = 100.
Setting Load and Capacity
• The Load and Capacity attributes are user-defined values.
• Set attributes using the hagrp and hasys commands.
• Examples:
hasys -modify LgSrv1 Capacity 300
hagrp -modify OracleSG Load 150
• AvailableCapacity is calculated by VCS:
AvailableCapacity = Capacity - Load
Dynamic Load Balancing
External software monitors CPU utilization (30, 40, 75, and 80 percent utilization for the four systems shown on the slide).
The software sets the DynamicLoad attribute according to the system Capacity value using hasys -load system value.
For example, if CPU utilization is 30 percent and Capacity is set to 300, DynamicLoad is set to 90 (30 percent of 300).
AvailableCapacity = Capacity - DynamicLoad
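The DynamicLoad arithmetic described above can be sketched as follows; the helper name is hypothetical, not a VCS API:

```python
# Illustrative sketch of the DynamicLoad calculation (the helper name is
# hypothetical, not a VCS API).
def dynamic_load(cpu_percent, capacity):
    # DynamicLoad is the utilized fraction of the system's Capacity.
    return capacity * cpu_percent // 100

# The slide's example: 30% CPU utilization with Capacity 300.
print(dynamic_load(30, 300))        # DynamicLoad = 90
print(300 - dynamic_load(30, 300))  # AvailableCapacity = 210
```

An external monitor would push the computed value into VCS with hasys -load system value.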
The LoadWarning Trigger
Runs when a system has been running at a specified percentage of its Capacity level for a specified period of time.
Configured by placing a loadwarning script in /opt/VRTSvcs/bin/triggers and setting system attributes.
This example configuration in main.cf causes VCS to run the trigger if system Svr4 runs at 90 percent of capacity for ten minutes (600 seconds):
System Svr4 ( Capacity=100 LoadWarningLevel=90 LoadTimeThreshold=600 )
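The trigger's firing condition can be modeled as follows. This is an illustrative sketch; the function name and the sampling scheme are assumptions, not the VCS implementation:

```python
# Illustrative model of the loadwarning condition (not VCS source): the
# trigger fires once a system has stayed at or above LoadWarningLevel
# percent of Capacity for LoadTimeThreshold seconds.
def should_run_loadwarning(samples, capacity, warning_level=90,
                           time_threshold=600):
    """samples: list of (timestamp_seconds, current_load), in time order."""
    threshold_load = capacity * warning_level / 100
    over_since = None
    for ts, load in samples:
        if load >= threshold_load:
            if over_since is None:
                over_since = ts          # start of the over-threshold run
            if ts - over_since >= time_threshold:
                return True              # sustained for LoadTimeThreshold
        else:
            over_since = None            # dipped below: reset the clock
    return False

# Svr4 from the slide: Capacity=100, LoadWarningLevel=90,
# LoadTimeThreshold=600; load stays >= 90 for ten minutes.
print(should_run_loadwarning([(0, 95), (300, 92), (600, 91)], capacity=100))
```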
System Limits
1. Define system Limits based on the server properties:
Limits = {Processors=4, Mem=512}
CurrentLimits = Limits - Prerequisites
Example: three servers each have Processors=4 and Mem=512; a fourth, smaller server has Processors=1 and Mem=128 (Limits = {Processors=1, Mem=128}).
Service Group Prerequisites
1. Define system Limits based on the server properties.
2. Define service group Prerequisites based on application requirements:
• iPlanet requires 1 processor and 184 Mb RAM: Prerequisites = {Processors=1, Mem=184}
• Sybase requires 1 processor and 212 Mb RAM: {Processors=1, Mem=212}
• Oracle requires 2 processors and 256 Mb RAM: {Processors=2, Mem=256}
• Each NFS share requires 1 processor and 48 Mb RAM: {Processors=1, Mem=48}
CurrentLimits = Limits - (sum of Prerequisites of the groups online). For example, a server with Limits = {Processors=4, Mem=512} running Oracle 8i and NFS1 has CurrentLimits = {Processors=1, Mem=208}; one running only iPlanet has {Processors=3, Mem=328}; one running only Sybase has {Processors=3, Mem=300}; and the small server with Limits = {Processors=1, Mem=128} running the remaining NFS shares has 32 Mb of memory remaining.
Combining Capacity and Limits
When used together, VCS determines the failover target as follows:
• Limits and Prerequisites are used to determine a subset of potential failover targets.
• Of this subset, the system with the highest value for AvailableCapacity is selected.
• If multiple systems have the same AvailableCapacity, the first system in SystemList is selected.
• Limits are hard values—if a system does not meet the Prerequisites, the service group cannot be started on that system.
• Capacity is a soft limit: the system with the highest AvailableCapacity is selected, even if the resulting AvailableCapacity is a negative number.
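These rules can be sketched as follows; the function and parameter names are hypothetical, not VCS source:

```python
# Illustrative model of combining Limits/Prerequisites with Load
# (not VCS source; function and parameter names are hypothetical).
def select_failover_target(system_list, current_limits, available_capacity,
                           prerequisites):
    # Hard check: a system qualifies only if its CurrentLimits cover every
    # Prerequisite of the service group.
    eligible = [s for s in system_list
                if all(current_limits[s].get(res, 0) >= need
                       for res, need in prerequisites.items())]
    if not eligible:
        return None  # Prerequisites unmet everywhere: group cannot start
    # Soft check: highest AvailableCapacity wins; Python's max keeps the
    # first of equal values, matching the "first system in SystemList"
    # tie-break.
    return max(eligible, key=lambda s: available_capacity[s])

target = select_failover_target(
    system_list=["SvrA", "SvrB", "SvrC"],
    current_limits={"SvrA": {"Processors": 1, "Mem": 128},
                    "SvrB": {"Processors": 3, "Mem": 304},
                    "SvrC": {"Processors": 1, "Mem": 96}},
    available_capacity={"SvrA": 75, "SvrB": 50, "SvrC": 200},
    prerequisites={"Processors": 2, "Mem": 256})
print(target)  # SvrB: the only system meeting the Prerequisites
```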
Failover Zones
Example: a six-system cluster. Systems sysa and sysb form the preferred failover zone for the Database service group; systems sysc, sysd, syse, and sysf form the preferred failover zone for the Web service group. The SystemList for both service groups includes all systems in the cluster.
SystemZones Attribute
• Used to define the preferred failover zones for each service group.
• If the service group is online in a system zone, it fails over to other systems in the same zone, based on the FailOverPolicy, until there are no systems available in that zone.
• When there are no other systems for failover in the same zone, VCS chooses a system in a new zone from the SystemList, based on the FailOverPolicy.
• To define SystemZones:
– Syntax:
hagrp -modify group_name SystemZones \
sys1 zone# sys2 zone# …
– Example:
hagrp -modify OracleSG SystemZones sysa \
0 sysb 0 sysc 1 sysd 1 syse 1 sysf 1
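Zone-aware target selection can be sketched as follows. This is illustrative, not VCS source; the function name is hypothetical, and FailOverPolicy=Load is assumed within each zone:

```python
# Illustrative model of SystemZones-aware selection (not VCS source;
# the function name is hypothetical). FailOverPolicy=Load is assumed.
def select_in_zone(current_system, candidates, system_zones,
                   available_capacity):
    zone = system_zones[current_system]
    same_zone = [s for s in candidates if system_zones[s] == zone]
    # Stay in the current zone while any candidate remains there;
    # only fall back to other zones when the zone is exhausted.
    pool = same_zone or candidates
    return max(pool, key=lambda s: available_capacity[s])

zones = {"sysa": 0, "sysb": 0, "sysc": 1, "sysd": 1, "syse": 1, "sysf": 1}
cap = {"sysb": 100, "sysc": 250, "sysd": 300}
# Group online on sysa (zone 0): sysb is preferred over the higher-capacity
# zone-1 systems because it shares the zone.
print(select_in_zone("sysa", ["sysb", "sysc", "sysd"], zones, cap))  # sysb
```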
Controlling Failover Behavior with Resource Type Attributes
• RestartLimit
– Affects how the agent responds to a resource fault
– Default: 0
• ConfInterval
– Determines the amount of time within which a tolerance or restart counter can be incremented
– Default: 600 seconds
• ToleranceLimit
– Enables the monitor entry point to return OFFLINE several times before the resource is declared FAULTED
– Default: 0
Restart Example
• RestartLimit=1
Resource can be restarted one time within the ConfInterval time frame.
• ConfInterval=180
Resource can be restarted once within a three-minute interval.
• MonitorInterval=60 seconds (default value)
Resource is monitored every 60 seconds.
[Timeline: the resource faults and is restarted once; a second fault within the same ConfInterval leaves the resource Faulted.]
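The restart accounting described above can be modeled in a few lines. This is an illustrative sketch, not agent source; the function and state names are hypothetical:

```python
# Illustrative model of restart accounting (not agent source): a faulted
# resource is restarted up to RestartLimit times; the restart counter
# resets once the resource has stayed online for ConfInterval seconds.
def handle_monitor_offline(state, now, restart_limit=1, conf_interval=180):
    """state: dict with 'restarts' count and 'online_since' timestamp."""
    if now - state["online_since"] >= conf_interval:
        state["restarts"] = 0          # stable for ConfInterval: reset counter
    if state["restarts"] < restart_limit:
        state["restarts"] += 1
        state["online_since"] = now    # agent restarts the resource
        return "RESTARTED"
    return "FAULTED"                   # limit exhausted within ConfInterval

# MonitorInterval=60: faults detected at t=60 and t=120.
s = {"restarts": 0, "online_since": 0}
print(handle_monitor_offline(s, now=60))   # RESTARTED (first fault)
print(handle_monitor_offline(s, now=120))  # FAULTED (second fault in 180s)
```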
Adjusting Monitoring
• MonitorInterval:
– Default value is 60 seconds for most resource types.
– Consider reducing to 10 or 20 seconds for testing.
– Use caution when changing this value:
• Load is increased on cluster systems.
• Resources can fault if they cannot respond in the interval specified.
• OfflineMonitorInterval:
– Default is 300 seconds for most resource types.
– Consider reducing to 60 seconds for testing.
Modifying Resource Type Attributes
• Can be used to optimize agents
• Applied to all resources of the specified type
• Command line example:
hatype -modify FileOnOff MonitorInterval 5
Controlling Clean Behavior on Resource Faults
The ManageFaults attribute specifies whether VCS calls the Clean entry point when a resource faults. ManageFaults is a service group attribute.
• If the ManageFaults attribute is set to ALL, VCS calls the Clean entry point when a resource faults.
• If the ManageFaults attribute is set to NONE, VCS takes no action on a resource fault; it "hangs" the service group until administrative action can be taken. VCS marks the resource state as ADMIN_WAIT and does not fail over the service group until the resource fault is removed and the ADMIN_WAIT state is cleared.
Clearing Resources in the ADMIN_WAIT State
To clear a resource:
1. Take the necessary actions outside VCS to bring all resources into the required state.
2. Verify that resources are in the required state, then issue the command:
# hagrp -clearadminwait group -sys system
This command clears the ADMIN_WAIT state for all resources. If VCS continues to detect resources that are not in the required state, it resets the resources to the ADMIN_WAIT state.
3. If resources continue in the ADMIN_WAIT state, repeat step 1 and step 2, or issue the following command to stop VCS from setting the resource to the ADMIN_WAIT state:
# hagrp -clearadminwait -fault group -sys system
Controlling Fault Propagation
• The FaultPropagation attribute defines whether a resource fault is propagated up the resource dependency tree. It also defines whether a resource fault causes a service group failover.
• If the FaultPropagation attribute is set to 1 (default), a resource fault is propagated up the dependency tree. If a resource in the path is critical, the service group is taken offline and failed over, provided the AutoFailOver attribute is set to 1.
• If FaultPropagation is set to 0, resource faults are contained at the resource level. VCS does not take the dependency tree offline, thus preventing failover. If other resources in the service group remain online, the service group remains in the PARTIAL|FAULTED state. If all resources are offline or faulted, the service group remains in the OFFLINE|FAULTED state.
Preventing Failover
• A frozen service group does not fail over when a critical resource faults.
• Service group must be unfrozen to enable fail over.
• To freeze a service group:
hagrp -freeze service_group [-persistent]
• To unfreeze a service group:
hagrp -unfreeze service_group [-persistent]
• A persistent freeze:
– Requires the cluster configuration to be open
– Remains in effect even if VCS is stopped and restarted throughout the cluster
Clearing Faults
• Verify that the faulted resource is offline.
• Fix the problem that caused the fault and clean up any residual effects.
• To clear a fault, type:
hares -clear resource_name [-sys system_name]
• To clear all faults in a service group, type:
hagrp -clear group_name [-sys system_name]
• Faults on persistent resources are cleared by probing:
hares -probe resource_name -sys system_name
Probing Resources
• Causes VCS to immediately monitor the resource
• To probe a resource, type:
hares -probe resource_name -sys system_name
• You can clear a persistent resource by probing it after the underlying problem has been fixed.
Flushing Service Groups
• All online/offline agent processes are stopped.
• All resources in transitional states waiting to go online are taken offline.
• Propagation of the offline operation is stopped, but resources waiting to go offline remain in the transitional state.
• You must verify that physical or software resources are stopped at the operating system level after flushing, to avoid creating a concurrency violation.
• To flush a service group, type:
hagrp -flush group_name -sys system_name
Testing Failover
• Use test resources, such as FileOnOff, when applicable.
• Set lower values for MonitorInterval, OfflineMonitorInterval, and ConfInterval to detect faults more quickly.
• Manually online, offline, and switch the service group among all systems.
• Simulate failure of each resource in the service group.
• Simulate failover of the entire system.
Testing Examples
• Force a resource to fault.
• Reboot a system.
• Halt and reboot a system.
• Remove power from a system.