Copyright © 2008, Oracle. All rights reserved.
Oracle Clusterware for Sysadmins
Oracle Clusterware Architecture
• Primary components, functions, and layers
• Process architecture
• Process interaction (CSS, CRS, EVM, ONS)
• OPROCD integration
• Voting mechanism
• Heartbeating mechanism
• OCR (Oracle Cluster Registry)
Oracle Clusterware 10g
Oracle Clusterware 11g
Primary components, functions, and layers
• Portable cluster infrastructure that provides HA to RAC
databases and/or other applications:
– Monitors applications’ health
– Restarts applications on failure
– Can fail over applications on node failure
[Diagram: three-node cluster. Every node runs Oracle Clusterware from its CRS home on top of the system files. Node 1 and Node 2 each run a listener and a RAC database instance from an ORACLE_HOME, plus a protected application (App A on Node 1, App B on Node 2); Node 3 runs only the CRS home.]
Oracle Clusterware (OCW)
[Diagram: Nodes 1 through n are connected to the public network; each node runs EVMD, CRSD, OPROCD, ONS, and CSSD, and hosts a node VIP (VIP1 … VIPn). All nodes attach to shared storage holding the OCR and voting disks on raw devices. CSSD runs at real-time priority.]
Process architecture
[Diagram: init spawns oprocd, ocssd, crsd, and evmd. crsd spawns racgimon processes and action scripts (racgwrap + racgmain) and maintains the OCR; evmd spawns the permanent child evmlogger, which invokes racgevtf and callouts. The voting disks sit on shared storage. oprocd appears on Linux starting with 10.2.0.4/11.1.0.6.]
Oracle Clusterware daemons
• OCW comprises several daemons, each with a specific
function in the stack. The daemons are located in the
directory $CRS_HOME/bin. The following daemons exist
in 10.2.0.3 and later; depending on the platform and
whether third-party vendor clusterware is present,
some processes may not exist:
– ocssd.bin
– crsd.bin
– evmd.bin
– oclsvmon.bin
– oclsomon.bin
– oprocd
Oracle Clusterware daemons
• When these daemons are running, Oracle Clusterware
is fully started. They are started via the init.* scripts
(init.cssd, init.crsd, and init.evmd).
• Note that there are fewer init.* scripts than daemons;
this is because init.cssd starts more than one daemon:
– init.cssd starts ocssd.bin, oclsomon, oclsvmon, and
oprocd (the CSS family).
– init.crsd starts crsd.bin.
– init.evmd starts evmd.bin.
Oracle Clusterware “control” files
• Control files (also known as SCLS_SRC files)
• These files are used to control some aspects of OCW,
such as:
– Enabling/disabling processes from the CSSD family (e.g.,
oprocd, oclsvmon)
– Stopping the daemons (ocssd.bin, crsd.bin, etc.)
– Preventing OCW from being started when the machine boots
Oracle Clusterware daemon functionality
OCSSD
• OCSSD is part of RAC and of single-instance
deployments with ASM
• Provides access to node membership
• Provides group services
• Provides basic cluster locking
• Integrates with existing vendor clusterware, when
present
• Can also run without vendor clusterware integration
• Runs as the oracle user
• A failure exit causes a machine reboot:
– This is a feature to prevent data corruption in the
event of a split brain.
Oracle Clusterware daemon functionality
CRSD
• Engine for HA operations
• Manages 'application resources'
• Starts, stops, and fails over 'application resources'
• Spawns separate 'actions' to start/stop/check
application resources
• Maintains configuration profiles as well as resource
statuses in the OCR (Oracle Cluster Registry)
• Stores the current known state in the OCR
• Runs as root
• Is restarted automatically on failure
Oracle Clusterware daemon functionality
CRSD
• CRSD spawns dedicated processes called RACGIMON
that monitor the health of the database and ASM
instances and host various feature threads such as
Fast Application Notification (FAN).
• One RACGIMON process is spawned for each instance.
• CRSD can spawn temporary children to execute
particular actions such as:
– racgeut (Execute Under Timer), to kill actions that do not
complete after a certain amount of time
– racgmdb (Manage Database), to start/stop/check instances
– racgchsn (Change Service Name), to add/delete/check service
names for instances
– racgons, to add/remove ONS configuration in the OCR
– racgvip, to start/stop/check the instance virtual IP
Oracle Clusterware daemon functionality
EVMD
• Generates events when things happen
• Spawns a permanent child, evmlogger
• evmlogger spawns children on demand
• Scans the callout directory and invokes callouts
• Runs as the oracle user
• Is restarted automatically on failure
OPROCD: Oracle's fencing driver
• OPROCD is Oracle's cluster I/O fencing solution. It is
started on UNIX platforms only when vendor
clusterware is not running.
• OPROCD does not run on Windows, where the
equivalent function is provided by OraFenceService.
• It is used on Linux starting with 10.2.0.4.
• The OPROCD executable is intended to detect potential
node hangs.
– When it detects a potential node hang, it causes a
node reboot to ensure that, if the node has been
evicted by other cluster nodes, none of its processes
can issue I/O after the hang clears.
OPROCD: Oracle's fencing driver
• The OPROCD executable installs a signal handler for
SIGALRM and sets the interval timer based on the
to-millisec parameter provided.
• The alarm handler gets the current time and checks it
against the time the handler was last entered. If the
difference exceeds (to-millisec + margin-millisec), it
fails; the production version causes a node reboot.
• OPROCD takes two parameters:
– Timeout value (-t <to-millisec>): the length of time
between executions
– Margin (-m <margin-millisec>): the acceptable
leeway for dispatches
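The timer check described above can be sketched in shell. This is a simplified illustration only, not the real OPROCD (which is a C program driven by SIGALRM and an interval timer); the variable names and values are invented:

```shell
# Simplified illustration of the OPROCD check: compare the gap
# between consecutive wake-ups against timeout + margin.
TO_MS=1000      # stand-in for -t: interval between checks (ms)
MARGIN_MS=500   # stand-in for -m: acceptable scheduling leeway (ms)

last=$(date +%s%N)            # nanoseconds since epoch (GNU date)
sleep 1                       # stand-in for the interval timer firing
now=$(date +%s%N)

elapsed_ms=$(( (now - last) / 1000000 ))
if [ "$elapsed_ms" -gt $(( TO_MS + MARGIN_MS )) ]; then
    # The real OPROCD would reboot the node at this point.
    echo "HANG DETECTED: ${elapsed_ms}ms > $(( TO_MS + MARGIN_MS ))ms"
else
    echo "OK: woke up after ${elapsed_ms}ms"
fi
```

If scheduling delays push the measured gap past timeout + margin, the check fires; that is exactly the "potential node hang" condition of the previous slide.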
Manually Control Oracle Clusterware Stack
Might be needed for planned outages:
# crsctl stop crs
# crsctl start crs
# crsctl disable crs
# crsctl enable crs
MISSCOUNT: Important CSS Parameter
• Determines the CSS heartbeat timeout before node
eviction
• Has a default value of 30 seconds that is appropriate in
most cases
• Can be temporarily changed:
1. Shut down Oracle Clusterware on all nodes but one.
2. As root on the available node, use: crsctl set css misscount M+1
3. Reboot the available node.
4. Restart all other nodes.
• The default should never be changed when using non-
Oracle (vendor) clusterware
Multiplexing Voting Disks
• Voting disk is a vital resource for your cluster
availability.
• Use one voting disk if it is stored on a reliable disk.
• Otherwise, use mirrored voting disks:
– There is no need to rely on multipathing solutions.
– Mirrors should be stored on independent devices.
– Make sure that there is no I/O starvation for your voting
disks devices.
– Use at least three mirrors.
• CSS uses a simple majority rule to decide whether
voting disk reads are consistent: to tolerate f disk
failures, configure v = 2f + 1 voting disks.
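The v = 2f + 1 rule can be checked with a little shell arithmetic (illustrative only):

```shell
# Voting-disk majority rule: to tolerate f failed voting disks,
# a cluster needs v = 2*f + 1 disks, so that the surviving
# (f + 1) disks still form a strict majority of v.
f=1                                  # failures to tolerate
v=$(( 2 * f + 1 ))                   # disks required
surviving=$(( v - f ))
majority=$(( v / 2 + 1 ))            # strict majority of v

echo "disks=$v surviving=$surviving majority=$majority"
# With f=1: 3 disks, and the 2 survivors still meet the majority of 2.
```

This is why a single voting-disk failure is survivable with three disks, but losing two of three takes the cluster down.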
Change Voting Disk Configuration
• Voting disk configuration can be changed dynamically.
• To add a new voting disk:
# crsctl add css votedisk <new voting disk path>
• To remove a voting disk:
# crsctl delete css votedisk <old voting disk path>
• If Oracle Clusterware is down on all nodes, use the -force option:
# crsctl add css votedisk <new voting disk path> -force
# crsctl delete css votedisk <old voting disk path> -force
Back Up and Recover Your Voting Disks
• The recommendation is to use symbolic links.
• List the configured voting disks:
$ crsctl query css votedisk
• Back up one voting disk by using the dd command:
$ dd if=<voting disk path> of=<backup path> bs=4k
– After Oracle Clusterware installation
– After node addition or deletion
– Can be done online
• Recover voting disks by restoring the first one using the
dd command, and then mirror it if necessary.
• If no voting disk backup is available, reinstall Oracle
Clusterware.
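The dd backup/restore round trip can be rehearsed on an ordinary scratch file; the paths below are invented examples, and on a real cluster if= would point at the voting disk device:

```shell
# Simulate a voting-disk backup and restore with dd, using a
# scratch file in place of the real device (paths are examples).
votedisk=/tmp/fake_votedisk
backup=/tmp/votedisk.bak
restored=/tmp/fake_votedisk.restored

dd if=/dev/urandom of="$votedisk" bs=4k count=16 2>/dev/null  # dummy contents
dd if="$votedisk" of="$backup" bs=4k 2>/dev/null              # back up
dd if="$backup" of="$restored" bs=4k 2>/dev/null              # restore

# Verify the round trip preserved every byte.
cmp "$votedisk" "$restored" && echo "restore matches original"
```

The same bs=4k block size from the slide is used for both directions; dd copies the raw bytes, so the restored copy is bit-identical to the source.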
Heartbeat Mechanisms
• Two heartbeat mechanisms for cluster membership
– Network HeartBeat (NHB)
– Disk HeartBeat (DHB)
• The heartbeat mechanisms are used for different
purposes; they are not redundant:
– NHB for detection of loss of cluster viability
– DHB for network split resolution
Network HeartBeat (NHB)
• Indicates that a node can participate in cluster activities,
e.g., group membership changes
• When the NHB is missing for too long, a cluster
membership change (cluster reconfig) is required
• The definition of 'too long' is constant over time
(misscount)
• Loss of network connectivity is not necessarily fatal
Disk HeartBeat (DHB)
• The final word on whether a node is alive: when the
DHB is missing for too long, the node is assumed to
be dead
• When connectivity to a disk is lost for 'too long', the
disk is considered offline
• The definition of 'too long' varies:
– Most of the time, 'too long' is the 'long disk I/O time'
(LIOT), default 200 seconds
– During a cluster node membership change (reconfig),
it is the 'short disk I/O time', which is related to
misscount (misscount – reboottime; reboottime
defaults to 3 seconds)
• Connectivity to a majority of voting files must be
maintained for a node to stay active
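With the defaults above (misscount 30 s, reboottime 3 s, LIOT 200 s), the two disk-timeout windows work out as follows (a quick illustrative calculation):

```shell
# Disk-heartbeat timeout windows with default CSS settings.
misscount=30     # network heartbeat timeout (seconds)
reboottime=3     # time allowed for a reboot to complete (seconds)
liot=200         # long disk I/O time, steady state (seconds)

siot=$(( misscount - reboottime ))   # short disk I/O time, during reconfig

echo "steady-state disk timeout: ${liot}s"
echo "reconfig disk timeout:     ${siot}s"
# With the defaults: 200s steady state, 27s during a reconfig.
```

The window shrinks sharply during a reconfig because that is exactly when the disk heartbeat is being used to resolve a possible network split.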
CSS Logging
• Default logging levels differ:
– Production default is 1
– Test default is 2
• Changing the logging level in production:
– Execute as root on a node with the clusterware stack
up:
crsctl debug log css CSSD:N (N is the logging level)
– Execute on all nodes, or restart the stack on all other
nodes after executing
Diagnosability
• Stack dump now in the CSSD log
• Signals now trapped to allow printing of diagnostic
data for SEGVs, etc.
• Other diagnostic data printed prior to termination:
– Detailed logging
– Most of the memory
• Data may be lost if the node reboots before the log
buffers are flushed to disk:
– Set diagwait to allow data to be flushed to disk:
crsctl set css diagwait 13
(run as root on a node with the CRS stack up, then
restart the stack on all nodes)
OCR Architecture
[Diagram: each of the three nodes maintains an OCR cache used by its CRS process; client processes on Node 1 and Node 3 read through the local cache. The OCR primary file and the OCR mirror file reside on shared storage.]
Automatic OCR Backups
• The OCR content is critical to Oracle Clusterware.
• OCR is automatically backed up physically:
– Every four hours: CRS keeps the last three copies.
– At the end of every day: CRS keeps the last two copies.
– At the end of every week: CRS keeps the last two copies.
• Inspect the default automatic backup location:
$ cd $ORACLE_BASE/Crs/cdata/jfv_clus
$ ls -lt
-rw-r--r-- 1 root root 4784128 Jan 9 02:54 backup00.ocr
-rw-r--r-- 1 root root 4784128 Jan 9 02:54 day_.ocr
-rw-r--r-- 1 root root 4784128 Jan 8 22:54 backup01.ocr
-rw-r--r-- 1 root root 4784128 Jan 8 18:54 backup02.ocr
-rw-r--r-- 1 root root 4784128 Jan 8 02:54 day.ocr
-rw-r--r-- 1 root root 4784128 Jan 6 02:54 week_.ocr
-rw-r--r-- 1 root root 4005888 Dec 30 14:54 week.ocr
• Change the default automatic backup location:
# ocrconfig -backuploc /shared/bak
Back Up OCR Manually
• Take daily backups of your automatic OCR backups to a
different storage device:
– Use your favorite backup tool.
• Take logical backups of your OCR before and after
making significant changes:
# ocrconfig -export <file name>
• Make sure that you restore OCR backups that match
your current system configuration.
OCR Considerations
• If you use raw devices to store OCR files, make sure
they exist before running add or replace operations.
• You must be the root user to add, replace, or remove
an OCR file with ocrconfig.
• While you are adding or replacing an OCR file, its
mirror needs to be online.
• If you remove a primary OCR file, the mirror OCR file
becomes primary.
• Never remove the last remaining OCR file.
OCR / Voting disk placement and protection
• Oracle Clusterware files include the voting disks, used
to monitor cluster node status, and the Oracle Cluster
Registry (OCR), which contains configuration
information about the cluster. The voting disks and
OCR are shared files in a cluster or network file
system environment. If you do not use a cluster file
system, then you must place these files on shared
block devices or shared raw devices. Oracle Universal
Installer (OUI) automatically initializes the OCR during
the Oracle Clusterware installation.
OCR / Voting disk placement and protection
• For voting disk file placement, Oracle recommends that
each voting disk be configured so that it does not share
a hardware device, disk, or other single point of failure
with another voting disk. Any node that cannot access
an absolute majority of the configured voting disks
(more than half) will be restarted.
OCR / Voting disk placement and protection
• Critical cluster configuration repository and split-brain
resolution mechanism
• Oracle mirroring is available from 10g Release 2 onward:
– crsctl add css votedisk path
– ocrconfig -replace ocrmirror destination_file or disk
• Three mirrors are recommended for the voting disk:
– Split-brain resolution requires a majority of disks to
allow a sub-cluster to continue
Useful notes on MetaLink
Note: 259301.1 CRS and 10g Real Application Clusters
Note: 276434.1 Modifying the VIP of a Cluster Node
Note: 272332.1 Extended "CRS/CSS 10g Diagnostic Collection Guide"
Note: 268937.1 Repairing or Restoring an Inconsistent OCR in RAC
Note: 279793.1 How to Restore a Lost Voting Disk in 10g
Note: 240001.1 Troubleshooting CRS Root.sh Problems
Note: 265769.1 Troubleshooting CRS Reboots
Note: 289690.1 Data Gathering for Troubleshooting RAC and CRS issues
Note: 301137.1 OS Watcher User Guide (OS Watcher is available at http://coe.oraclecorp.com/pls/prod/osw/)
Note: 301138.1 RAC-DDT User Guide (RAC Diag tool: http://coe.oraclecorp.com/pls/prod/racddt)
Note: 357808.1 Diagnosability for CRS / EVM / RACG
Note: 338706.1 Cluster Ready Services (CRS) rolling upgrade
Note: 391116.1 10.2.0.3 Patch Set - List of Bug Fixes by Problem Type
Note: 401435.1 10.2.0.3 Patch Set - Known Issues
Note: 390880.1 OCR Corruption after Adding/Removing voting disk to a cluster when CRS stack is running
Note: 459694.1 Procwatcher: Script to Monitor and Examine Oracle and CRS Processes
Note: 239998.1 10g RAC How to Clean Up After a Failed CRS Install
Note: 269320.1 Removing a Node from a 10g RAC Cluster
Note: 272332.1 CRS 10g Diagnostic Collection Guide
Q U E S T I O N S
A N S W E R S