Oracle Clusterware

8/9/2019 Oracle Clusterware


Oracle Real Application Clusters (Oracle RAC) Session 1: Oracle 10g/11gR2 RAC Architecture

Ahmed Fathi - Senior Oracle Consultant

Email:[email protected]

Blog:http://ahfathi.blogspot.com

LinkedIn:http://linkedin.com/in/ahmedfathieg

Oracle Clusterware

Oracle Clusterware is software that enables servers to operate together as if they were one server. Each server looks like a standalone server, but each one runs additional processes that communicate with the other servers. As a result, the separate servers appear as one server to applications and end users.

Starting with version 10g Release 1, Oracle introduced its own portable cluster software, Cluster Ready Services. The product was renamed Oracle Clusterware in version 10g Release 2, and from 11g Release 2 it is part of the Oracle Grid Infrastructure software.



The benefits of using a cluster include:

Scalability: multiple nodes allow a cluster database to scale beyond the limits of a single-node database.

Availability: if a node fails, the other nodes in the cluster keep running and clients can continue working on them.

Manageability: more than one database can be managed by Oracle Clusterware.

Ability to monitor processes and restart them if they stop

Eliminate unplanned downtime due to hardware or software malfunctions

Reduce or eliminate planned downtime for software maintenance

There are two kinds of cluster: active/active and active/passive.

Active/passive

This setup usually has two nodes: one node is available (active) and the other is not (passive). The Oracle software resides on shared storage and runs on the active node only. If the active node fails, the cluster switches the shared storage over to the passive node, so the roles are swapped: the passive node becomes active and the failed node becomes passive. A number of third-party products support this kind of cluster on platforms such as Microsoft Windows and Linux.

It is usually called an OS cluster.

Active/Active

In this setup, Oracle instances run concurrently on both servers, and clients access both servers at the same time. The instances communicate with each other (heartbeat) to verify that both servers are available. If one server goes down, the other server can handle the workload. The benefit of active/active is that the workload can be shared between the servers.

Voting Disk and Cluster Registry

Voting Disk: A voting disk is a shared disk that is accessed by all member nodes of the cluster. It stores the cluster membership information and keeps the heartbeat information between the nodes. If a node is unable to ping the voting disk, the cluster immediately recognizes the communication failure and evicts the node from the cluster.

The voting disk is used to determine which instance takes control of the cluster in case of node failure, in order to avoid split brain.
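The eviction rule above can be illustrated with a small Python sketch. It assumes the strict-majority rule (a node must reach more than half of the configured voting disks), which is how Oracle Clusterware is generally documented to behave when several voting disks are configured. The function name and parameters are hypothetical and are not part of any Oracle API.

```python
def has_voting_disk_quorum(reachable: int, configured: int) -> bool:
    """Return True if a node can access a strict majority of voting disks.

    Assumption: a node stays in the cluster only while it can reach more
    than half of the configured voting disks.
    """
    if configured <= 0:
        raise ValueError("at least one voting disk must be configured")
    if not 0 <= reachable <= configured:
        raise ValueError("reachable must be between 0 and configured")
    return reachable > configured // 2


# Usage: with three voting disks, losing one is tolerated, losing two is not.
print(has_voting_disk_quorum(reachable=2, configured=3))
print(has_voting_disk_quorum(reachable=1, configured=3))
```

Under this assumption an odd number of voting disks is the natural choice: with two disks, losing either one already breaks the majority.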

Oracle Cluster Registry (OCR): Stores and manages configuration information about the cluster resources managed by Oracle Clusterware, such as Oracle RAC databases, database instances, listeners, VIPs, servers and applications.

Oracle Local Registry (OLR, 11gR2): Similar to OCR and introduced in 11gR2, but it stores information about the local node only. It is not shared with the other nodes of the cluster and is used by OHASD while starting or joining a cluster.






RAC Components

Shared Disk System

Oracle Clusterware Stack

Cluster Interconnects

Oracle Kernel Components

Shared Disk System

Below are the three major types of shared storage used in RAC:

Raw volumes: A raw logical volume is an area of physical and logical disk space that is under the direct control of an application, such as a database or partition, rather than under the direct control of the operating system or a file system.

Cluster file system: This option is not widely used. A cluster file system, such as Oracle Cluster File System (OCFS) for MS Windows and Linux, holds all the datafiles of the RAC database.

Automatic Storage Management (ASM): The storage option recommended by Oracle. Introduced in Oracle 10g, it provides optimized cluster file system functionality for Oracle database files.

Oracle Clusterware Stack 10g/11gR1

The Oracle Clusterware comprises several background processes that facilitate cluster operations. The

Cluster Synchronization Services (CSS), Event Management (EVM), and Oracle Clusterware components

communicate with other cluster component layers in the other instances within the same cluster database

environment. These components are also the main communication links between the Oracle Clusterware

high availability components and the Oracle Database. In addition, these components monitor and manage

database operations.

Cluster Synchronization Services (CSS): Manages the cluster configuration by controlling which nodes

are members of the cluster and by notifying members when a node joins or leaves the cluster.

Cluster Ready Services (CRS): The primary program for managing high availability operations within a cluster. Anything that the crs process manages is known as a cluster resource, which could be a database, an instance, a service, a Listener, a virtual IP (VIP) address, an application process, and so on. The crs process manages cluster resources based on the resource's configuration information that is stored in the OCR. This includes start, stop, monitor and failover operations. The crs process generates events when a resource status changes. When you have installed Oracle RAC, crs monitors the Oracle instance, Listener, and so on, and automatically restarts these components when a failure occurs. By default, the crs process makes five attempts to restart a resource and then does not make further restart attempts if the resource does not restart.
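The restart behavior described above can be sketched in Python as a bounded retry loop. This is only an illustration of the idea, not how crs is implemented: the function name, the callable used to start the resource, and the return convention are hypothetical. Only the default of five attempts comes from the text.

```python
from typing import Callable

DEFAULT_RESTART_ATTEMPTS = 5  # default number of attempts stated in the text


def restart_resource(start: Callable[[], bool],
                     max_attempts: int = DEFAULT_RESTART_ATTEMPTS) -> int:
    """Try to start a resource up to max_attempts times.

    Returns the attempt number that succeeded, or 0 if every attempt failed
    (after which no further restart attempts are made).
    """
    for attempt in range(1, max_attempts + 1):
        if start():
            return attempt
    return 0


# Usage: a hypothetical resource that only comes up on the third try.
outcomes = iter([False, False, True])
print(restart_resource(lambda: next(outcomes)))
```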

Event Management (EVM): A background process that publishes events that crs creates.

Oracle Notification Service (ONS): Allows Clusterware events to be propagated to the nodes in the cluster, middle-tier application servers, and clients. EVMD publishes events through ONS.

RACG: Extends clusterware to support Oracle-specific requirements and complex resources. Runs server

callout scripts when FAN events occur.

Process Monitor Daemon (OPROCD): This process is locked in memory to monitor the cluster and

provide I/O fencing. OPROCD performs its check, stops running, and if the wake up is beyond the

expected time, then OPROCD resets the processor and reboots the node. An OPROCD failure results in

Oracle Clusterware restarting the node. OPROCD uses the hangcheck timer on Linux platforms.
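The check described above (sleep, wake up, compare the wake-up time with the expected time) can be sketched in Python as follows. This is a simplified model and not OPROCD itself: the function names, the interval and the margin are hypothetical, and the sketch only reports a hang instead of resetting the node.

```python
import time
from typing import Callable


def overslept(expected_wake: float, actual_wake: float, margin: float) -> bool:
    """True if the wake-up came later than the expected time plus the margin."""
    return (actual_wake - expected_wake) > margin


def hang_check_once(interval: float, margin: float,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Sleep for `interval` seconds and report whether the node looks hung.

    A real fencing daemon would reset the node when this returns True;
    here the decision is only returned to the caller.
    """
    start = clock()
    sleep(interval)
    return overslept(start + interval, clock(), margin)
```

Passing `sleep` and `clock` as parameters keeps the check testable: a caller can inject a fake clock to simulate a stalled node without actually waiting.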

Oracle Clusterware Stack 11gR2

Oracle Clusterware consists of two separate stacks: an upper stack anchored by the Cluster Ready Services

(CRS) daemon (crsd) and a lower stack anchored by the Oracle High Availability Services daemon (ohasd).

These two stacks have several processes that facilitate cluster operations. The following sections describe

these stacks in more detail:

- The Cluster Ready Services Stack

The list in this section describes the processes that comprise CRS. The list includes components that are processes on

Linux and UNIX operating systems, or services on Windows.

Cluster Ready Services (CRS): The primary program for managing high availability operations in a cluster. The CRS daemon (crsd) manages cluster resources based on the configuration information that is stored in OCR for each resource. This includes start, stop, monitor, and failover operations. The crsd process generates events when the status of a resource changes. When you have Oracle RAC installed, the crsd process monitors the Oracle database instance, listener, and so on, and automatically restarts these components when a failure occurs.

Cluster Synchronization Services (CSS): Manages the cluster configuration by controlling which nodes are members of the cluster and by notifying members when a node joins or leaves the cluster.

The cssdagent process monitors the cluster and provides I/O fencing. This service formerly was provided by Oracle Process Monitor Daemon (oprocd), also known as OraFenceService on Windows. A cssdagent failure may result in Oracle Clusterware restarting the node.

Oracle ASM: Provides disk management for Oracle Clusterware and Oracle Database.

Cluster Time Synchronization Service (CTSS): Provides time management in a cluster for Oracle

Clusterware.






Event Management (EVM): A background process that publishes events that Oracle Clusterware creates.

Oracle Notification Service (ONS): A publish and subscribe service for communicating Fast Application Notification (FAN) events.

Oracle Agent (oraagent): Extends clusterware to support Oracle-specific requirements and complex resources. This process runs server callout scripts when FAN events occur. This process was known as RACG in Oracle Clusterware 11g release 1 (11.1).

Oracle Root Agent (orarootagent): A specialized oraagent process that helps crsd manage resources owned by root, such as the network, and the Grid virtual IP address.

- The Oracle High Availability Services Stack

This section describes the processes that comprise the Oracle High Availability Services stack. The list

includes components that are processes on Linux and UNIX operating systems, or services on Windows.

Cluster Logger Service (ologgerd): Receives information from all the nodes in the cluster and persists it in a CHM repository-based database. This service runs on only two nodes in a cluster.

System Monitor Service (osysmond): The monitoring and operating system metric collection service that sends the data to the cluster logger service. This service runs on every node in a cluster.

Grid Plug and Play (GPNPD): Provides access to the Grid Plug and Play profile, and coordinates updates to the profile among the nodes of the cluster to ensure that all of the nodes have the most recent profile.

Grid Interprocess Communication (GIPC): A support daemon that enables Redundant Interconnect Usage.

Multicast Domain Name Service (mDNS): Used by Grid Plug and Play to locate profiles in the cluster, as well as by GNS to perform name resolution. The mDNS process is a background process on Linux and UNIX and on Windows.

Oracle Grid Naming Service (GNS): Handles requests sent by external DNS servers, performing name resolution for names defined by the cluster.

Cluster Interconnects

The cluster interconnect is the communication path used by the cluster for the synchronization of resources. In some cases it is also used to transfer data from one instance to another. Typically, the interconnect is a network connection that is dedicated to the server nodes of a cluster (and is therefore sometimes referred to as the private interconnect).






Oracle Kernel Components

The set of additional background processes in each instance is known as the Oracle kernel components in a RAC environment. Because the buffer cache and shared pool become global in RAC, special handling is required to manage the resources and avoid conflicts and corruption. The additional RAC background processes work together with the single-instance background processes to achieve this.

Global Cache and Global Enqueue Services

RAC Database System has two important services. They are Global Cache Service (GCS) and Global Enqueue

Service (GES). These are basically collections of background processes and memory structures. These two

services GCS and GES together manage the total Cache Fusion process, resource transfers, and resource

acquisition among the instances.

In Oracle RAC each instance has its own cache, but an instance may need to access data blocks that currently reside in another instance's cache. This management and data sharing is done by the Global Cache Service (GCS). Resources other than data blocks that are shared across the instances, such as locks and enqueue details, are managed by the Global Enqueue Service (GES).

The Global Cache Service employs various background processes such as the Global Cache Service Processes (LMSn) and the Global Enqueue Service Daemon (LMD).

Global Resource Directory

The Global Resource Directory (GRD) is an internal in-memory database, stored on all of the running instances, that records the current status of resources (data blocks) and enqueues. The GRD is maintained by GES and GCS. Whenever a block is transferred out of a local cache to another instance's cache, the GRD is updated. The following resource information is available in the GRD:

- Data Block information such as file # and block #

- Location of most current version

- Modes of the data blocks: (N)Null, (S)Shared, (X)Exclusive
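To make the three kinds of information above concrete, the following Python sketch models a GRD-like directory as a dictionary keyed by (file #, block #). It is a deliberately simplified illustration under stated assumptions: the class and method names are hypothetical, and the mode downgrade rules on transfer are a simplification that ignores details of real Cache Fusion such as past images.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Tuple


class Mode(Enum):
    NULL = "N"
    SHARED = "S"
    EXCLUSIVE = "X"


@dataclass
class BlockEntry:
    file_no: int
    block_no: int
    current_holder: str  # instance holding the most current version
    modes: Dict[str, Mode] = field(default_factory=dict)  # instance -> mode


class ResourceDirectory:
    """Toy directory that tracks who holds each block and in which mode."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, int], BlockEntry] = {}

    def register(self, file_no: int, block_no: int,
                 instance: str, mode: Mode) -> None:
        self._entries[(file_no, block_no)] = BlockEntry(
            file_no, block_no, instance, {instance: mode})

    def lookup(self, file_no: int, block_no: int) -> BlockEntry:
        return self._entries[(file_no, block_no)]

    def transfer(self, file_no: int, block_no: int,
                 to_instance: str, mode: Mode) -> None:
        """Record that a block was shipped to another instance's cache."""
        entry = self._entries[(file_no, block_no)]
        for inst, held in entry.modes.items():
            if mode is Mode.EXCLUSIVE:
                entry.modes[inst] = Mode.NULL      # only one exclusive holder
            elif mode is Mode.SHARED and held is Mode.EXCLUSIVE:
                entry.modes[inst] = Mode.SHARED    # simplified downgrade
        entry.modes[to_instance] = mode
        entry.current_holder = to_instance


# Usage: inst1 holds block (4, 128) exclusively, then inst2 requests it.
grd = ResourceDirectory()
grd.register(4, 128, "inst1", Mode.EXCLUSIVE)
grd.transfer(4, 128, "inst2", Mode.EXCLUSIVE)
print(grd.lookup(4, 128).current_holder)
```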

Oracle RAC Background Processes

LMS Global Cache Service Process

LMD Global Enqueue Service Daemon

LMON Global Enqueue Service Monitor

LCK0 Instance Enqueue Process

DIAG Diagnosability Daemon






Global Cache Service Processes (LMSn)

LMS (Lock Manager Server process) is used in Cache Fusion. It enables consistent copies of blocks to be transferred from a holding instance's buffer cache to a requesting instance's buffer cache without a disk write under certain conditions.

It also rolls back uncommitted changes in any block that is requested for a consistent read by the remote instance.

Global Enqueue Service Daemon (LMD)

The LMD (Lock Manager Daemon) process manages enqueue service requests for GCS. It also handles deadlock detection and remote resource requests.

Global Enqueue Service Monitor (LMON)

The LMON (Lock Monitor) process is responsible for managing the Global Enqueue Service (GES).

It maintains the consistency of GCS memory in case of any process death. LMON is also responsible for cluster reconfiguration when an instance joins or leaves the cluster. It also checks for instance death and listens for local messages.

Instance Enqueue Process (LCK)

The LCK0 process manages non-Cache Fusion resource requests such as library and row cache requests.

Diagnosability Daemon (DIAG)

This background process monitors the health of the instance and captures diagnostic data about process failures within instances. The operation of this daemon is automated, and it updates an alert log file to record the activity that it performs.

Clusterware and heartbeat mechanism

The cluster needs to know which nodes are members at all times. Oracle Clusterware uses two types of heartbeat:

1. Network heartbeat

- Performed once per second.

- A node is evicted from the cluster if it fails to send a network heartbeat within the allowed time frame.

2. Disk (voting disk) heartbeat

- Each node of the cluster writes a disk heartbeat to the voting disk every second.

- A node is evicted from the cluster if its disk heartbeat is not updated within the I/O timeout (misscount/disktimeout).
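The two heartbeat checks above can be sketched as a single eviction test in Python. This is a minimal model under stated assumptions: the data class, the function name and the default timeout values (30 seconds for the network heartbeat, 200 seconds for the disk heartbeat) are illustrative and should be replaced with the actual CSS settings of the cluster.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatState:
    last_network: float  # time of the last network heartbeat, in seconds
    last_disk: float     # time of the last voting-disk heartbeat, in seconds


def should_evict(state: HeartbeatState, now: float,
                 misscount: float = 30.0,
                 disktimeout: float = 200.0) -> bool:
    """True if either heartbeat has been silent longer than its timeout.

    The default timeouts are assumptions used for illustration only.
    """
    network_missed = (now - state.last_network) > misscount
    disk_missed = (now - state.last_disk) > disktimeout
    return network_missed or disk_missed


# Usage: network heartbeat last seen 45 s ago, disk heartbeat 1 s ago.
node = HeartbeatState(last_network=955.0, last_disk=999.0)
print(should_evict(node, now=1000.0))
```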






What is misscount in Oracle RAC?

Cluster Synchronization Services (CSS) in RAC has a misscount parameter. This value represents the maximum time, in seconds, that a network heartbeat can be missed before the cluster enters a reconfiguration in order to evict the node. The default value is 60 seconds on Linux in 10g; in 11g it is 30 seconds.

I/O Fencing

There are situations in which leftover write operations from database instances still reach the storage system after the cluster function on a node has failed, because the node is still running at the OS level. Since these operations are no longer in serial order, they can damage the consistency of the stored data. Therefore, when a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or disk groups. This methodology is called I/O fencing, disk fencing or failure fencing.

Functions of I/O fencing

I/O fencing prevents updates by failed instances, detects failure, and prevents split-brain in the cluster.

The cluster volume manager and the cluster file system play a significant role in preventing failed nodes from accessing shared devices. Oracle uses an algorithm common to STONITH (shoot the other node in the head) implementations to determine which nodes need to be fenced. In a classic STONITH setup, the healthy nodes kill the sick node. Oracle Clusterware does not do this; instead, it simply gives the message "Please Reboot" to the sick node. The node bounces itself and rejoins the cluster.

There are other methods of fencing that are utilized by different hardware/software vendors. When using

Veritas Storage Foundation for RAC (VxSF RAC), you can implement I/O fencing instead of node

fencing. This means that instead of asking a server to reboot, you simply close it off from shared storage.

In versions before 11.2.0.2 Oracle Clusterware tried to prevent a split-brain with a fast reboot (better:

reset) of the server(s) without waiting for ongoing I/O operations or synchronization of the file systems.






This mechanism was changed in version 11.2.0.2 (the first 11g Release 2 patch set). After deciding which node to evict, the Clusterware:

- attempts to shut down all Oracle resources/processes on the server (especially processes generating I/O)

- stops itself on the node

- afterwards, the Oracle High Availability Services daemon (OHASD) tries to start the Cluster Ready Services (CRS) stack again; once the cluster interconnect is back online, all relevant cluster resources on that node start automatically

- kills the node if stopping the resources or the processes generating I/O is not possible (hanging in kernel mode, I/O path, etc.)
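The eviction sequence above can be summarized as a small decision flow. The Python sketch below only mirrors the order of the steps in the list; the function, the enum and the four callables are hypothetical stand-ins and do not correspond to real Clusterware interfaces.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    STACK_RESTARTED = "clusterware stack restarted without a reboot"
    NODE_KILLED = "node killed"


def fence_node(stop_io_resources: Callable[[], bool],
               stop_clusterware: Callable[[], None],
               restart_crs_stack: Callable[[], None],
               kill_node: Callable[[], None]) -> Outcome:
    """Follow the 11.2.0.2 eviction steps in order.

    stop_io_resources returns False when the resources or I/O-generating
    processes cannot be stopped, in which case the node is killed.
    """
    if not stop_io_resources():
        kill_node()
        return Outcome.NODE_KILLED
    stop_clusterware()
    restart_crs_stack()  # done by OHASD in the description above
    return Outcome.STACK_RESTARTED


# Usage: a node whose I/O-generating processes hang and cannot be stopped.
result = fence_node(lambda: False, lambda: None, lambda: None, lambda: None)
print(result.value)
```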

Generally, Oracle Clusterware uses two rules to choose which nodes should leave the cluster in order to preserve cluster integrity:

- In configurations with two nodes, the node with the lowest ID (the first node that joined the cluster) survives, and the other node is asked to leave the cluster.

- With more cluster nodes, the Clusterware tries to keep the largest sub-cluster running.
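The two rules above can be combined into one selection function. The Python sketch below keeps the largest sub-cluster and, when sizes are equal, the sub-cluster that contains the lowest node ID. Applying the lowest-ID rule as a tie-breaker for clusters with more than two nodes is an assumption of this sketch, and the function name is hypothetical.

```python
from typing import List, Set


def choose_survivors(sub_clusters: List[Set[int]]) -> Set[int]:
    """Pick the sub-cluster that stays up after the interconnect splits.

    Largest sub-cluster wins; on a tie, the one containing the lowest
    node ID wins (assumed generalization of the two-node rule).
    """
    if not sub_clusters or any(not members for members in sub_clusters):
        raise ValueError("need at least one non-empty sub-cluster")
    return max(sub_clusters, key=lambda members: (len(members), -min(members)))


# Usage: a two-node split, then a three-node split.
print(choose_survivors([{1}, {2}]))
print(choose_survivors([{1}, {2, 3}]))
```

In the two-node case both sub-clusters have size one, so the tie-breaker selects node 1; in the three-node case the pair of nodes 2 and 3 outnumbers node 1 and survives.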

When does a node reboot?

- Network failure: interconnect

- Slow interconnect (latency): must fail 30 consecutive times

- Voting disk I/O: cannot read or write

- CPU-bound: CPU is too busy to maintain the heartbeat

- Human error: files moved, deleted or changed

- Configuration error: wrong network for the private interconnect

- ocssd process died

- Some Oracle Clusterware bug






Split-Brain scenario

The term "Split-Brain" is often used to describe the scenario when two or more co-operating processes in a

distributed system, typically a high availability cluster, lose connectivity with one another but then continue

to operate independently of each other, including acquiring logical or physical resources, under the

incorrect assumption that the other process(es) are no longer operational or using the said resources.

Fast Application Notification (FAN)

The purpose of Fast Application Notification (FAN) events is to notify clients about RAC availability and instance (actually service) performance. The client does not actively check the availability or load of an instance and is no longer tied to an instance once connected. The nodes directly inform the application server about which instance is able to provide a defined Quality of Service.

FAN is a method, introduced in Oracle 10.1, by which applications can be informed of changes in cluster status for fast node failure detection and workload balancing.






FAN is advantageous because it prevents applications from waiting for TCP/IP timeouts when a node fails, trying to connect to a database service that is currently down, and processing data received from a failed node.

Applications can be notified using server-side callouts, Fast Connection Failover (FCF), or the ONS API.

Why Use Virtual IP?

The goal is application availability.

When a node fails, the VIP associated with it is automatically failed over to another node. When this occurs, the following happens:

- The VIP detects the public network failure, which generates a FAN event.

- The new node announces to the network (via ARP) the new MAC address for the VIP.

- Clients connected through the VIP immediately receive an ORA-3113 error or equivalent.

- New connection requests rapidly traverse the tnsnames.ora address list, skipping over the dead nodes, instead of having to wait on TCP/IP timeouts.

Without a VIP, clients connected to a node that died will often wait for a TCP/IP timeout period (which can be up to 10 minutes) before getting an error. As a result, you do not really have a good high availability solution without using VIPs.
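The address list traversal described above can be sketched in Python. The sketch assumes that, because the failed node's VIP has moved to a surviving node, a connection attempt to it is rejected immediately instead of hanging until a TCP/IP timeout. The host names and the `try_connect` callable are hypothetical, and the set lookup merely simulates that fast rejection.

```python
from typing import Callable, Optional, Sequence


def first_available(addresses: Sequence[str],
                    try_connect: Callable[[str], bool]) -> Optional[str]:
    """Walk the address list in order and return the first address that accepts.

    try_connect is expected to fail fast for a dead node's VIP, so skipping
    it costs almost nothing.
    """
    for address in addresses:
        if try_connect(address):
            return address
    return None


# Usage: node1 is down; its VIP rejects connections straight away.
address_list = ["node1-vip", "node2-vip", "node3-vip"]
accepting = {"node2-vip", "node3-vip"}
print(first_available(address_list, lambda a: a in accepting))
```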

Connecting with Public IP Scenario






Connecting with Virtual IP scenario
