Oracle Clusterware

  • 8/9/2019 Oracle Clusterware

    Oracle Real Application Cluster (Oracle RAC) Session 1: Oracle 10g/11gR2 RAC Architecture

    Ahmed Fathi - Senior Oracle Consultant P a g e | 1

    Email:[email protected]

    Blog:http://ahfathi.blogspot.com

    LinkedIn:http://linkedin.com/in/ahmedfathieg

    Oracle Clusterware

    Oracle Clusterware is software that enables servers to operate together as if they were one server. Each server looks like a standalone server; however, each server runs additional processes that communicate with one another, so the separate servers appear as one server to applications and end users.

    Starting with version 10g Release 1, Oracle introduced its own portable cluster software, Cluster Ready Services. The product was renamed Oracle Clusterware in version 10g Release 2, and from 11g Release 2 it is part of the Oracle Grid Infrastructure software.

    The benefits of using a cluster include:

    - Scalability: multiple nodes allow a cluster database to scale beyond the limits of a single-node database.
    - Availability: if a node fails, the other nodes in the cluster keep running and clients can continue working.
    - Manageability: more than one database can be managed by Oracle Clusterware.
    - Ability to monitor processes and restart them if they stop.
    - Elimination of unplanned downtime due to hardware or software malfunctions.
    - Reduction or elimination of planned downtime for software maintenance.

    There are two kinds of cluster: active/passive and active/active.

    Active/Passive

    This setup usually has two nodes: one is available (active) and the other is not (passive). The Oracle software resides on shared storage and runs on the active node only. On failure, the cluster switches the shared storage over to the passive node, which then becomes the active node, while the failed node becomes the passive one. A number of third-party vendors and platforms, such as Microsoft and Linux, support this kind of cluster. It is usually called an OS cluster.

    Active/Active

    In this kind of setup, Oracle instances run concurrently on both servers and clients access both servers at the same time. The instances communicate with each other (heartbeat) to confirm that both servers are available; if one server goes down, the other server can handle the workload. The benefit of active/active is that the workload can be shared between the servers.

    Voting Disk and Cluster Registry

    Voting Disk: A voting disk is a shared disk that is accessed by all member nodes of the cluster. It stores the cluster membership information and keeps the heartbeat information between the nodes. If a node is unable to ping the voting disk, the cluster recognizes the communication failure and evicts the node from the cluster. The voting disk is also used to determine which instance takes control of the cluster after a node failure, to avoid split brain.

    Oracle Cluster Registry (OCR): Stores and manages configuration information about the cluster resources managed by Oracle Clusterware, such as Oracle RAC databases, database instances, listeners, VIPs, servers, and applications.

    Oracle Local Registry (OLR, 11gR2): Similar to the OCR and introduced in 11gR2, but it stores information about the local node only. It is not shared with the other nodes of the cluster and is used by OHASD while starting or joining a cluster.

    RAC Components

    - Shared Disk System
    - Oracle Clusterware Stack
    - Cluster Interconnects
    - Oracle Kernel Components

    Shared Disk System

    Below are the three major types of shared storage used in RAC:

    Raw volumes: A raw logical volume is an area of physical and logical disk space that is under the direct control of an application, such as a database, rather than under the direct control of the operating system or a file system.

    Cluster file system: This option is not widely used. A cluster file system, such as Oracle Cluster File System (OCFS) for MS Windows and Linux, holds all the datafiles of the RAC database.

    Automatic Storage Management (ASM): Oracle's recommended storage option, introduced in Oracle 10g, which is optimized as a cluster file system for Oracle database files.

    Oracle Clusterware Stack 10g/11gR1

    The Oracle Clusterware comprises several background processes that facilitate cluster operations. The

    Cluster Synchronization Service (CSS), Event Management (EVM), and Oracle Cluster components

    communicate with other cluster component layers in the other instances within the same cluster database

    environment. These components are also the main communication links between the Oracle Clusterware

    high availability components and the Oracle Database. In addition, these components monitor and manage

    database operations.

    Cluster Synchronization Services (CSS): Manages the cluster configuration by controlling which nodes

    are members of the cluster and by notifying members when a node joins or leaves the cluster.

    Cluster Ready Services (CRS): The primary program for managing high availability operations within a

    cluster. Anything that the crs process manages is known as a cluster resource which could be a

    database, an instance, a service, a Listener, a virtual IP (VIP) address, an application process, and so on.

    The crs process manages cluster resources based on the resource's configuration information that is

    stored in the OCR. This includes start, stop, monitor and failover operations. The crs process generates

    events when a resource status changes. When you have installed Oracle RAC, crs monitors the Oracle

    instance, Listener, and so on, and automatically restarts these components when a failure occurs. By

    default, the crs process makes five attempts to restart a resource and then does not make further

    restart attempts if the resource does not restart.

    Event Management (EVM): A background process that publishes events that crs creates.

    Oracle Notification Service (ONS): Allows Clusterware events to be propagated to nodes in the cluster, middle-tier application servers, and clients. EVMD publishes events through ONS.

    RACG: Extends clusterware to support Oracle-specific requirements and complex resources. Runs server

    callout scripts when FAN events occur.

    Process Monitor Daemon (OPROCD): This process is locked in memory to monitor the cluster and provide I/O fencing. OPROCD performs its check and then sleeps; if it wakes up later than the expected time, it resets the processor and reboots the node. An OPROCD failure results in Oracle Clusterware restarting the node. OPROCD uses the hangcheck timer on Linux platforms.
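
    The sleep-and-check idea behind this kind of watchdog can be sketched as follows. This is only an illustration under stated assumptions: the `interval` and `margin` values are illustrative, the clock is a fake one injected for determinism, and the real daemon resets the machine rather than returning a string.

```python
def overslept(expected_wake: float, actual_wake: float, margin: float) -> bool:
    """Return True when the wake-up is later than expected by more than margin."""
    return (actual_wake - expected_wake) > margin


def watchdog_cycle(now, sleep, interval: float = 1.0, margin: float = 0.5) -> str:
    """Run one sleep/check cycle; 'reboot' means the node was stalled too long."""
    start = now()
    sleep(interval)
    return "reboot" if overslept(start + interval, now(), margin) else "ok"


class FakeClock:
    """Hypothetical clock whose sleep() overshoots by a fixed stall time."""

    def __init__(self, stall: float = 0.0):
        self.t = 0.0
        self.stall = stall

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds + self.stall


healthy = FakeClock(stall=0.1)   # wakes 0.1 s late: within the margin
hung = FakeClock(stall=2.0)      # wakes 2 s late: node was hung
healthy_result = watchdog_cycle(healthy.now, healthy.sleep)
hung_result = watchdog_cycle(hung.now, hung.sleep)
```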

    Oracle Clusterware Stack 11gR2

    Oracle Clusterware consists of two separate stacks: an upper stack anchored by the Cluster Ready Services

    (CRS) daemon (crsd) and a lower stack anchored by the Oracle High Availability Services daemon (ohasd).

    These two stacks have several processes that facilitate cluster operations. The following sections describe

    these stacks in more detail:

    - The Cluster Ready Services Stack

    The list in this section describes the processes that comprise CRS. The list includes components that are processes on

    Linux and UNIX operating systems, or services on Windows.

    Cluster Ready Services (CRS): The primary program for managing high availability operations in a cluster. The CRS daemon (crsd) manages cluster resources based on the configuration information that is stored in OCR for each resource. This includes start, stop, monitor, and failover operations. The crsd process generates events when the status of a resource changes. When you have Oracle RAC installed, the crsd process monitors the Oracle database instance, listener, and so on, and automatically restarts these components when a failure occurs.

    Cluster Synchronization Services (CSS): Manages the cluster configuration by controlling which nodes are members of the cluster and by notifying members when a node joins or leaves the cluster.

    The cssdagent process monitors the cluster and provides I/O fencing. This service formerly was provided by Oracle Process Monitor Daemon (oprocd), also known as OraFenceService on Windows. A cssdagent failure may result in Oracle Clusterware restarting the node.

    Oracle ASM: Provides disk management for Oracle Clusterware and Oracle Database.

    Cluster Time Synchronization Service (CTSS): Provides time management in a cluster for Oracle

    Clusterware.

    Event Management (EVM): A background process that publishes events that Oracle Clusterware creates.

    Oracle Notification Service (ONS): A publish and subscribe service for communicating Fast Application Notification (FAN) events.

    Oracle Agent (oraagent): Extends clusterware to support Oracle-specific requirements and complex resources. This process runs server callout scripts when FAN events occur. This process was known as RACG in Oracle Clusterware 11g release 1 (11.1).

    Oracle Root Agent (orarootagent): A specialized oraagent process that helps crsd manage resources owned by root, such as the network, and the Grid virtual IP address.

    - The Oracle High Availability Services Stack

    This section describes the processes that comprise the Oracle High Availability Services stack. The list

    includes components that are processes on Linux and UNIX operating systems, or services on Windows.

    Cluster Logger Service (ologgerd): Receives information from all the nodes in the cluster and persists it in a CHM repository-based database. This service runs on only two nodes in a cluster.

    System Monitor Service (osysmond): The monitoring and operating system metric collection service that sends the data to the cluster logger service. This service runs on every node in a cluster.

    Grid Plug and Play (GPNPD): Provides access to the Grid Plug and Play profile, and coordinates updates to the profile among the nodes of the cluster to ensure that all of the nodes have the most recent profile.

    Grid Interprocess Communication (GIPC): A support daemon that enables Redundant Interconnect Usage.

    Multicast Domain Name Service (mDNS): Used by Grid Plug and Play to locate profiles in the cluster, as well as by GNS to perform name resolution. The mDNS process is a background process on Linux and UNIX and on Windows.

    Oracle Grid Naming Service (GNS): Handles requests sent by external DNS servers, performing name resolution for names defined by the cluster.

    Cluster Interconnects

    The cluster interconnect is the communication path used by the cluster to synchronize resources; in some cases it is also used to transfer data from one instance to another. Typically, the interconnect is a network connection dedicated to the server nodes of a cluster (and is therefore sometimes referred to as the private interconnect).

    Oracle Kernel Components

    The set of additional background processes in each instance is known as the Oracle kernel components in a RAC environment. Because the buffer cache and shared pool become global in RAC, special handling is required to manage these resources and avoid conflicts and corruption. The additional RAC background processes work together with the single-instance background processes to achieve this.

    Global Cache and Global Enqueue Services

    A RAC database system has two important services: the Global Cache Service (GCS) and the Global Enqueue Service (GES). These are basically collections of background processes and memory structures. Together, GCS and GES manage the whole Cache Fusion process, resource transfers, and resource acquisition among the instances.

    In Oracle RAC each instance has its own cache, but an instance may need to access data blocks that currently reside in another instance's cache. This management and data sharing is done by the Global Cache Service (GCS). Resources other than data blocks, such as locks and enqueue details, that are shared across the instances are managed by the Global Enqueue Service (GES).

    The Global Cache Service employs various background processes, such as the Global Cache Service Processes (LMSn) and the Global Enqueue Service Daemon (LMD).

    Global Resource Directory

    The Global Resource Directory (GRD) is an internal in-memory database, stored on all of the running instances, that records the current status of resources, such as data blocks, and enqueues. The GRD is maintained by GES and GCS. Whenever a block is transferred out of a local cache to another instance's cache, the GRD is updated. The following resource information is available in the GRD:

    - Data Block information such as file # and block #

    - Location of most current version

    - Modes of the data blocks: (N)Null, (S)Shared, (X)Exclusive
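
    The three kinds of information listed above map naturally onto a small lookup structure. The sketch below is a toy model only: the class and method names are hypothetical, and the real GRD is distributed across instances and far more involved.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NULL = "N"
    SHARED = "S"
    EXCLUSIVE = "X"


@dataclass
class BlockEntry:
    holder: str   # instance holding the most current version
    mode: Mode    # mode in which the block is held


class ResourceDirectory:
    """Toy directory keyed by (file #, block #)."""

    def __init__(self):
        self._entries = {}

    def record(self, file_no: int, block_no: int, holder: str, mode: Mode) -> None:
        self._entries[(file_no, block_no)] = BlockEntry(holder, mode)

    def lookup(self, file_no: int, block_no: int):
        return self._entries.get((file_no, block_no))


grd = ResourceDirectory()
grd.record(4, 128, "inst1", Mode.EXCLUSIVE)
# The block is transferred to another instance's cache, so the entry is updated.
grd.record(4, 128, "inst2", Mode.SHARED)
entry = grd.lookup(4, 128)
```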

    Oracle RAC Background Processes

    LMS     Global Cache Service Process
    LMD     Global Enqueue Service Daemon
    LMON    Global Enqueue Service Monitor
    LCK0    Instance Enqueue Process
    DIAG    Diagnosability Daemon

    Global Cache Service Processes (LMSn)

    LMS (Lock Manager Server process) is used in Cache Fusion. It enables consistent copies of blocks to be transferred from a holding instance's buffer cache to a requesting instance's buffer cache without a disk write under certain conditions.

    It rolls back any uncommitted transactions for blocks that are requested for a consistent read by the remote instance.
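
    The cache-to-cache transfer can be illustrated with a toy model in which a block already cached by one instance is copied directly to the requester, with no disk I/O. This is a sketch under simplifying assumptions: the class names are hypothetical, and locking, block modes, and consistent-read rollback are ignored.

```python
class Instance:
    def __init__(self, name: str):
        self.name = name
        self.cache = {}   # block_id -> block contents


class Cluster:
    def __init__(self, disk: dict):
        self.disk = disk
        self.instances = {}
        self.disk_reads = 0

    def add(self, name: str) -> None:
        self.instances[name] = Instance(name)

    def read(self, name: str, block_id: int):
        """Return (block, source) where source says where the block came from."""
        requester = self.instances[name]
        if block_id in requester.cache:
            return requester.cache[block_id], "local"
        # Look for a holding instance and ship the block cache to cache.
        for other in self.instances.values():
            if other is not requester and block_id in other.cache:
                requester.cache[block_id] = other.cache[block_id]
                return requester.cache[block_id], "remote cache"
        # Nobody holds it: fall back to a disk read.
        self.disk_reads += 1
        requester.cache[block_id] = self.disk[block_id]
        return requester.cache[block_id], "disk"


cluster = Cluster(disk={7: "row data"})
cluster.add("inst1")
cluster.add("inst2")
first = cluster.read("inst1", 7)    # served from disk
second = cluster.read("inst2", 7)   # shipped from inst1's cache
third = cluster.read("inst2", 7)    # already in inst2's own cache
```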

    Global Enqueue Service Daemon (LMD)

    LMD (Lock Manager Daemon) manages enqueue service requests for the GCS. It also handles deadlock detection and remote resource requests.

    Global Enqueue Service Monitor (LMON)

    LMON (Lock Monitor process) is responsible for managing the Global Enqueue Service (GES).

    It maintains the consistency of GCS memory when a process dies. LMON is also responsible for cluster reconfiguration when an instance joins or leaves the cluster. It also checks for instance death and listens for local messages.

    Instance Enqueue Process (LCK)

    The LCK0 process manages non-Cache Fusion resource requests such as library and row cache requests.

    Diagnosability Daemon (DIAG)

    This background process monitors the health of the instance and captures diagnostic data about process failures within instances. The daemon runs automatically and updates an alert log file to record the activity that it performs.

    Clusterware and heartbeat mechanism

    The cluster needs to know which nodes are members at all times. Oracle Clusterware has two types of heartbeat:

    1. Network heartbeat
    - Performed once per second.
    - A node is evicted from the cluster when it fails to send a network heartbeat within the allowed time frame.

    2. Disk (voting disk) heartbeat
    - Each node of the cluster writes a disk heartbeat to the voting disk every second.
    - A node is evicted from the cluster if no heartbeat is updated within the I/O timeout (MissCount/DiskTimeout).

    What is misscount in Oracle RAC?

    Cluster Synchronization Services (CSS) in RAC has a misscount parameter. This value represents the maximum time, in seconds, that a network heartbeat can be missed before the cluster enters a reconfiguration in order to evict the node. The default value is 60 seconds on Linux in 10g; in 11g it is 30 seconds.
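
    The timeout rule can be sketched as a comparison between the time of each node's last heartbeat and the misscount threshold. This is a simplified illustration, not the CSS algorithm: the function name and the node names are hypothetical, and the 30-second default is the 11g value quoted above.

```python
def nodes_to_evict(last_heartbeat: dict, now: float, misscount: float = 30.0) -> list:
    """Return the nodes whose last heartbeat is older than misscount seconds."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > misscount
    )


# Timestamps (in seconds) at which each node was last heard from.
last_seen = {"node1": 100.0, "node2": 99.0, "node3": 65.0}

evicted = nodes_to_evict(last_seen, now=100.0)                     # node3 silent for 35 s
evicted_10g = nodes_to_evict(last_seen, now=100.0, misscount=60.0)  # nobody exceeds 60 s
```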

    I/O Fencing

    In some situations, leftover write operations from a database instance reach the storage system after the cluster function on that node has failed, because the node is still running at the OS level. Since these operations are no longer serialized, they can damage the consistency of the stored data. Therefore, when a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or disk groups. This methodology is called I/O fencing, disk fencing, or failure fencing.

    Functions of I/O fencing

    I/O fencing prevents updates by failed instances, detects failures, and prevents split-brain in the cluster.

    The cluster volume manager and cluster file system play a significant role in preventing failed nodes from accessing shared devices. Oracle uses an algorithm common to STONITH (shoot the other node in the head) implementations to determine which nodes need to be fenced. STONITH simply means that the healthy nodes kill the sick node. Oracle Clusterware does not do this; instead, it sends the message "Please Reboot" to the sick node. The node bounces itself and rejoins the cluster.

    There are other methods of fencing that are utilized by different hardware/software vendors. When using

    Veritas Storage Foundation for RAC (VxSF RAC), you can implement I/O fencing instead of node

    fencing. This means that instead of asking a server to reboot, you simply close it off from shared storage.

    In versions before 11.2.0.2 Oracle Clusterware tried to prevent a split-brain with a fast reboot (better:

    reset) of the server(s) without waiting for ongoing I/O operations or synchronization of the file systems.

    This mechanism was changed in version 11.2.0.2 (the first 11g Release 2 patch set). After deciding which node to evict, the Clusterware proceeds as follows:

    - It attempts to shut down all Oracle resources and processes on the server (especially processes generating I/O).
    - It stops itself on the node.
    - Afterwards, the Oracle High Availability Services daemon (OHASD) tries to start the Cluster Ready Services (CRS) stack again. Once the cluster interconnect is back online, all relevant cluster resources on that node start automatically.
    - It kills the node if the resources or the processes generating I/O cannot be stopped (hanging in kernel mode, I/O path, etc.).

    Generally, Oracle Clusterware uses two rules to choose which nodes should leave the cluster in order to preserve cluster integrity:

    - In configurations with two nodes, the node with the lowest ID (the first node that joined the cluster) survives and the other one is asked to leave the cluster.
    - With more cluster nodes, the Clusterware tries to keep the largest sub-cluster running.
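
    The two rules can be expressed as a single ordering over the sub-clusters that remain after a split. The sketch below is an illustration only: node IDs are modelled as integers, and breaking every size tie by lowest node ID is an assumption that generalizes the two-node rule.

```python
def surviving_subcluster(subclusters):
    """Pick the sub-cluster that stays up after a split.

    The largest sub-cluster wins; on a tie (for example a two-node cluster
    split into two single nodes) the one containing the lowest node ID wins.
    """
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))


two_node = surviving_subcluster([{2}, {1}])           # node 1 has the lowest ID
four_node = surviving_subcluster([{1}, {2, 3, 4}])    # the three-node group is larger
```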

    When does a node reboot?

    - Network failure (interconnect)
    - Slow interconnect (latency): must fail 30 consecutive times
    - Voting disk I/O: cannot read or write
    - CPU-bound: the CPU is too busy to maintain the heartbeat
    - Files moved, deleted, or changed, or some other human error
    - Configuration error: wrong network for the private interconnect
    - ocssd process died
    - Some Oracle Clusterware bug

    Split-Brain scenario

    The term "Split-Brain" is often used to describe the scenario when two or more co-operating processes in a

    distributed system, typically a high availability cluster, lose connectivity with one another but then continue

    to operate independently of each other, including acquiring logical or physical resources, under the

    incorrect assumption that the other process(es) are no longer operational or using the said resources.

    Fast Application Notification (FAN)

    The purpose of FAN (Fast Application Notification) events is to notify clients about RAC availability and instance (actually service) performance. The client does not actively check the availability or load of an instance and is no longer tied to an instance once connected. The nodes directly inform the application server about which instance is able to provide a defined quality of service.

    FAN is a method, introduced in Oracle 10.1, by which applications can be informed of changes in cluster status for fast node failure detection and workload balancing.

    FAN is advantageous because it prevents applications from:

    - waiting for TCP/IP timeouts when a node fails
    - trying to connect to a database service that is currently down
    - processing data received from a failed node

    Applications can be notified through server-side callouts, Fast Connection Failover (FCF), or the ONS API.
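
    The notification idea can be illustrated with a tiny in-process publish/subscribe bus. This is a sketch only: `EventBus` merely stands in for a notification service, the `INSTANCE_DOWN` event name and payload shape are hypothetical and do not reproduce the real FAN event format, and `ConnectionPool` is a made-up client that reacts to the event instead of waiting for a timeout.

```python
from collections import defaultdict


class EventBus:
    """Minimal publish/subscribe bus (hypothetical stand-in for a notification service)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: str, callback) -> None:
        self._subscribers[event_type].append(callback)

    def publish(self, event_type: str, payload: dict) -> None:
        for callback in self._subscribers[event_type]:
            callback(payload)


class ConnectionPool:
    """Hypothetical client that tracks which instances it may connect to."""

    def __init__(self, instances):
        self.available = set(instances)

    def on_instance_down(self, event: dict) -> None:
        # Stop routing work to the failed instance as soon as the event arrives.
        self.available.discard(event["instance"])


bus = EventBus()
pool = ConnectionPool(["orcl1", "orcl2"])
bus.subscribe("INSTANCE_DOWN", pool.on_instance_down)
bus.publish("INSTANCE_DOWN", {"instance": "orcl1"})
```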

    Why Use Virtual IP?

    The goal is application availability.

    When a node fails, the VIP associated with it automatically fails over to another node. When this occurs, the following happens:

    - The VIP detects the public network failure, which generates a FAN event.
    - The new node announces to the network the new MAC address for the VIP.
    - Clients connected through the VIP immediately receive an ORA-3113 error or equivalent.
    - New connection requests rapidly traverse the tnsnames.ora address list, skipping over the dead nodes instead of waiting on TCP/IP timeouts.

    Without a VIP, clients connected to a node that died will often wait for the TCP/IP timeout period (which can be up to 10 minutes) before getting an error. As a result, you do not really have a good high availability solution without VIPs.
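
    Traversing an address list can be sketched as trying each address in order and moving on as soon as one fails fast. This is an illustration under stated assumptions: `connect` is a hypothetical callable supplied by the caller, the `node1-vip` and `node2-vip` names are made up, and the fake connector simulates a relocated VIP that refuses the connection immediately.

```python
def connect_first_available(addresses, connect):
    """Try each address in order; return (address, connection) for the first success."""
    errors = []
    for address in addresses:
        try:
            return address, connect(address)
        except ConnectionError as exc:
            # A fast failure lets the client move straight on to the next address.
            errors.append((address, exc))
    raise ConnectionError(f"no address reachable: {errors}")


def fake_connect(address: str) -> str:
    """Hypothetical connector: the first node is down, so its VIP refuses at once."""
    if address == "node1-vip":
        raise ConnectionRefusedError("no listener on relocated VIP")
    return f"session on {address}"


chosen, session = connect_first_available(["node1-vip", "node2-vip"], fake_connect)
```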

    Connecting with Public IP Scenario

    Connecting with Virtual IP scenario