8/10/2019 DSI408 Real Application Clusters Internals
1/525
DSI408: Real Application Clusters
Internals
Electronic Presentation
D16333GC10
Production 1.0
April 2003
D37990
Copyright 2003, Oracle. All rights reserved.
This documentation contains proprietary information of Oracle Corporation. It is provided under a license agreement containing restrictions on use and disclosure and is also protected by copyright law. Reverse engineering of the software is prohibited. If this documentation is delivered to a U.S. Government Agency of the Department of Defense, then it is delivered with Restricted Rights and the following legend is applicable:
Restricted Rights Legend
Use, duplication or disclosure by the Government is subject to restrictions for commercial computer software and shall be deemed to be Restricted Rights software under Federal law, as set forth in subparagraph (c)(1)(ii) of DFARS 252.227-7013, Rights in Technical Data and Computer Software (October 1988).
This material or any portion of it may not be copied in any form or by any means without the express prior written permission of the Education Products group of Oracle Corporation. Any other copying is a violation of copyright law and may result in civil and/or criminal penalties.
If this documentation is delivered to a U.S. Government Agency not within the Department of Defense, then it is delivered with Restricted Rights, as defined in FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).
The information in this document is subject to change without notice. If you find any problems in the documentation, please report them in writing to Worldwide Education Services, Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free.
Oracle and all references to Oracle Products are trademarks or registered trademarks of Oracle Corporation.
All other products or company names are used for identification purposes only, and may be trademarks of their respective owners.
Authors
Xuan Cong-Bui
John P. McHugh
Michael Müller
Technical Contributors and Reviewers
Michael Cebulla
Lex de Haan
Bill Kehoe
Frank Kobylanski
Roderick Manalac
Sundar Matpadi
Sri Subramaniam
Harald van Breederode
Jim Womack
Publisher
Glenn Austin
DSI408: Real Application
Clusters Internals
Volume 1 - Student Guide
D16333GC10
Edition 1.0
April 2003
37988
DSI408: Real Application
Clusters Internals
Volume 2 - Student Guide
D16333GC10
Edition 1.0
April 2003
D37989
Contents

Preface
I Course Overview DSI 408: RAC Internals
Prerequisites I-2
Course Overview I-3
Practical Exercises I-5
Section I: Introduction
1 Introduction to RAC
Objectives 1-2
Why Use Parallel Processing? 1-3
Scaleup and Speedup 1-5
Scalability Considerations 1-7
RAC Costs: Synchronization 1-9
RAC Costs: Global Resource Directory 1-10
RAC Costs: Cache Coherency 1-12
RAC Terminology 1-14
Terminology Translations 1-16
Programmer Terminology 1-18
History 1-19
History Overview 1-20
Internalizing Components 1-21
Oracle7 1-22
Oracle8 1-23
Oracle8i 1-24
Oracle9i 1-25
Summary 1-26
2 Introduction to RAC Internals
Objectives 2-2
Simple RAC Diagram 2-3
One RAC Instance 2-4
Internal RAC Instance 2-5
Oracle Code Stack 2-6
RAC Component List 2-7
Module Relation View 2-8
Alternate Module Relation View 2-9
Module, Code Stack, Process 2-10
Operating System Dependencies (OSD) 2-11
Platform-Specific RAC 2-12
OSD Module: Example 2-13
Summary 2-15
References 2-16
Section II: Architecture
3 Cluster Layer: Cluster Monitor
Objectives 3-2
RAC and Cluster Software 3-3
Generic CM Functionality: Distributed Architecture 3-4
Generic CM Functionality: Cluster State 3-5
Generic CM Functionality: Node Failure Detection 3-6
Cluster Layer and Cluster Manager 3-7
Oracle-Supplied CM 3-8
Summary 3-9
4 Cluster Group Services and Node Monitor
Objectives 4-2
RAC and CGS/GMS and NM 4-3
Node Monitor (NM) 4-4
RDBMS SKGXN Membership 4-5
NM Groups 4-6
NM Internals 4-7
Node Membership 4-8
Instance Membership Changes 4-10
NM Membership Death 4-12
Starting an Instance: Traditional 4-13
Starting an Instance: Internal 4-14
Stopping an Instance: Traditional 4-15
Stopping an Instance: Internal 4-16
NM Trace and Debug 4-17
Cluster Group Services (CGS) 4-18
Configuration Control 4-19
Valid Members 4-20
Membership Validation 4-23
Membership Invalidation 4-24
CGS Reconfiguration Types 4-26
CGS Reconfiguration Protocol 4-27
Reconfiguration Steps 4-28
IMR-Initiated Reconfiguration: Example 4-30
Code References 4-32
Summary 4-33
5 RAC Messaging System
Objectives 5-2
RAC and Messaging 5-3
Typical Three-Way Lock Messages 5-4
Asynchronous Traps 5-5
AST and BAST 5-6
Message Buffers 5-7
Message Buffer Queues 5-8
Messaging Deadlocks 5-9
Message Traffic Controller (TRFC) 5-10
TRFC Tickets 5-11
TRFC Flow 5-13
Message Traffic Statistics 5-15
IPC 5-18
IPC Code Stack 5-19
Reference Implementation 5-20
KSXP Wait Interface to KSL 5-21
KSXP Tracing 5-22
KSXP Trace Records 5-23
SKGXP Interface 5-24
Choosing an SKGXP Implementation 5-25
SKGXP Tracing 5-26
Possible Hang Scenarios 5-27
Other Events for IPC Tracing 5-28
Code References 5-29
Summary 5-30
6 System Commit Number
Objectives 6-2
System Commit Number 6-3
Logical Clock and Causality Propagation 6-4
Basics of SCN 6-5
SCN Latching 6-7
Lamport Implementation 6-8
Lamport SCN 6-9
Limitations on SCN Propagation 6-10
max_commit_propagation_delay 6-11
Piggybacking SCN in Messages 6-12
Periodic Synchronization 6-13
SCN Generation in Earlier Versions of Oracle 6-14
Code References 6-15
Summary 6-16
7 Global Resource Directory: Formerly the Distributed Lock Manager
Objectives 7-2
RAC and Global Resource Directory (GRD) 7-3
DLM History 7-4
DLM Concepts: Terminology 7-5
DLM Concepts: Resources 7-6
DLM Concepts: Locks 7-7
DLM Concepts: Processes 7-8
DLM Concepts: Shadow Resources 7-9
DLM Concepts: Copy Locks 7-10
Resource or Lock Mastering 7-11
Basic Resource Structures 7-12
DLM Structures 7-13
Lock Mode Changes 7-16
Simple Lock Changes on a Resource 7-17
Changes on a Resource with Deadlock 7-18
DLM Functions 7-19
DLM Functionality in Global Enqueue Service Daemon (LMD0) 7-20
DLM Functionality in Global Enqueue Service Monitor (LMON) 7-22
DLM Functionality in Global Cache Service Process (LMS) 7-23
DLM Functionality in Other Processes 7-24
Configuring GES Resources 7-25
Configuring GES Locks 7-26
Configuring GCS Resources 7-27
Configuring GCS Locks 7-28
Configuring DLM processes 7-29
Logical to Physical Nodes Mapping 7-30
Buckets to Logical Nodes Mapping 7-31
Mapping for a New Node Joining the Cluster 7-32
Remapping When Node Joins 7-34
Mapping Broadcast by Master Node 7-35
Master Node Determination for GES 7-36
Master Node Determination for GCS 7-37
Dump and Trace of Remastering 7-38
DLM Functions 7-39
kjual Connection to DLM 7-40
kjual Flow 7-42
kjpsod Flow 7-43
DML Enqueue Handling Flow: Example 7-44
Step 1: P1 Locks Table in Share Mode 7-45
Step 2: P2 Locks Table in Share Mode 7-46
Step 3: P2 Does Rollback 7-47
Step 4: P1 Locks Table in Exclusive Mode 7-48
Step 5: P3 Locks Table in Share Mode 7-49
Step 6: P1 Does Rollback 7-50
Steps 1 and 2: Code Flow 7-51
Step 1: kjusuc Flow Detail 7-52
Step 2: kjusuc Flow Detail 7-54
Step 3: Code Flow 7-55
Step 3: kjuscl Flow Detail 7-56
Step 4: Code Flow 7-57
Step 4: kjuscv Flow Detail 7-58
Step 5: kjuscv Flow Detail 7-60
Step 6: kjuscl Flow Detail 7-61
Code References 7-63
Summary 7-64
References and Further Reading 7-65
8 Cache Coherency (Part One): Enqueues/Non-PCM
Objectives 8-2
Cache Coherency: Enqueues 8-3
Enqueue Types 8-6
Enqueue Structure 8-7
Examining Enqueues 8-8
Enqueues and DLM 8-9
Source Tree for Non-PCM Lock Flow 8-10
Lock Modes 8-11
Lock Compatibility 8-12
Deadlock Detection: The Classic Deadlock 8-13
Deadlock Detection: A More General Example 8-15
Deadlock Detection and Resolution 8-16
Timeout-Based Deadlock Detection 8-17
Deadlock Graph Printout 8-18
Deadlock Flow 8-19
Deadlock Flow: One Node 8-21
Deadlock Flow: Two Nodes 8-22
Parallel DML (PDML) Deadlocks 8-23
Deadlock Detection Algorithm 8-24
Deadlock Validation Steps 8-27
Code References 8-28
Summary 8-29
9 Cache Coherency (Part Two): Blocks/PCM Locks
Objectives 9-2
Cache Coherency: Blocks 9-3
Block Cache Contention 9-4
Earlier Cache Coherency: Oracle8 Ping Protocol 9-5
Earlier Cache Coherency: Oracle8i CR Server 9-6
Earlier Cache Coherency: Oracle8i CR Server 9-7
Oracle9i Cache Fusion Protocol 9-8
GCS (PCM) Locks 9-9
PCM Lock Attributes 9-10
Lock Modes 9-11
Lock Roles 9-12
Past Image 9-13
Local Lock Role 9-14
Global Lock Role 9-15
Block Classes 9-16
Lock Elements (LE) 9-17
Allocation of New LE 9-18
Hash Chain of LE 9-19
Block to LE Mapping 9-20
Queues of LE for LMS 9-21
LMSn Free of LE 9-22
Cache Fusion Examples: Overview 9-23
Cache Fusion: Example 1 9-25
Cache Fusion: Example 2 9-26
Cache Fusion: Example 3 9-27
Cache Fusion: Example 4 9-28
Cache Fusion: Example 5 9-29
Cache Fusion: Example 6 9-30
Cache Fusion: Example 7 9-31
Cache Fusion: Example 8 9-32
Cache Fusion: Example 9 9-33
Cache Fusion: Example 10 9-34
Cache Fusion: Example 11 9-35
Views 9-36
Parameters 9-39
Summary 9-40
10 Cache Fusion 1: CR Server
Objectives 10-2
Cache Fusion: Consistent Read Blocks 10-3
Consistent Read Review 10-4
Getting a CR Buffer 10-5
Getting a CR Buffer in Oracle9i Release 2 10-7
CR Server in Oracle9i Release 2 10-8
CR Requests 10-9
Light Work Rule 10-11
Fairness 10-12
Statistics 10-13
Wait Events 10-14
Fixed Table X$KCLCRST Statistics 10-15
CR Requestor-Side Algorithm 10-16
CR Requestor-Side AST Delivery 10-21
CR Requestor-Side CR Buffer Delivery 10-22
CR Server-Side Algorithm 10-23
Summary 10-27
11 Cache Fusion 2: Current Block: XCUR
Objectives 11-2
Cache Fusion: Current Blocks 11-3
PCM Locks and Resources 11-4
Fusion: Long Example 11-5
Initial State 11-7
Step 1: Instance 3 Performs SELECT 11-8
Lock Changes in Instance 3 11-9
Lock Changes in Instance 2 11-10
Step 2: Instance 2 Performs SELECT 11-11
Lock Changes in Instance 2 11-12
Step 3: Instance 2 Performs UPDATE 11-13
Lock Changes in Instance 2 11-14
Lock Changes in Instance 3 11-15
Step 4: Instance 1 Performs UPDATE 11-16
Lock Changes in Instance 2 11-17
Lock Changes in Instance 1 11-18
Step 5: Instance 3 Performs SELECT 11-19
Lock Changes in Instance 3 11-20
Step 6: Instance 1 Performs WRITE 11-21
Lock Changes in Instance 2 11-22
Lock Changes in Instance 1 11-23
Tables and Views 11-24
Summary 11-26
12 Cache Fusion Recovery
Objectives 12-2
Non-Cache Fusion OPS and Database Recovery 12-3
Cache Fusion RAC and Database Recovery 12-4
Overview of Fusion Lock States 12-5
Instance or Crash Recovery 12-6
SMON Process 12-7
First-Pass Log Read 12-8
Block Written Record (BWR) 12-9
BWR Dump 12-10
Recovery Set 12-11
Recovery Claim Locks 12-12
IDLM Response to RecoveryClaimLock Message on PCM Resource 12-13
No Lock Held by Recovering Instance on the PCM Resource 12-14
Recovery Claim Locks 12-15
Second-Pass Log Read 12-17
Large Recovery Set and Partial IR Lock Mode 12-19
Lock Database Availability During Recovery 12-22
Handling BASTs on Recovery Buffers 12-23
IR of Nonfusion Blocks 12-24
Failures During Instance Recovery 12-26
Memory Contingencies 12-28
Code References 12-29
Summary 12-31
Section III: Platforms
13 Linux Platform
Objectives 13-2
Linux RAC Architecture 13-3
Storage: Raw Devices 13-4
Extended Storage 13-5
Linux Cluster Software 13-6
OCMS 13-7
OCMS Components 13-8
WDD, NM, and CM Flow (Up to version 9.2.0.1) 13-9
Watchdog Daemon 13-10
Hangcheck, NM, and CM Flow (After version 9.2.0.2) 13-11
Hangcheck Module 13-12
Node Monitor (NM) 13-13
Cluster Manager 13-14
Linux Port-Specific Code 13-15
Cluster Manager 13-16
skgxpt and skgxpu 13-17
Installing RAC on Linux 13-18
Running RAC on Linux 13-21
Starting CM 13-22
Starting WDD 13-23
Starting NM 13-24
Starting CM 13-25
Debugging 13-26
Summary 13-27
References 13-28
14 HP-UX Platform
Objectives 14-2
HP-UX RAC Architecture 14-3
HP-UX Cluster Software 14-4
HP-UX Port-Specific Code 14-5
SKGXP (UDP Implementation) 14-6
SKGXP: Lowfat 14-7
Installing RAC on HP-UX 14-8
Running RAC on HP-UX 14-9
Debugging on HP-UX 14-10
Summary 14-11
15 Tru64 Platform
Objectives 15-2
Tru64 RAC Architecture 15-3
Shared Disk Systems 15-4
Tru64 Cluster Software 15-5
Tru64 Port-Specific Code 15-6
Node Monitor: SKGXN 15-7
IPC: SKGXP 15-8
SKGXPM: RDG 15-9
Installing RAC on Tru64 15-11
Debugging on Tru64 15-12
Useful Tru64 Commands 15-13
Summary 15-15
16 AIX Platform
Objectives 16-2
AIX RAC Architecture 16-3
AIX SP Clusters 16-4
AIX HACMP Clusters 16-5
AIX Cluster Software 16-6
AIX Cluster Layer 16-7
AIX Port-Specific Code 16-8
RAC on AIX Stack 16-9
Node Monitor (NM) 16-10
Installing RAC on AIX 16-12
Debugging on AIX 16-14
Summary 16-15
References 16-16
17 Other Platforms
Objectives 17-2
RAC Architecture: Solaris 17-3
RAC Architecture: Windows 17-4
RAC Architecture: OpenVMS 17-5
Port-Specific Code 17-6
Installing RAC 17-7
Summary 17-8
Section IV: Debug
18 V$ and X$ Views and Events
Objectives 18-2
V$ and GV$ Views 18-3
List of Views 18-4
Old and New Views 18-5
V$ Views for Lock Information 18-6
X$ Tables 18-7
Events 18-8
19 KST and X$TRACE
Objectives 19-2
KST: X$TRACE 19-3
KST Concepts 19-4
KST Concepts 19-6
Circular Buffer 19-7
Data Structure kstrc 19-8
Trace Control Interfaces 19-9
KST Initialization Parameters 19-10
KST Trace Control Interfaces 19-12
KST Fixed Table Views 19-14
KST Trace Output 19-15
KST Current Instrumentation 19-18
KST Performance 19-19
KST: Examples 19-20
KST Sample Trace File 19-24
KST Demonstration 19-25
DIAG Daemon 19-26
DIAG Daemon: Features 19-27
DIAG Daemon: Design 19-29
DIAG Daemon: Startup and Shutdown 19-33
DIAG Daemon: Crash Dumping 19-34
Summary 19-36
20 ORADEBUG and Other Debugging Tools
Objectives 20-2
ORADEBUG 20-3
Flash Freeze 20-5
LKDEBUG 20-6
NSDBX 20-7
HANGANALYZE 20-8
Summary 20-9
References 20-10
Appendix A: Practices
Appendix B: Solutions
Course Overview
DSI 408: RAC Internals
Prerequisites
Before taking this course, you should have:
Taken DSI 401, 402, and 403 so that you know about the server internals on crashes, dumps, transactions, block handling, and recovery systems
Taken the Real Application Clusters (RAC) administration course so that you know about the external view of RAC
Performed at least one RAC installation and assisted in at least one RAC debugging case
Prerequisites
The prerequisites ensure that the course is useful to you, instead of being too hard, and that
the instructor need not cover basic material.
You must have your TAO account ready for examining source code.
Course Overview
The course includes the following four sections:
Introduction
Architecture
Platforms
Debug
Subjects that are not covered include:
Utilities (srvctl, OCFS, HA)
Performance tuning
Pre-Oracle9i versions (OPS)
Course Overview
This course contains four sections. It is scheduled to take four days but does not require
one day per section. Most of the time is spent on the Architecture section.
Introduction
The Introduction section provides a summary of the public RAC architecture and its
accurate terminology. An overview of architecture changes between versions is also given.
Architecture
The Architecture section covers the theory of operation of RAC. The RAC code stack is
examined from the bottom up. There are many references to the source code.
Platforms
The Platforms section covers the differences and architectural details of RAC
implementation on different platforms. Installation issues and known gotchas are
included.
Course Overview (continued)
Debug
The Debug section provides a detailed explanation of the trace and dump mechanisms that
are placed inside RAC for fault location. A number of practical exercises use these
mechanisms.
Subjects not Covered
This course does not cover utility modules that are not part of the primary core RAC
functionality. It also does not cover some of the external programs that RAC depends on.
Performance is not covered as a separate topic. The knowledge from this course should be
sufficient to identify performance bottlenecks that are purely relevant to RAC; otherwise,
tuning is the same as for a single instance.
For versions of Oracle Parallel Server (OPS), you should review the earlier courses, which point out the differences between RAC and OPS; the RAC knowledge in this course is not applicable to OPS.
Practical Exercises
The course includes practical exercises. Exercises run on a shared Solaris cluster.
Practical Exercises
The cluster hardware is shared between students and other classes; this prevents practices that involve node shutdown or breaking the interconnect.
[Diagram: two clustered nodes, each running the RAC instance stack (SQL Layer, Buffer Cache, GES/GCS, CGS, Node Monitor), communicating over IPC on top of the Cluster Manager]

Section I: Introduction
Introduction to RAC
Objectives
After completing this lesson, you should be able to do the following:
Review the design objectives of Real Application Clusters (RAC)
Relate Oracle9i RAC to its predecessors
Why Use Parallel Processing?
Scaleup: Increased throughput
Speedup: Increased performance or faster response
Higher availability
Support for a greater number of users
Why Use Parallel Processing?
Scaleup: Increased Throughput
Parallel processing breaks a large task into smaller subtasks that can be performed
concurrently. With tasks that grow larger over time, a parallel system that also grows (or
scales up) can maintain a constant time for completing the same task.
Speedup: Increased Performance
For a given task, a parallel system that can speed up reduces the response time for completing the same task.
For decision support system (DSS) applications and parallel queries, parallel processing decreases the response time.
For online transaction processing (OLTP) applications, speedup cannot be expected
due to the overhead of synchronization. Depending on the precise circumstances, a
decrease in performance can occur.
Why Use Parallel Processing? (continued)
Higher Availability
Because each node running in the parallel system is isolated from other nodes, a single node
failure or crash should not cause other nodes to fail. Other instances in the parallel server
environment remain up and running.
The operating system's failover capabilities and the fault tolerance of the distributed cluster software are important infrastructure components.
Support for a Greater Number of Users
Each node can support several users because each node has its own set of resources, such as
memory, CPU, and so on. As nodes are added to the system, more users can also be added,
allowing the system to continue to scale up.
Scaleup and Speedup
[Figure: with scaleup, added hardware completes up to 200% or 300% of the original task in the same time; with speedup, added hardware splits the task into 50% halves and completes it in half the time]
Scaleup and Speedup
Scaleup
Scaleup is the capability of providing continued increases in throughput in the presence of
limited increases in processing capability while keeping the time constant:
Scaleup = (volume parallel / volume original) - time for interprocess communication
For example, if 30 users consume close to 100% of the CPU during their normal
processing, adding more users would cause the system to slow down due to contention for
limited CPU cycles. By adding CPUs, however, extra users can be supported without
degrading performance.
Speedup
Speedup is the capability of providing continued increases in speed in the presence of
limited increases in processing capability while keeping the task constant:
Speedup = (time original / time parallel) - time for interprocess communication
Speedup results in resource availability for other tasks. For example, if queries normally
take 10 minutes to process, and running in parallel reduces the time to 5 minutes, then
additional queries can run without introducing the contention that might occur if they were
to run concurrently.
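The two ratios above can be sketched as a toy model. This is an illustrative sketch only: the function names and the flat communication-overhead term are assumptions for the example, not anything from the Oracle code base.

```python
def scaleup(volume_parallel, volume_original, ipc_time=0.0):
    """Throughput gain, discounted by time spent on interprocess communication."""
    return volume_parallel / volume_original - ipc_time

def speedup(time_original, time_parallel, ipc_time=0.0):
    """Response-time gain, discounted by time spent on interprocess communication."""
    return time_original / time_parallel - ipc_time

# The 10-minute query that runs in 5 minutes when parallelized:
ideal = speedup(10.0, 5.0)                    # 2.0 with no overhead
realistic = speedup(10.0, 5.0, ipc_time=0.3)  # overhead eats into the gain
```

With these definitions, the break-even point is visible immediately: once the communication term approaches the raw ratio, parallelizing stops paying off, which is the OLTP caveat mentioned earlier.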
Scaleup and Speedup (continued)
Speedup (continued)
Example 1: A particular application might take N seconds to fully scan and produce a summary of a 1 GB table.
With scaleup, if the table doubles in size, then doubling hardware resources should allow the query to still complete in N seconds.
With speedup, if the table does not grow in size, doubling the hardware resources should allow the query to complete in N/2 seconds.
Example 2: A particular application might have 100 users, each getting a three-second
response on queries.
With scaleup, if the number of users doubles in size, then doubling hardware resources
should allow response time to remain at three seconds.
With speedup, if the number of users remains the same, doubling the hardware resources
should reduce the response time. This occurs only if the three-second activity can be
broken down into two separate activities that can run independently of each other.
A Success Example of Scaleup
The following testimonial is from the internal RAC mailing list. This was a response to
a question about the ease of changing a single instance to an RAC system.
Just yesterday, we tested with a customer a migration from single instance to two-node
RAC on Solaris. They were using Veritas DBE/AC for the cluster system.
These are the steps we took:
1. Node 1 Server running 9i single instance at approx 80% CPU load.
2. Connection through Transparent Application Failover with 40 retries and a delay of five seconds.
3. Alter shared initialization file to set Cluster Database = true and add extra
parameters for the second node (bdump location and so on).
4. Shut down Database on Node 1.
5. Start up Database on Node 2 using new initialization file.
6. Start up Database on Node 1 using new initialization file.
At this point we had 85% of users on Node 1 and 15% on Node 2.
7. Run a script to disconnect sessions on Node 1 to allow them to load balance across
to Node 2.
At this point we had 50% of users on Node 1 and 50% on Node 2. The database was no
longer highly loaded and we were able to add more (now load-balanced) users.
The application was written in Java and was TAF-aware (i.e., it knew to retry transactions
with certain warning messages). Once we added the second node, the TPMs per Node
remained approximately the same, so we had over a 1.9x improvement in TPMs, which was
pretty good scaling.
Scalability Considerations
Hardware: Disk I/O
Internode communication: High bandwidth and low latency
Operating system: Number of CPUs (for example, SMP)
Cache Coherency and the Global Cache Service
Database: Design
Application: Design
Scalability Considerations
It is important to remember that if any of these six areas are not scalable (no matter how
scalable the other areas are), parallel cluster processing may not be successful.
Hardware scalability: High bandwidth and low latency offer the maximum scalability.
A high amount of remote I/O may prevent system scalability, because remote I/O is
much slower than local I/O.
Bandwidth of the communication interface is the total size of messages that can be
sent per second. Latency of the communication interface is the time required to place
a message on the interconnect. It indicates the number of messages that can be put on
the interconnect per unit of time.
Operating system: Nodes with multiple CPUs and methods of synchronization in the
OS can determine how well the system scales. Symmetric multiprocessing can
process multiple requests to resources concurrently.
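To make the bandwidth/latency distinction concrete, here is a minimal model of a single message transfer. The function and the interconnect figures are invented for illustration; they do not describe any real interconnect.

```python
def transfer_time(message_bytes, bandwidth_bytes_per_s, latency_s):
    """Time to deliver one message: fixed latency plus serialization time."""
    return latency_s + message_bytes / bandwidth_bytes_per_s

# Hypothetical interconnect: 100 MB/s bandwidth, 50-microsecond latency.
lock_msg = transfer_time(256, 100e6, 50e-6)     # small lock message
block_xfer = transfer_time(8192, 100e6, 50e-6)  # 8 KB database block
```

In this model the 256-byte message spends most of its time in the fixed latency term, while the 8 KB block transfer is increasingly dominated by bandwidth; that is why small lock messages care about latency while block shipping cares about bandwidth.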
Scalability Considerations (continued)
"The processes that manage local resource coordination in a cluster database are
identical to the local resource coordination processes in single instance Oracle. This
means that row and block level access, space management, system change number
(SCN) creation, and data dictionary cache and library cache management are the
same in Real Application Clusters as in single instance Oracle. If the resource is
modified by more than one instance, then RAC performs further synchronization on a global level to permit shared access to this block across the cluster. Synchronization
in this case requires intranode messaging as well as the preparation of consistent read
versions of the block and the transmission of copies of the block between memory
caches within the cluster database." (See Oracle9i Real Application Clusters
Concepts Release 2 (9.2), Part Number A96597-01, Chapter 5, Real Application
Clusters Resource Coordination.)
Database scalability: Database scalability depends on how well the database is
designed (for example, how the data files are arranged, how well the locks are
allocated, and how well the objects are partitioned).
Scalability of the application: Application design is one of the keys to taking
advantage of the other elements of scalability. Regardless of how well the hardware
and database scale, parallel processing does not work as desired if the application
does not scale.
A typical cause for the lack of scalability is one common shared resource that must be
accessed often. This causes the otherwise parallel operations to serialize on this bottleneck.
A high latency in the synchronization increases the cost of synchronization, counteracting
the benefits of parallelization. This is a general limitation and not a RAC-specific
limitation.
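The cost of such a serialization bottleneck can be quantified with Amdahl's law, which is not stated in the course text but models exactly this limit; a minimal sketch in Python:

```python
def amdahl_speedup(serial_fraction, n_workers):
    """Upper bound on speedup when serial_fraction of the work
    serializes on one shared resource (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# Even 5% serialization caps a 16-node cluster well below 16x:
print(round(amdahl_speedup(0.05, 16), 2))  # 9.14
```

The same formula explains why a single hot block or sequence row dominates scalability long before the interconnect saturates.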
Copyright 2003, Oracle. All rights reserved.
RAC Costs: Synchronization
To scale, there is a cost in synchronization: Scalability = Synchronization
Less synchronization = Speedup and scaleup
Synchronization is necessary to maintain cache coherency in RAC.
RAC Costs: Synchronization
Synchronization is a necessary part of parallel processing, but for parallel processing to be
advantageous, the cost of synchronization must be determined.
Synchronization provides the coordination of concurrent tasks and is essential for parallel
processing to maintain data integrity or correctness. Proper locking between disjoint SGAs
(Oracle instances) must be maintained to ensure correct data. This is cache coherency.
Partitioning can help reduce synchronization costs because there are fewer
concurrent tasks (that is, fewer concurrent users modifying the same set of data).
An application that modifies a small set of data can cause a high overhead for synchronization if performed in disjoint SGAs.
Contention occurs between instances using a single block or row, such as a table with
one row that is used to generate sequence numbers.
Two ways to synchronize:
Locks: Latches, enqueues, locks
Messages: Send/wait for messages
Synchronization = Amount x Cost
Amount: How often do you need to synchronize?
Cost: How expensive is it to synchronize?
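The Amount x Cost relation above can be written as a trivial model; the figures below are invented for illustration (roughly a cheap local latch get versus an expensive global message):

```python
def synchronization_overhead_us(amount, cost_us):
    """Synchronization = Amount (how often) x Cost (how expensive),
    here expressed in microseconds."""
    return amount * cost_us

# Same amount of synchronization, very different total cost:
local_latch = synchronization_overhead_us(amount=10_000, cost_us=1)    # local latch get
global_msg = synchronization_overhead_us(amount=10_000, cost_us=500)   # global message
print(local_latch, global_msg)  # 10000 5000000
```

Partitioning attacks the amount; a faster interconnect attacks the cost.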
Levels of Synchronization
Row Level (Database): Oracle row-locking feature
- Maximize concurrency
- SCN coherency
Local Cache Level (intra-instance): Every buffer in the cache is protected by logical semaphores (spin latches)
- Access to buffers is synchronized
- CACHE BUFFERS CHAINS, CACHE BUFFER HANDLES latches
Global Cache Fusion (inter-instance DLM): Every buffer in every cache is tracked by the GCS
- Cache coherency / cache consistency
- Global Resource Directory managed by Global Cache Services (GCS) (the old DLM in pre-9i)
Cache coherency: the synchronization of data in multiple caches so that reading a memory location by way of any cache will return the most recent data written to that location by way of any other cache. Sometimes called cache consistency.
Levels of Synchronization: Row Level
Block 100 Block 101
Database
Instance
Global Cache
(iDLM)
fg2
fg1
Update row2
Update row1
Enqueues are local locks that serialize access to various resources. This wait event indicates a wait for a lock that is held by another session (or sessions) in a mode incompatible with the requested mode. See V$LOCK for details of which lock modes are compatible with which. Enqueues are usually represented in the format "TYPE-ID1-ID2" where:
"TYPE" is a 2-character text string
"ID1" is a 4-byte hexadecimal number
"ID2" is a 4-byte hexadecimal number
Levels of Synchronization: Local Cache
Block 100 Block 101
Database
Instance
fg2
fg1
BCache
Update row1
Update row2
Global Cache
(iDLM)
Levels of Synchronization: Global Cache
Block 100 Block 101
Database
Instance
fg1 fg1
Update row1
Update row2
BCache BCache
Global Resource Directory
Global Cache
(iDLM)
Global resources: inter-instance synchronization mechanisms that provide cache coherency for Real Application Clusters. The term can refer to both Global Cache Service (GCS) resources and Global Enqueue Service (GES) resources.
We Need a Cache
(Diagram: foreground processes serializing on blocks in the database.)
Sequencing operations guarantees consistency of data. But serialization minimizes the level of concurrency of the system, and the time to complete a sequence of operations depends on the slowest element: the disks. Serialization is the easiest method to manage concurrency but, conversely, it costs system throughput. Evolutions of Oracle minimize the set of tasks that are serialized.
Given a set of tasks [T1, T2, ..., Tn] that arrive at the times [t1
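The serial-versus-parallel trade described on this slide can be sketched with two completion-time functions (task durations are hypothetical):

```python
def serialized_completion(durations):
    """Fully serialized: each task waits for all earlier tasks."""
    return sum(durations)

def parallel_completion(durations):
    """Fully parallel: completion is bounded by the slowest task,
    which for a database is typically the disk."""
    return max(durations)

tasks = [4, 1, 2, 3]
print(serialized_completion(tasks), parallel_completion(tasks))  # 10 4
```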
Coherency
The systems reach a maximum level of concurrency.
(Diagram: two instances each cache block 100, which is at scn 800 on disk; one foreground starts at SCN 900 and selects row1, the other starts at SCN 1010 and selects row2; both hold the resource Res: 1, 0x100 in shared (S) mode.)
Ex: ALTER SYSTEM DUMP DATAFILE 5 BLOCK 4690;
Syntax: ALTER SYSTEM DUMP DATAFILE {'filename'}|{filenumber}
        BLOCK MIN {blockno} BLOCK MAX {blockno} | BLOCK {blockno}
Note: the block dump reports the buffer cache copy of the block if the block is CURRENT/dirty in the current instance.
alter session set events 'immediate trace name BUFFER level ';
Coherency costs of locks
Fixed*/Releasable 1:M lock model (static)
Block 100 Block 101
Database
Instance
Block 102 Block 103 Block 104
(*) Starting with 9i, the fixed locking mode was removed.
Global Cache
(iDLM)
GC_FILES_TO_LOCKS = 1=100:2=0:3=1000:4-5=0EACH
GC_FILES_TO_LOCKS ={ file_list= lock_count[! blocks][EACH][:...]}
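A hedged sketch of how the GC_FILES_TO_LOCKS syntax above could be parsed; it is simplified and ignores the optional !blocks grouping factor:

```python
import re

def parse_gc_files_to_locks(value):
    """Parse a simplified GC_FILES_TO_LOCKS string into
    {file#: (lock_count, each)}."""
    result = {}
    for clause in value.split(":"):
        files, _, spec = clause.partition("=")
        count, each = re.match(r"(\d+)(EACH)?$", spec).groups()
        if "-" in files:
            lo, hi = map(int, files.split("-"))
            numbers = range(lo, hi + 1)
        else:
            numbers = [int(files)]
        for f in numbers:
            result[f] = (int(count), bool(each))
    return result

print(parse_gc_files_to_locks("1=100:2=0:3=1000:4-5=0EACH"))
```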
PCM lock names
The type is always BL (because PCM locks are buffer locks).
ID1 is the block class (described in Classes of Blocks).
ID2: For fixed locks, ID2 is the lock element (LE) index number obtained by hashing the block address (see the GV$LOCK_ELEMENT/GV$GC_ELEMENT fixed view). For releasable locks, ID2 is the database address of the block.
Non-PCM locks
CF  Controlfile Transaction
CI  Cross-Instance Call Invocation
DF  Datafile
DL  Direct Loader Index Creation
DM  Database Mount
DX  Distributed Recovery
FS  File Set
KK  Redo Log Kick
IN  Instance Number
IR  Instance Recovery
IS  Instance State
MM  Mount Definition
MR  Media Recovery
ST  Space Management Transaction
IV  Library Cache Invalidation
L[A-P]  Library Cache Lock
N[A-Z]  Library Cache Pin
Q[A-Z]  Row Cache
PF  Password File
PR  Process Startup
PS  Parallel Slave Synchronization
RT  Redo Thread
SC  System Commit Number
SM  SMON
SN  Sequence Number
SQ  Sequence Number Enqueue
SV  Sequence Number Value
TT  Temporary Table
False Pinging
Block 100 Block 101
Database
Instance
Block 102 Block 103 Block 104
fg1
updating BCache
dba:101 dba:103 dba:105
LE: 23
Global Cache
(iDLM)
When another instance needs access to dba:100, the owning instance must ping all the dirty blocks that are covered by the same LE.
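False pinging can be illustrated with a toy hash standing in for the real block-address-to-LE mapping; the modulus and block numbers here are invented:

```python
def lock_element(dba, n_lock_elements):
    """Map a data block address to a lock element index.
    A stand-in hash: many unrelated blocks share one LE."""
    return dba % n_lock_elements

# Blocks 23, 73, 123, and 173 all collide on LE 23, so a request for
# any one of them forces a ping of every dirty block in the set:
n_les = 50
print([dba for dba in (23, 73, 101, 123, 173) if lock_element(dba, n_les) == 23])
```

This is the cost the releasable 1:1 model on the next slide avoids: with one LE per block, a ping affects only the block actually requested.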
Releasable 1:1 lock model (dynamic)
Block 100 Block 101
Database
Instance
Block 102 Block 103 Block 104
fg1
BCache
dba:101 dba:103 dba:105
LE: 100 LE: 105
updating
Global Cache
(iDLM)
break on GC_ELEMENT_NAME
select inst_id,GC_ELEMENT_NAME,CLASS,MODE_HELD
from gv$gc_element where GC_ELEMENT_NAME>20970000
order by GC_ELEMENT_NAME;
INST_ID GC_ELEMENT_NAME CLASS MODE_HELD
---------- --------------- ---------- ----------
1 20971522 0 5
1 20971523 0 5
1 20971913 0 3
1 20971914 0 3
1 20976209 0 3
2 0 3
1 20976210 0 0
2 0 5
-- GC_ELEMENT_NAME is the DBA; split the hex DBA into File#,Block#
-- (top 10 bits = file#, low 22 bits = block#).
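The split of a GC_ELEMENT_NAME (a DBA) into file# and block# follows the standard 32-bit DBA layout, with the relative file number in the top 10 bits and the block number in the low 22 bits (what dbms_utility.data_block_address_file and data_block_address_block compute); a sketch:

```python
def dba_to_file_block(dba):
    """Split a 32-bit data block address into (file#, block#):
    top 10 bits are the file number, low 22 bits the block number."""
    return dba >> 22, dba & 0x3FFFFF

# GC_ELEMENT_NAME 20971522 (0x01400002) from the query above:
print(dba_to_file_block(20971522))  # (5, 2)
```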
Scalability
Scaleup: Scaleup is the capability to provide continued increases in throughput in the presence of limited increases in processing capability while keeping time constant:
Scaleup = (volume parallel) / (volume original)
Speedup: Speedup is the capability to provide continued increases in speed in the presence of limited increases in processing capability, while keeping the task constant:
Speedup = (time original) / (time parallel)
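The two definitions turn directly into calculators; the input figures below are hypothetical:

```python
def speedup(time_original, time_parallel):
    """Speedup: the same task completed in less time."""
    return time_original / time_parallel

def scaleup(volume_parallel, volume_original):
    """Scaleup: more volume processed while keeping time constant."""
    return volume_parallel / volume_original

# A batch job drops from 100 to 25 minutes; throughput rises from 100 to 350 tps:
print(speedup(100, 25), scaleup(350, 100))  # 4.0 3.5
```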
RAC Costs: Global Resource Directory
Single instance: Synchronization of concurrent tasks and access to shared resources
Global Resource Directory (GRD) to record information about how resources are used within a cluster database. The Global Cache Service (GCS) and Global Enqueue Service (GES) manage the information in this directory. Each instance maintains part of the global resource directory in the System Global Area (SGA).
RAC Costs: Global Resource Directory
In single-instance environments, locking coordinates access to a common resource, such as
a row in a table. Locking prevents two processes from changing the same resource (or row)
at the same time.
In RAC environments, internode synchronization is critical because it maintains proper
coordination between processes on different nodes, preventing them from changing the
same resource at the same time. Internode synchronization guarantees that each instance
sees the most recent version of a block in its buffer cache.
RAC Costs: Global Resource Directory (continued)
Resource coordination within Real Application Clusters occurs at both an instance level
and at a cluster database level. Instance level resource coordination within Real
Application Clusters is referred to as local resource coordination. Cluster level
coordination is referred to as global resource coordination.
"The processes that manage local resource coordination in a cluster database are identical to the local resource coordination processes in single instance Oracle. This means that row and block level access, space management, system change number (SCN) creation, and data dictionary cache and library cache management are the same in Real Application Clusters as in single instance Oracle.
If the resource is modified by more than one instance, then RAC performs further synchronization on a global level to permit shared access to this block across the cluster. Synchronization in this case requires intranode messaging as well as the preparation of consistent read versions of the block and the transmission of copies of the block between memory caches within the cluster database." (See Oracle9i Real Application Clusters Concepts Release 2 (9.2), Part Number A96597-01, Chapter 5, "Real Application Clusters Resource Coordination.")
Note: Global Cache Service (GCS) and Global Enqueue Service (GES) do not interfere
with row-level locking and vice versa. Row-level locking is a transaction feature.
RAC Costs: Cache Coherency
Cache coherency is the technique of keeping multiple copies of an object consistent between different Oracle instances.
RAC Costs: Cache Coherency
Maintaining cache coherency is an important part of a cluster. Cache coherency is the
technique of keeping multiple copies of an object consistent between different Oracle
instances (or disjoint caches) on different nodes.
Global cache management ensures that access to a master copy of a data block in an SGA
is coordinated with the copy of the block in other SGAs.
Therefore, the most recent copy of a block in all SGAs contains all changes that are made
to that block by any instance in the system, regardless of whether those changes have been
committed on the transaction level. Full redo protection of the block changes is maintained.
RAC Costs: Cache Coherency
Node 1
Instance A
SGA
GES/GCS
Node 2
Instance B
SGA
GES/GCS
Node 3
Instance C
SGA
GES/GCS
RAC Costs: Cache Coherency (continued)
The cost (or overhead) of cache coherency is the need before any access to a specific
shared resource to first check with the other instances whether this particular access is
permitted. The algorithms optimize the need to coordinate on each and every access, but
some overhead is incurred.
The GCS tracks the locations, modes, and roles of data blocks. The GCS therefore also
manages the access privileges of various instances in relation to resources. Oracle uses the
GCS for cache coherency when the current version of a data block is in one instance's
buffer cache and another instance requests that block for modification. If an instance reads a block in exclusive mode, then in subsequent operations multiple transactions within the
instance can share access to a set of data blocks without using the GCS. This is true,
however, only if the block is not transferred out of the local cache. If the block is
transferred out of the local cache, then the GCS updates the Global Resource Directory
that the resource has a global role; whether the resource's mode converts from exclusive to
another mode depends on how other instances use the resource.
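The mode and role bookkeeping described above can be caricatured as a toy directory; this is purely illustrative (the real GRD tracks far more state, and the naming here is invented):

```python
class ToyDirectory:
    """Sketch of GCS bookkeeping: which instances hold a block,
    in what mode, and whether the resource role has gone global."""

    def __init__(self):
        # dba -> {"holders": {instance: mode}, "role": "local" or "global"}
        self.entries = {}

    def acquire(self, dba, instance, mode):
        entry = self.entries.setdefault(dba, {"holders": {}, "role": "local"})
        if any(i != instance for i in entry["holders"]):
            # The block leaves a single local cache: the role becomes
            # global, and an exclusive holder is downgraded to shared
            # (a simplification of the real mode-conversion rules).
            entry["role"] = "global"
            for i, m in entry["holders"].items():
                if m == "X":
                    entry["holders"][i] = "S"
        entry["holders"][instance] = mode
        return entry


d = ToyDirectory()
d.acquire(100, "inst1", "X")          # local exclusive access, no global traffic
state = d.acquire(100, "inst2", "S")  # second instance: role converts to global
print(state["role"], state["holders"])
```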
RAC Terminology (continued)
Data buffer cache blocks are the most obvious and most heavily used global resource.
There are other data item resources that are global in the cluster, such as transaction
enqueues and database data structures. The data buffer cache blocks are handled by the
Global Cache Service (GCS), also called Parallel Cache Management (PCM). The nondata
block resources are handled by Global Enqueue Services (GES), also called Non-Parallel Cache Management (non-PCM).
The Global Resource Manager (GRM) keeps the lock information valid and correct
across the cluster.
From the module skgxn.h:
Node: An individual computer with one or more CPUs, some memory, and access to disk storage (generally capable of running an instance of OPS).
Cluster: A collection of loosely coupled nodes that support a parallel Oracle database.
Cluster Membership: The set of active nodes in a cluster. These are the nodes that are "alive" and have access to shared resources (that is, shared disk). Nodes that are not in the current cluster membership must not have access to shared resources.
Instance: Distributed services typically are made up of several identical components, one on each node of a cluster. One of these components will be called an "instance." For example, an OPS database will have an Oracle instance running on each node.
Process: For the purposes of this interface, a process is a unit of execution. On some operating systems, this may be equivalent to an OS process. On others, it may be equivalent to an OS thread. A process is considered terminated when it can no longer execute, pending OS requests are completed/canceled, and any process-local resources are released.
Note that the older OPS terms are used in the code, but the terms are also valid for RAC.
Terminology Translations
Terminology depends on the speaker Product managers to sales or marketing
Support, technical teams, development
Terminology depends on the version
Older terms tend to stay in code
Variable names and prefixes reflect the older name
Newer names reflect newer application or
functionality
Terminology Translations
RAC = OPS. OPS is the older term. See the History slide (#19) in this lesson.
Row Cache = Dictionary Cache. Row Cache is the older term. It is the SGA area to cache
database dictionary information. It is a global resource.
Distributed Lock Manager (DLM) = Global Resource Manager (GRM). DLM is the older
term; GRM has slightly more functionality. The terms are used for any locking system that
can handle several processes, typically (but not necessarily) on several nodes.
DLM = IDLM = UDLM. The DLM term is a very general term, but also refers to the external operating system-supplied DLM used by Oracle7. IDLM refers to the Integrated DLM introduced in Oracle8. UDLM is the Universal DLM, that is, the reference implementation of a DLM made on the Solaris platform. It is often called by its code reference skgxn-v2.
Some of the RAC processes have retained their old names but are described with a
different purpose:
LMON: Global Enqueue Service Monitor, previously Lock Monitor
LMD: Global Enqueue Service Daemon, previously Lock Monitor Daemon
LMS: Global Cache Service Processes, previously Lock Manager Services
Terminology Translations (continued)
Terminology in This Course
This course reflects the mixed usage of similar terms and aligns more with the terminology
of code than with the externalized names.
Programmer Terminology
Client or user: calling code
Callback: routine to execute when the called program has new information
Programmer Terminology
Inside the code, comments often refer to the programmer's point of view.
Client and User are used interchangeably, and refer to the calling code.
Client code can register interest in a service by giving a pointer to a data structure that is to
be updated or a routine that is to be called, when the service has completed the required
action.
History
Real Application Clusters (RAC) is the currentproduct.
RAC has some similarity to Oracle Parallel Server (OPS):
Has the same end-user capability: a clustered database
Scales better because of better internal handling of cache coherency
Has some internal, fundamental changes in the global cache
History
Oracle Parallel Server (OPS) historically had a bad reputation; it was not scalable. Most
applications ran slower on an OPS system than on a single instance. There was a need to
carefully determine which instance performed DML on which tables or (more accurately)
on which blocks. With RAC this need has been eliminated, resulting in true scalability.
Although RAC borrows much code from OPS, the official policy is not to mention that
RAC is an evolved version of OPS. Oracle does not want the bad reputation of OPS to
adversely affect the reputation of RAC in the market. Internally (in the code), the OPS
heritage in RAC is evident.
History Overview
OPS 6 was not in production and was available only on limited platforms.
OPS 7 was platform generic, relying on an external DLM.
OPS 8 had the Integrated Distributed Lock Manager.
OPS 8i had Cache Fusion Stage 1.
RAC 9i has Cache Fusion Stage 2.
The database layout for different versions has not changed.
History Overview
Some components have undergone changes in scope and name. The system that ensures
that access to a block is coherent is the Global Cache Manager in Oracle9i. In Oracle8i and
Oracle8, this was the Integrated Distributed Lock Manager. Earlier it was an external
operating system-supplied service that the Oracle processes called. The Cluster Group
Service of Oracle9i and Oracle8i was the Group Membership Services module in Oracle8
and (before that) part of the external Distributed Lock Manager.
Although there have been many changes to the architecture in the instance, the database
structure has changed only marginally. Separate redo threads and undo spaces are still used.
Internalizing Components
(Diagram)
Oracle7: RDBMS -> DLM API -> DLM, CM & Op. Sys.
  No local state in instance; simulated callback, enqueue translation
Oracle8: RDBMS -> IDLM -> CM & Op. Sys.
  Callbacks, enqueues; local state in SGA memory
Internalizing Components
The development of RAC has internalized more operating system components for each
version. As an example, the diagram on the slide shows the internalization of the
Distributed Lock Manager (DLM) in the development of Oracle7 to Oracle8. Instead of
calling the external operating system whenever any lock status needed checking by the
DLM API module, the IDLM module in the Oracle server only needs to examine its SGA.
The RDBMS routines did not in principle need to reflect the change.
The earlier versions had the DLM external, which limited the functionality (lowest
common denominator effect) that the Oracle server could rely on, and the need to pass data to external services. Data transfer used pipes or network communication to the
external processes; control for I/O completion used Asynchronous Trap (AST)
mechanisms, polling mechanisms, or blocked waits. Internal communication inside the
Oracle server, even between the various background processes, can use the common
SGA memory area that includes latches and enqueues.
This is merely illustrative and is not an accurate summary of the changes made.
The Oracle8 to Oracle9i development similarly internalized the GMS interface (that is, the
Node Monitor (NM) functionality), relying on only the Cluster Manager (CM) interface
routines.
Oracle7
The differences between a non-OPS server and anOPS-enabled Oracle server were few:
Database structure changes
Separate redo per instance
Separate undo per instance
Addition of LCK process in instance
Oracle7
OPS in Oracle7 consisted of the database structural changes for cluster operation (as in all
versions) and the addition of the LCK process that communicated with the external DLM.
The instances not only coordinated global cache coherency through the DLM but also used
the DLM as the communication channel for registering into the OPS cluster.
The method for sending the SCN or other messages was platform specific.
External DLM
The external DLM usage had the following characteristics:
It had to be running before any instance started.
Resources and locks had to be adequately configured.
Death of the DLM on a node implied death of all its clients on the node.
OPS/DLM diagnostics had to have port-specific lock dumps.
Internode parallel query code had to be port specific.
Oracle8
First stage in internalizing cluster communications:
Oracles own lock manager in Oracle server
New communication path for clusterwidemessages
New background processes LMD and LMON
Cluster state communication through externalGroup Membership Service (GMS)
Oracle8
The internal DLM meant that resource allocation was inside the Oracle server. Diagnostic
lock dumps no longer needed to be port specific. The Oracle server, version 8 (and later),
started communicating with the cluster services of the operating system. The interface
consisted of the GMS that was an Oracle-specified API. The GMS functionality included:
Supplying each instance with the current set of registered members, clusterwide
Notifying other members when a member joins or leaves
Automatically deregistering dead processes/instances from their groups
Interfacing with the node monitor for cluster events
Oracle8i
Cache Fusion Stage 1: Read/write blocks sent via the interconnect and not
through the disk
CR server process BSP
More cluster communication functions as part ofOracle server code
GMS functionality split into Cluster Group Services (CGS) and Node Monitor (NM) in the skgxn-v2
Lock Manager structures in the shared pool
Oracle8i
The Cache Fusion Stage 1 satisfied some types of block requests across the cluster
communication paths (rather than via disk) and made use of the messaging services.
The Oracle8 GMS has been split into OSD and Oracle kernel components. The node monitor OSD skgxn is extended from monitoring a single client per node to arbitrarily named process groups. The rest of the GMS functionality is moved into Oracle as CGS. A distributed name service is added to CGS.
LMON executes most of the CGS functionality:
Joins the skgxn process group representing the instances of the specified group
Connects to other members and performs synchronization to ensure that all of them
have the same view of group membership
Oracle9i
Cache Fusion Stage 2: Write/write blocks handled concurrently
GCS and GES instead of IDLM
Enhanced instance availability
Instance Member Reconfiguration (IMR)
New recovery features
Enhanced messaging for inter-instance
communication
Oracle9i
The remainder of this course is based on Oracle9i.
Summary
In this lesson, you should have learned how to: Determine whether to use RAC in application
design
Describe RAC improvements over its predecessor
Introduction to RAC Internals
Objectives
After completing this lesson, you should be able to dothe following:
Outline the RAC architecture with internalreferences
Relate the RAC-related modules to the Oraclecode stack
Simple RAC Diagram
Node
Instance(SGA,processes)
Node
Instance(SGA,processes)
Node
Instance(SGA,processes)
Clusterdisk/filesystem
High-speed interconnect
Simple RAC Diagram
The node contains more than just the instance. It includes the operating system, network
stacks for various protocols, disk software, and a number of Oracle noninstance processes:
Listener, Intelligent Agent, and the foreground/shadow server processes.
The instance has its usual complement of background processes (more so with the RAC
configuration). They connect to the disk system, the network, and the high-speed
interconnect.
The cluster disk or file system may be mirrored, RAID-based, SAN/Fiber-based, or JBOD
(just a bunch of disks). If it is a clusterwide file system, it can contain the Oracle home code. The clusterwide disks can be host-managed (that is, the controller is part of the node)
but are serviced to the cluster and equivalent to clusterwide disks. Local disks are of little
interest to RAC but are used for noncommon files where the common disks are raw disks.
Note: There are some issues with node-specific files of the Intelligent Agent or the password file orapw when using a cluster file system. The solution varies with the platform and the CFS that are used.
One RAC Instance
SGA contains (but is not limited to):
Library, row, and buffer caches
Global Resource Directory
Other background processes are:
LGWR, SMON, and so on
PQ, Jobs, and so on
Dispatchers and servers
Foreground processes not shown
Node
Instance
CM
LMON
DIAG
LMD LMS
LCK
DBW0 PMON
SGA
One RAC Instance
This is the traditional view of an instance and its background processes. All processes are, however, the same program, oracle.exe or oracle, just instantiated with different startup parameters (see source opirip and WebIV Note:33174.1). On Windows, this is more apparent; there is clearly only one Oracle process showing in the Task Manager, but with a number of threads.
All caches in the SGA are either global and must be coherent across all instances, or they
are local. The library, row (also called dictionary), and buffer caches are global. The large
and Java pool buffers are local. For RAC, the Global Resource Directory is global in itself and also used to control the coherency.
The LMON process communicates with its partner process on the remote nodes. Other
processes may have message exchanges with peer processes on the other nodes (for
example, PQ). The LMS and LMD processes, for example, may directly receive requests
from remote processes.
The Cluster Monitor (CM) system communicates with the other CMs on other nodes and is
not part of the Oracle RAC instance. But it is a necessary component.
Internal RAC Instance
kqlm: Library cache (fusion)
kqr: Dictionary/row cache
kcl: Buffer cache
ksi: Instance locks
kjb: Global Cache Service
kju: Global Enqueue
Service
CGS: Cluster Group Services
NM: Node Monitor
IPC: Interprocess Communication
Node
Instance
CM
NM skgxn.v2
ksi
GCS kjb/GES kju
CGS kjxg
kcl
skgxp
IPC
kql kqr
kqlm
Internal RAC Instance
This is an internal view of some of the instance code stack and the RAC-relevant sections
and modules.
The NM layer is the communication layer to the CM. The IPC services facilitate other
process to process communication on different instances.
The CGS maintains the state of the RAC-cluster, knowing which instances are in the
cluster and which are not. Contrast this with the node availability.
The GRD is the data structure that stores Global Enqueue and Global Cache objects; it is
aware of every clusterwide resource. Resources are typically a buffer element, like a data buffer, or a data file, but can also be abstract entities, such as an enqueue or NM resource.
The three buffer caches are used by the various user foreground processes by calling handling routines (kqlm, kqr, kcl) for allocation, deallocation, and locking. The handling routines maintain coherency by using kcl. The data buffer cache is the sole user of the GCS.
Note: Other skg-interfaces, such as skgfr (disk I/O), are not shown.
Oracle Code Stack
UPI  User Program Interface
OPI  Oracle Program Interface
KK   Kernel Compilation Layer
KX   Kernel Execution Layer
K2   Kernel Distributed Execution Layer
NPI  Network Program Interface
KZ   Kernel Security Layer
KQ   Kernel Query Layer
RPI  Recursive Program Interface
KA   Kernel Access Layer
KD   Kernel Data Layer
KT   Kernel Transaction Layer
KC   Kernel Cache Layer
KS   Kernel Services Layer
KJ   Kernel Lock Management Layer
KG   Kernel Generic Layer
S    Operating System Dependencies
OCI  Oracle Call Interface
Oracle Code Stack
The first few characters of the routine and structure names indicate which layer in the code
stack they come from.
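The naming convention above can be sketched as a longest-known-prefix lookup. This is an illustrative helper only (the function `layer_of` and its table are invented for this sketch, not part of the Oracle source); it simply encodes the layer abbreviations listed on the slide.

```c
#include <string.h>

/* Hypothetical helper: map an Oracle kernel routine name to its code-stack
 * layer using the prefixes from the slide. Two-character prefixes are
 * checked before the one-character "s" so that "ks..." is not mistaken
 * for an OSD routine. */
const char *layer_of(const char *routine) {
    static const struct { const char *prefix; const char *layer; } map[] = {
        {"kj", "Kernel Lock Management Layer"},
        {"kc", "Kernel Cache Layer"},
        {"ks", "Kernel Services Layer"},
        {"kg", "Kernel Generic Layer"},
        {"kq", "Kernel Query Layer"},
        {"kt", "Kernel Transaction Layer"},
        {"kd", "Kernel Data Layer"},
        {"ka", "Kernel Access Layer"},
        {"kz", "Kernel Security Layer"},
        {"s",  "Operating System Dependencies"},
    };
    for (size_t i = 0; i < sizeof map / sizeof map[0]; i++)
        if (strncmp(routine, map[i].prefix, strlen(map[i].prefix)) == 0)
            return map[i].layer;
    return "unknown";
}
```

For example, a routine named `kclget` would be placed in the Kernel Cache Layer, and `skgxp` routines in the OSD layer.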
RAC Component List
This course examines the following RAC components:
Cluster Layer and Cluster Manager (CM)
Node Monitor (NM)
Cluster Group Services (CGS)
Global Cache Service and Global Enqueue Service (GCS and GES)
Interprocess Communication (IPC)
Cache Fusion in the GCS
Cache Fusion Recovery
RAC Component List
This course examines the components listed in the slide. This is the stack, with the most
fundamental module listed first (with some exceptions).
Module Relation View
[Diagram: ORACLE sits above GCS, GES, DRM/FR, CGS/IMR, the DLM (GRD), NM, and IPC; the DLM and IPC use KSXP over SKGXP, and NM uses SKGXN.]
Module Relation View
GCS: Global Cache Service, or PCM locks
GES: Global Enqueue Service, or non-PCM locks
DRM/FR: Dynamic Resource Mastering/Fast Reconfiguration. Only partially activated in
a standard Oracle9i Release 2 installation.
IMR: Instance Membership Recovery. LMON handles instance death and split brain (two
networks).
KSXP: Multiplexing service (multithreaded layer). Allows the DLM to do a lazy send; ksxp informs the client after the send is completed.
NM: Node Monitor. Tracks instances joining and leaving the cluster.
IPC: Interprocess Communication. There is usually a choice of underlying protocols,
depending on the platform and hardware. The default is UDP (lightweight; consumes no
resources/connections); alternatives include memory-mapped I/O (an enhancement to the IPC interface used by Cache Fusion) and port-based communication.
CGS: Cluster Group Service. Handles synchronizing the membership bitmap and also acts as a name service for
publishing and querying configuration data. CGS in Oracle9i is changed from earlier
versions to speed up reconfiguration.
Alternate Module Relation View
[Diagram: client code (kcl, ksq, ksi) and PQ sit above the DLM and CGS; the DLM and PQ communicate through KSXP over SKGXP.]
Module, Code Stack, Process
The same code is present in all foreground and background processes.
Modules may be constrained to run in a specific process.
Module, Code Stack, Process
Although the running Oracle server consists of several processes (both foreground and
background), remember that the same program runs in all of them. Each process is
limited to performing a set of functions, and thus some code is active in only some
processes. There is no LMON program module, for example, but some routines in the KJB source
modules carry a comment stating that the function runs only in the LMON process. This
can be confusing when examining code in which one process calls another.
Cross-process calls require a message or posting, and execution may have to wait until the
called process starts executing; in other words, a context switch must occur.
On the Windows platform, there is only one process. The various Oracle server processes
are implemented as threads inside this program.
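The post-and-wait behavior described above can be modeled with a tiny queue: the "caller" only enqueues a message, and nothing happens until the target process is scheduled and drains its queue. This is a conceptual sketch only; the structure and function names (`proc`, `post`, `schedule`) are invented for illustration and do not correspond to Oracle internals.

```c
#include <stddef.h>

enum { MAXMSG = 8 };

/* A toy stand-in for a background process such as LMON. */
struct proc {
    const char *name;     /* e.g. "LMON" */
    int queue[MAXMSG];    /* pending message ids (ring buffer) */
    int head, tail;
    int handled;          /* messages the process has acted on */
};

/* "Post": the caller only enqueues; no target code runs yet. */
int post(struct proc *p, int msg) {
    if ((p->tail + 1) % MAXMSG == p->head) return -1;  /* queue full */
    p->queue[p->tail] = msg;
    p->tail = (p->tail + 1) % MAXMSG;
    return 0;
}

/* Later, a context switch lets the target run and drain its queue. */
void schedule(struct proc *p) {
    while (p->head != p->tail) {
        p->head = (p->head + 1) % MAXMSG;
        p->handled++;                /* the work happens only here */
    }
}
```

The point of the sketch: between `post()` and `schedule()` the caller has made a request but no work has been done, which is exactly the window a context switch covers.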
Operating System Dependencies(OSD)
Code that must be separate for each platform is typically collected in OSD modules.
Generic version: Runs on the development system
Reference version: Classic version ported to all platforms
Platform version: Optimized and specialized; several versions may exist.
OSD code is bracketed with #ifdef ... #endif in some modules.
Operating System Dependencies (OSD)
This applies to many other Oracle server products or functions but is much more visible
with RAC.
If the platform dependency is small, it may be bracketed by the #ifdef ... #endif
construction; otherwise, a common routine is called in an OSD module, which is
appropriately rewritten for each platform. Such modules are generic. For example, refer to the skgxnr.c module.
For some OSD modules, there may be more than one version. For example, the IPC
implementation has a number of protocols that can be used. One OSD module with the same interface is written for each protocol. Only one module is linked into the Oracle server, and that
choice decides the IPC protocol to be used.
Where several implementations are possible, a reference module is constructed. It is
runnable on all platforms and is the lowest common denominator. It proves functionality
and is used to verify the correct behavior of the other, specialized versions of the
module. However, it may not be the version actually used.
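The two techniques above can be contrasted in a short sketch. This is illustrative only, not actual Oracle source: `page_size` shows the in-place #ifdef style for a small dependency, and the wrapper (with an invented skg-style name) shows the common-entry-point style that a per-platform OSD module would implement instead.

```c
/* Style 1: a small platform dependency bracketed in place. */
static long page_size(void) {
#ifdef _WIN32
    return 4096;          /* branch compiled only on Windows */
#else
    return 8192;          /* branch compiled everywhere else */
#endif
}

/* Style 2: one common prototype behind which each platform group would
 * ship its own implementation, selected when the server is linked.
 * (Name is hypothetical.) */
long skgx_page_size_example(void) {
    return page_size();
}
```

Whichever branch the preprocessor keeps, callers above the OSD boundary see one function with one signature.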
Platform-Specif ic RAC
These are kernel routines, so the names start with K.
Service routines start with KS.
OSD routines start with S or SS.
OSD code is written by the porting groups.
[Diagram: higher layers (SQL, Transaction, Data) sit above the Cache layer (KC*), GES and GCS (KJ*), the Service layer (KS*), and the Generic layer (KG*, common functions); platform-specific OSD code (S*) sits below them, calling the operating system routines.]
Platform-Specific RAC
Many RAC problems are platform specific. The Operating System Dependency (OSD)
layer therefore must be examined for the platform concerned. The subdirectory is called sosd or osds.
This cannot be examined in TAO with cscope; you need the vobs access.
OSD code is partially available at /export/home/ssupport/920/rdbms/src/server/osds.
OSD Module: Example
[Diagram: the SKGXP module exists in three alternative versions behind one generic interface (1), skgxp.h. skgxp.c is the reference implementation; sskgxpu.c is the port-specific UDP implementation; sskgxph.c is the port-specific HMP implementation (HP-UX). Each version shares a common internal function set (2) and reaches the OS routines through its protocol (UDP, TCP, or HMP) via the OS API (3).]
OSD Module: Example
A module that needs to call the operating system must be port specific. Calling an I/O
routine may vary in name, arguments, and other particulars between platforms, even
though the routines provide the same functionality.
The skgxp module has an official upward API (1). Internally, there are some common functions and one way of achieving the necessary communication function of the SKGXP.
The UDP option, for example, performs the required OS-related calls through the OS API
(3) that send, receive, check status, and so on, by using UDP packets. It also possibly has
some code to hide or simulate functions so that the common set (2) is maintained. The functions are similar for the other protocol options.
The reference implementation is made to compile and work on all platforms, but the whole
module is additionally rewritten by most platform groups. As explained previously, a
platform group makes several versions by using different protocols. The version is selected at link
time by using the appropriate library. The HMP module, shown in this example, is only
available on the HP platform.
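The one-interface, many-implementations pattern can be sketched with an operations table. This is a conceptual illustration, not the real skgxp API: the `ipc_ops` structure and function names are invented, and the UDP functions are stubs that stand in for real socket calls.

```c
/* Every protocol variant fills in the same interface; only one such table
 * is linked into the server, which is what fixes the IPC protocol at
 * link time. (Names hypothetical.) */
struct ipc_ops {
    const char *protocol;
    int (*send)(const void *buf, int len);
    int (*receive)(void *buf, int len);
};

/* Stub "UDP" implementation; a real one would perform socket calls. */
static int udp_send(const void *buf, int len) { (void)buf; return len; }
static int udp_recv(void *buf, int len)       { (void)buf; return len; }

/* In a real build, linking sskgxpu.o (UDP) vs. sskgxph.o (HMP) would
 * provide this symbol; here the "UDP" variant is hard-wired. */
const struct ipc_ops ipc_linked = { "UDP", udp_send, udp_recv };
```

Callers above the OSD boundary go through `ipc_linked` and never need to know which protocol library was linked in.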
OSD Module: Example (continued)
Dependencies on the OSD Module
For the skgxp module, some OSD variants have additional interfaces callable from higher modules. The kcl module, for example, can call for a special memory map pointer for the HMP protocol. Higher levels in the stack have #ifdef ... #endif bracketed calls to the extended sskgxph.
Summary
In this lesson, you should have learned about the:
RAC architecture outline with internal references
Relationship between the RAC-related modules and the Oracle code stack
References
Main sources for general RAC information:
RAC Web site
http://rac.us.oracle.com:7778
RAC Pack repository on OFO
http://files.oraclecorp.com/content/AllPublic/Workspaces/RAC%20Pack-Public/
WebIV
Check folder Server.HA.RAC
Cluster Layer
Cluster Monitor
Objectives
After completing this lesson, you should be able to:
Describe the generic Cluster Manager (CM) functionality
Outline the interaction between the CM and RAC cluster layers
RAC and Cluster Software
[Diagram: a node runs the CM outside the instance; inside the instance, NM, CGS, the GRD, ksi/ksq/kcl, and the caches sit above IPC, which connects to other nodes (not shown).]
Cluster Layer in RAC
The cluster layer is not part of the RAC instance. The Cluster Manager (CM) is part of the
cluster layer.
It has its own communication path with the peer cluster software on other nodes. It can
determine the status of other nodes in the cluster but does not maintain any consistent view.
Most of the synchronization and consistency is handled in the Node Monitor (NM).
Generic CM Functionality:Distributed Architecture
Local cluster manager daemons
All daemons together make up the Cluster Manager
One daemon elected as master node
Generic CM Functionality: Distributed Architecture
Every node in the cluster must have a local CM daemon(s) running. The set of all CM
daemons makes up the Cluster Manager. The CM daemons on all nodes communicate with
one another. The CM daemons on all nodes may elect a master node, which is responsible
for managing cluster state transitions.
Upon communication failure, the remaining CM daemons form a new cluster using an
established protocol and re-elect a new master if necessary.
The CM and the RAC cluster are distinct entities acting as physically distinct services. The
CM is responsible for cluster consistency: it detects and manages cluster state
transitions, and it coordinates RAC cluster recovery brought about by cluster state transitions.
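One simple way a set of daemons can agree on a master after a membership change is to pick the lowest surviving node ID. Real CM implementations are vendor specific and the election protocol here is invented purely to illustrate the idea of deterministic re-election.

```c
/* Hypothetical election rule: given a liveness bitmap indexed by node id,
 * every surviving daemon independently computes the same answer, so no
 * extra negotiation is needed once membership is agreed. */
int elect_master(const int alive[], int nnodes) {
    for (int id = 0; id < nnodes; id++)
        if (alive[id]) return id;   /* lowest surviving node id wins */
    return -1;                      /* no live nodes: no cluster */
}
```

Because every node evaluates the same rule over the same membership list, all survivors converge on the same master without further messages.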
Generic CM Functionality:Cluster State
State change
Cluster Incarnation Number
Cluster Membership List
IDLM Membership List
Generic CM Functionality: Cluster State
A cluster is said to change state when one or more nodes join or leave the cluster. The
transition is complete when the cluster moves from the previous stable configuration to a
new one. Each stable configuration is identified by a number called the cluster incarnation
number; every state change in the cluster monotonically increases it.
The set of all nodes in a cluster forms the cluster membership list. The set of all nodes in the
cluster where the RAC IDLM is running forms the IDLM membership list. Every node in a
cluster is identified by a node ID provided by the CM, which remains unchanged during the lifetime of a cluster. The IDLM uses this node ID to identify and distinguish between
members in the IDLM membership list.
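The bookkeeping described above can be sketched in a few lines. The structure layout is hypothetical (invented for this sketch); the point is only that the incarnation number moves forward on every membership change, so stale messages tagged with an old incarnation can be recognized and discarded.

```c
/* Hypothetical cluster-state record held by each CM daemon. */
struct cluster_state {
    unsigned long incarnation;   /* monotonically increasing */
    unsigned int  members;       /* bitmap: bit n set => node n is in */
};

/* Every join or leave produces a new stable configuration with a
 * strictly larger incarnation number. */
void state_change(struct cluster_state *cs, unsigned int new_members) {
    cs->members = new_members;
    cs->incarnation++;
}
```

A message stamped with incarnation 5 arriving while the local state shows incarnation 7 can safely be treated as belonging to a dead configuration.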
Generic CM Functionality:Node Failure Detection
Node failure detection
Communication failure detection
Generic CM Funct ionality: Node Failure Detection
To ensure the integrity of the cluster, the CM must detect node failures. The RAC cluster may
suspect node failure (for example, upon a communication failure with a node), in which case it may:
Freeze activity and expect a message from the CM to start reconfiguration
Inform the CM of an error condition and await reconfiguration notification after a
new stable cluster state is established
If the CM and the RAC cluster are to detect the same communication failures, the CM should
monitor cluster health on the same physical circuit used by the RAC cluster (for example,
on HP the use of HMP). Performance considerations may require the CM and the RAC cluster to use separate virtual circuits.
If the CM and the RAC cluster are using separate physical circuits, the CM should be aware of
the RAC cluster's physical circuit and monitor cluster health via that circuit. The
CM may provide physical circuit redundancy for failover and performance.
RAC cluster reconfiguration begins after the cluster has reached a new stable state.
The CM must be able to handle nested state transitions and communicate these state
changes to the RAC cluster.
Nested cluster transitions interrupt any in-progress RAC cluster reconfiguration.
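The suspicion step described above is commonly heartbeat based. The predicate below is a conceptual sketch with invented names and thresholds, not the actual CM algorithm: a peer is declared suspect when no heartbeat has arrived within the timeout, which is what would trigger freezing activity and waiting for reconfiguration.

```c
/* Hypothetical liveness check: "now" and "last_heartbeat" are timestamps
 * in the same units (say, milliseconds), and "timeout" is the longest
 * silence tolerated before the peer is suspected. */
int peer_suspect(long now, long last_heartbeat, long timeout) {
    return (now - last_heartbeat) > timeout;
}
```

Real cluster managers typically require several consecutive missed heartbeats before acting, to avoid reconfiguring on a single dropped packet.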
Cluster Layer and Cluster Manager
RAC cluster registers the instance in the CM.
Primarily the LMON process
Secondarily other I/O-capable processes (DBWR, PQ slaves, ...)
Obtains node ID from the cluster
Node
Instance
CM
NM
Cluster Layer and Cluster Manager
The Cluster Manager is a vendor- or Oracle-provided facility to communicate between all
the nodes in the cluster about node state. The CM uses a different protocol or channel. It
uses heartbeat and sanity checks to validate node status. The RAC proces