DSI408 Real Application Clusters Internals



DSI408: Real Application Clusters Internals

    Electronic Presentation

    D16333GC10

    Production 1.0

    April 2003

    D37990


    Copyright 2003, Oracle. All rights reserved.

This documentation contains proprietary information of Oracle Corporation. It is provided under a license agreement containing restrictions on use and disclosure and is also protected by copyright law. Reverse engineering of the software is prohibited. If this documentation is delivered to a U.S. Government Agency of the Department of Defense, then it is delivered with Restricted Rights and the following legend is applicable:

Restricted Rights Legend

Use, duplication or disclosure by the Government is subject to restrictions for commercial computer software and shall be deemed to be Restricted Rights software under Federal law, as set forth in subparagraph (c)(1)(ii) of DFARS 252.227-7013, Rights in Technical Data and Computer Software (October 1988).

This material or any portion of it may not be copied in any form or by any means without the express prior written permission of the Education Products group of Oracle Corporation. Any other copying is a violation of copyright law and may result in civil and/or criminal penalties.

If this documentation is delivered to a U.S. Government Agency not within the Department of Defense, then it is delivered with Restricted Rights, as defined in FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).

The information in this document is subject to change without notice. If you find any problems in the documentation, please report them in writing to Worldwide Education Services, Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free.

Oracle and all references to Oracle Products are trademarks or registered trademarks of Oracle Corporation.

All other products or company names are used for identification purposes only, and may be trademarks of their respective owners.

    Authors

    Xuan Cong-Bui

    John P. McHugh

    Michael Mller

Technical Contributors and Reviewers

    Michael Cebulla

    Lex de Haan

    Bill Kehoe

    Frank Kobylanski

    Roderick Manalac

    Sundar Matpadi

    Sri Subramaniam

    Harald van Breederode

    Jim Womack

    Publisher

    Glenn Austin


    DSI408: Real Application

    Clusters Internals

    Volume 1 - Student Guide

    D16333GC10

    Edition 1.0

    April 2003

    37988


    Copyright 2003, Oracle. All rights reserved.

    This documentation contains proprietary information of Oracle Corporation. It is

    provided under a license agreement containing restrictions on use and disclosure and

    is also protected by copyright law. Reverse engineering of the software is prohibited.

    If this documentation is delivered to a U.S. Government Agency of the Department of

    Defense, then it is delivered with Restricted Rights and the following legend is

    applicable:

    Restricted Rights Legend

    Use, duplication or disclosure by the Government is subject to restrictions for

    commercial computer software and shall be deemed to be Restricted Rights software

    under Federal law, as set forth in subparagraph (c)(1)(ii) of DFARS 252.227-7013,

    Rights in Technical Data and Computer Software (October 1988).

    This material or any portion of it may not be copied in any form or by any means

    without the express prior written permission of Oracle Corporation. Any other copying

    is a violation of copyright law and may result in civil and/or criminal penalties.

    If this documentation is delivered to a U.S. Government Agency not within the

    Department of Defense, then it is delivered with Restricted Rights, as defined in

    FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).

    The information in this document is subject to change without notice. If you find any

    problems in the documentation, please report them in writing to Education Products,

Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free.

    Oracle and all references to Oracle Products are trademarks or registered trademarks

    of Oracle Corporation.

    All other products or company names are used for identification purposes only, and

    may be trademarks of their respective owners.

    Authors

    Xuan Cong-Bui

    John P. McHugh

    Michael Mller

    Technical Contributors

    and Reviewers

    Michael Cebulla

    Lex de Haan

    Bill Kehoe

Frank Kobylanski

Roderick Manalac

    Sundar Matpadi

    Sri Subramaniam

    Harald van Breederode

    Jim Womack

    Publisher

    Glenn Austin


    DSI408: Real Application

    Clusters Internals

    Volume 2 - Student Guide

    D16333GC10

    Edition 1.0

    April 2003

    D37989


    Copyright 2003, Oracle. All rights reserved.

    This documentation contains proprietary information of Oracle Corporation. It is

    provided under a license agreement containing restrictions on use and disclosure and

    is also protected by copyright law. Reverse engineering of the software is prohibited.

    If this documentation is delivered to a U.S. Government Agency of the Department of

    Defense, then it is delivered with Restricted Rights and the following legend is

    applicable:

    Restricted Rights Legend

    Use, duplication or disclosure by the Government is subject to restrictions for

    commercial computer software and shall be deemed to be Restricted Rights software

    under Federal law, as set forth in subparagraph (c)(1)(ii) of DFARS 252.227-7013,

    Rights in Technical Data and Computer Software (October 1988).

    This material or any portion of it may not be copied in any form or by any means

    without the express prior written permission of Oracle Corporation. Any other copying

    is a violation of copyright law and may result in civil and/or criminal penalties.

    If this documentation is delivered to a U.S. Government Agency not within the

    Department of Defense, then it is delivered with Restricted Rights, as defined in

    FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).

    The information in this document is subject to change without notice. If you find any

    problems in the documentation, please report them in writing to Education Products,

Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free.

    Oracle and all references to Oracle Products are trademarks or registered trademarks

    of Oracle Corporation.

    All other products or company names are used for identification purposes only, and

    may be trademarks of their respective owners.

    Authors

    Xuan Cong-Bui

    John P. McHugh

    Michael Mller

    Technical Contributors

    and Reviewers

    Michael Cebulla

    Lex de Haan

    Bill Kehoe

Frank Kobylanski

Roderick Manalac

    Sundar Matpadi

    Sri Subramaniam

    Harald van Breederode

    Jim Womack

    Publisher

    Glenn Austin


Contents

Preface

    I Course Overview DSI 408: RAC Internals

    Prerequisites I-2

    Course Overview I-3

    Practical Exercises I-5

    Section I: Introduction

    1 Introduction to RAC

    Objectives 1-2

    Why Use Parallel Processing? 1-3

    Scaleup and Speedup 1-5

    Scalability Considerations 1-7

    RAC Costs: Synchronization 1-9

    RAC Costs: Global Resource Directory 1-10

RAC Costs: Cache Coherency 1-12

RAC Terminology 1-14

    Terminology Translations 1-16

    Programmer Terminology 1-18

    History 1-19

    History Overview 1-20

    Internalizing Components 1-21

    Oracle7 1-22

    Oracle8 1-23

    Oracle8i 1-24

    Oracle9i 1-25

    Summary 1-26

2 Introduction to RAC Internals

Objectives 2-2

    Simple RAC Diagram 2-3

    One RAC Instance 2-4

    Internal RAC Instance 2-5

    Oracle Code Stack 2-6

    RAC Component List 2-7

    Module Relation View 2-8

    Alternate Module Relation View 2-9

    Module, Code Stack, Process 2-10

Operating System Dependencies (OSD) 2-11

Platform-Specific RAC 2-12

    OSD Module: Example 2-13

    Summary 2-15

    References 2-16


    Section II: Architecture

    3 Cluster Layer: Cluster Monitor

    Objectives 3-2

    RAC and Cluster Software 3-3

    Generic CM Functionality: Distributed Architecture 3-4

    Generic CM Functionality: Cluster State 3-5

    Generic CM Functionality: Node Failure Detection 3-6

    Cluster Layer and Cluster Manager 3-7

    Oracle-Supplied CM 3-8

    Summary 3-9

    4 Cluster Group Services and Node Monitor

    Objectives 4-2

    RAC and CGS/GMS and NM 4-3

Node Monitor (NM) 4-4

RDBMS SKGXN Membership 4-5

    NM Groups 4-6

    NM Internals 4-7

    Node Membership 4-8

    Instance Membership Changes 4-10

    NM Membership Death 4-12

    Starting an Instance: Traditional 4-13

    Starting an Instance: Internal 4-14

    Stopping an Instance: Traditional 4-15

    Stopping an Instance: Internal 4-16

    NM Trace and Debug 4-17

Cluster Group Services (CGS) 4-18

Configuration Control 4-19

    Valid Members 4-20

    Membership Validation 4-23

    Membership Invalidation 4-24

    CGS Reconfiguration Types 4-26

    CGS Reconfiguration Protocol 4-27

    Reconfiguration Steps 4-28

    IMR-Initiated Reconfiguration: Example 4-30

    Code References 4-32

    Summary 4-33

5 RAC Messaging System

Objectives 5-2

    RAC and Messaging 5-3

    Typical Three-Way Lock Messages 5-4

    Asynchronous Traps 5-5

    AST and BAST 5-6

    Message Buffers 5-7

    Message Buffer Queues 5-8


    Messaging Deadlocks 5-9

Message Traffic Controller (TRFC) 5-10

TRFC Tickets 5-11

TRFC Flow 5-13

Message Traffic Statistics 5-15

    IPC 5-18

    IPC Code Stack 5-19

    Reference Implementation 5-20

    KSXP Wait Interface to KSL 5-21

    KSXP Tracing 5-22

    KSXP Trace Records 5-23

    SKGXP Interface 5-24

    Choosing an SKGXP Implementation 5-25

    SKGXP Tracing 5-26

    Possible Hang Scenarios 5-27

Other Events for IPC Tracing 5-28

Code References 5-29

    Summary 5-30

    6 System Commit Number

    Objectives 6-2

    System Commit Number 6-3

    Logical Clock and Causality Propagation 6-4

    Basics of SCN 6-5

    SCN Latching 6-7

    Lamport Implementation 6-8

    Lamport SCN 6-9

Limitations on SCN Propagation 6-10

max_commit_propagation_delay 6-11

    Piggybacking SCN in Messages 6-12

    Periodic Synchronization 6-13

    SCN Generation in Earlier Versions of Oracle 6-14

    Code References 6-15

    Summary 6-16

    7 Global Resource Directory: Formerly the Distributed Lock Manager

    Objectives 7-2

    RAC and Global Resource Directory (GRD) 7-3

    DLM History 7-4

    DLM Concepts: Terminology 7-5

    DLM Concepts: Resources 7-6

    DLM Concepts: Locks 7-7

    DLM Concepts: Processes 7-8

    DLM Concepts: Shadow Resources 7-9

    DLM Concepts: Copy Locks 7-10

    Resource or Lock Mastering 7-11

    Basic Resource Structures 7-12


    DLM Structures 7-13

    Lock Mode Changes 7-16

    Simple Lock Changes on a Resource 7-17

Changes on a Resource with Deadlock 7-18

DLM Functions 7-19

    DLM Functionality in Global Enqueue Service Daemon (LMD0) 7-20

    DLM Functionality in Global Enqueue Service Monitor (LMON) 7-22

    DLM Functionality in Global Cache Service Process (LMS) 7-23

    DLM Functionality in Other Processes 7-24

    Configuring GES Resources 7-25

    Configuring GES Locks 7-26

    Configuring GCS Resources 7-27

    Configuring GCS Locks 7-28

    Configuring DLM processes 7-29

    Logical to Physical Nodes Mapping 7-30

    Buckets to Logical Nodes Mapping 7-31

    Mapping for a New Node Joining the Cluster 7-32

    Remapping When Node Joins 7-34

    Mapping Broadcast by Master Node 7-35

    Master Node Determination for GES 7-36

    Master Node Determination for GCS 7-37

    Dump and Trace of Remastering 7-38

DLM Functions 7-39

kjual Connection to DLM 7-40

kjual Flow 7-42

kjpsod Flow 7-43

DML Enqueue Handling Flow: Example 7-44

Step 1: P1 Locks Table in Share Mode 7-45

    Step 2: P2 Locks Table in Share Mode 7-46

    Step 3: P2 Does Rollback 7-47

    Step 4: P1 Locks Table in Exclusive Mode 7-48

    Step 5: P3 Locks Table in Share Mode 7-49

    Step 6: P1 Does Rollback 7-50

Steps 1 and 2: Code Flow 7-51

Step 1: kjusuc Flow Detail 7-52

Step 2: kjusuc Flow Detail 7-54

Step 3: Code Flow 7-55

Step 3: kjuscl Flow Detail 7-56

Step 4: Code Flow 7-57

Step 4: kjuscv Flow Detail 7-58

Step 5: kjuscv Flow Detail 7-60

Step 6: kjuscl Flow Detail 7-61

    Code References 7-63

    Summary 7-64

    References and Further Reading 7-65


    8 Cache Coherency (Part One): Enqueues/Non-PCM

    Objectives 8-2

    Cache Coherency: Enqueues 8-3

Enqueue Types 8-6

Enqueue Structure 8-7

    Examining Enqueues 8-8

    Enqueues and DLM 8-9

    Source Tree for Non-PCM Lock Flow 8-10

    Lock Modes 8-11

    Lock Compatibility 8-12

    Deadlock Detection: The Classic Deadlock 8-13

    Deadlock Detection: A More General Example 8-15

    Deadlock Detection and Resolution 8-16

    Timeout-Based Deadlock Detection 8-17

    Deadlock Graph Printout 8-18

    Deadlock Flow 8-19

    Deadlock Flow: One Node 8-21

    Deadlock Flow: Two Nodes 8-22

    Parallel DML (PDML) Deadlocks 8-23

    Deadlock Detection Algorithm 8-24

    Deadlock Validation Steps 8-27

    Code References 8-28

    Summary 8-29

    9 Cache Coherency (Part Two): Blocks/PCM Locks

    Objectives 9-2

    Cache Coherency: Blocks 9-3

    Block Cache Contention 9-4

    Earlier Cache Coherency: Oracle8 Ping Protocol 9-5

    Earlier Cache Coherency: Oracle8i CR Server 9-6

    Earlier Cache Coherency: Oracle8i CR Server 9-7

    Oracle9i Cache Fusion Protocol 9-8

    GCS (PCM) Locks 9-9

    PCM Lock Attributes 9-10

    Lock Modes 9-11

    Lock Roles 9-12

    Past Image 9-13

    Local Lock Role 9-14

Global Lock Role 9-15

Block Classes 9-16

    Lock Elements (LE) 9-17

    Allocation of New LE 9-18

    Hash Chain of LE 9-19

    Block to LE Mapping 9-20

    Queues of LE for LMS 9-21

    LMSn Free of LE 9-22

    Cache Fusion Examples: Overview 9-23


    Cache Fusion: Example 1 9-25

    Cache Fusion: Example 2 9-26

    Cache Fusion: Example 3 9-27

Cache Fusion: Example 4 9-28

Cache Fusion: Example 5 9-29

    Cache Fusion: Example 6 9-30

    Cache Fusion: Example 7 9-31

    Cache Fusion: Example 8 9-32

    Cache Fusion: Example 9 9-33

    Cache Fusion: Example 10 9-34

    Cache Fusion: Example 11 9-35

    Views 9-36

    Parameters 9-39

    Summary 9-40

10 Cache Fusion 1: CR Server

Objectives 10-2

    Cache Fusion: Consistent Read Blocks 10-3

    Consistent Read Review 10-4

    Getting a CR Buffer 10-5

    Getting a CR Buffer in Oracle9i Release 2 10-7

    CR Server in Oracle9i Release 2 10-8

    CR Requests 10-9

    Light Work Rule 10-11

    Fairness 10-12

    Statistics 10-13

Wait Events 10-14

Fixed Table X$KCLCRST Statistics 10-15

    CR Requestor-Side Algorithm 10-16

    CR Requestor-Side AST Delivery 10-21

    CR Requestor-Side CR Buffer Delivery 10-22

    CR Server-Side Algorithm 10-23

    Summary 10-27

    11 Cache Fusion 2: Current Block: XCUR

    Objectives 11-2

    Cache Fusion: Current Blocks 11-3

    PCM Locks and Resources 11-4

Fusion: Long Example 11-5

Initial State 11-7

    Step 1: Instance 3 Performs SELECT 11-8

    Lock Changes in Instance 3 11-9

    Lock Changes in Instance 2 11-10

    Step 2: Instance 2 Performs SELECT 11-11

    Lock Changes in Instance 2 11-12

    Step 3: Instance 2 Performs UPDATE 11-13

    Lock Changes in Instance 2 11-14


    Lock Changes in Instance 3 11-15

    Step 4: Instance 1 Performs UPDATE 11-16

    Lock Changes in Instance 2 11-17

Lock Changes in Instance 1 11-18

Step 5: Instance 3 Performs SELECT 11-19

    Lock Changes in Instance 3 11-20

    Step 6: Instance 1 Performs WRITE 11-21

    Lock Changes in Instance 2 11-22

    Lock Changes in Instance 1 11-23

    Tables and Views 11-24

    Summary 11-26

    12 Cache Fusion Recovery

    Objectives 12-2

Non-Cache Fusion OPS and Database Recovery 12-3

Cache Fusion RAC and Database Recovery 12-4

Overview of Fusion Lock States 12-5

    Instance or Crash Recovery 12-6

    SMON Process 12-7

    First-Pass Log Read 12-8

    Block Written Record (BWR) 12-9

    BWR Dump 12-10

    Recovery Set 12-11

Recovery Claim Locks 12-12

IDLM Response to RecoveryClaimLock Message on PCM Resource 12-13

    No Lock Held by Recovering Instance on the PCM Resource 12-14

    Recovery Claim Locks 12-15

    Second-Pass Log Read 12-17

    Large Recovery Set and Partial IR Lock Mode 12-19

    Lock Database Availability During Recovery 12-22

    Handling BASTs on Recovery Buffers 12-23

    IR of Nonfusion Blocks 12-24

    Failures During Instance Recovery 12-26

    Memory Contingencies 12-28

    Code References 12-29

    Summary 12-31

    Section III: Platforms

13 Linux Platform

Objectives 13-2

    Linux RAC Architecture 13-3

    Storage: Raw Devices 13-4

    Extended Storage 13-5

    Linux Cluster Software 13-6

    OCMS 13-7

    OCMS Components 13-8


    WDD, NM, and CM Flow (Up to version 9.2.0.1) 13-9

    Watchdog Daemon 13-10

    Hangcheck, NM, and CM Flow (After version 9.2.0.2) 13-11

    Hangcheck Module 13-12

    Node Monitor (NM) 13-13

    Cluster Manager 13-14

    Linux Port-Specific Code 13-15

    Cluster Manager 13-16

    skgxpt and skgxpu 13-17

    Installing RAC on Linux 13-18

    Running RAC on Linux 13-21

    Starting CM 13-22

    Starting WDD 13-23

Starting NM 13-24

Starting CM 13-25

    Debugging 13-26

    Summary 13-27

    References 13-28

    14 HP-UX Platform

    Objectives 14-2

    HP-UX RAC Architecture 14-3

    HP-UX Cluster Software 14-4

    HP-UX Port-Specific Code 14-5

SKGXP (UDP Implementation) 14-6

SKGXP: Lowfat 14-7

    Installing RAC on HP-UX 14-8

    Running RAC on HP-UX 14-9

    Debugging on HP-UX 14-10

    Summary 14-11

    15 Tru64 Platform

    Objectives 15-2

    Tru64 RAC Architecture 15-3

    Shared Disk Systems 15-4

Tru64 Cluster Software 15-5

Tru64 Port-Specific Code 15-6

    Node Monitor: SKGXN 15-7

    IPC: SKGXP 15-8

    SKGXPM: RDG 15-9

    Installing RAC on Tru64 15-11

    Debugging on Tru64 15-12


    Useful Tru64 Commands 15-13

    Summary 15-15

16 AIX Platform

Objectives 16-2

    AIX RAC Architecture 16-3

    AIX SP Clusters 16-4

    AIX HACMP Clusters 16-5

    AIX Cluster Software 16-6

    AIX Cluster Layer 16-7

    AIX Port-Specific Code 16-8

    RAC on AIX Stack 16-9

    Node Monitor (NM) 16-10

    Installing RAC on AIX 16-12

    Debugging on AIX 16-14

    Summary 16-15

    References 16-16

    17 Other Platforms

    Objectives 17-2

    RAC Architecture: Solaris 17-3

    RAC Architecture: Windows 17-4

    RAC Architecture: OpenVMS 17-5

    Port-Specific Code 17-6

    Installing RAC 17-7

    Summary 17-8

    Section IV: Debug

    18 V$ and X$ Views and Events

    Objectives 18-2

    V$ and GV$ Views 18-3

    List of Views 18-4

    Old and New Views 18-5

    V$ Views for Lock Information 18-6

    X$ Tables 18-7

    Events 18-8

    19 KST and X$TRACE

    Objectives 19-2

    KST: X$TRACE 19-3

    KST Concepts 19-4

    KST Concepts 19-6

    Circular Buffer 19-7


Data Structure kstrc 19-8

    Trace Control Interfaces 19-9

    KST Initialization Parameters 19-10

    KST Trace Control Interfaces 19-12

    KST Fixed Table Views 19-14

    KST Trace Output 19-15

    KST Current Instrumentation 19-18

    KST Performance 19-19

    KST: Examples 19-20

    KST Sample Trace File 19-24

    KST Demonstration 19-25

    DIAG Daemon 19-26

    DIAG Daemon: Features 19-27

DIAG Daemon: Design 19-29

DIAG Daemon: Startup and Shutdown 19-33

    DIAG Daemon: Crash Dumping 19-34

    Summary 19-36

    20 ORADEBUG and Other Debugging Tools

    Objectives 20-2

    ORADEBUG 20-3

    Flash Freeze 20-5

    LKDEBUG 20-6

    NSDBX 20-7

HANGANALYZE 20-8

Summary 20-9

    References 20-10

    Appendix A: Practices

    Appendix B: Solutions



    Course Overview

    DSI 408: RAC Internals


    Prerequisites

Before taking this course, you should have:
- Taken DSI 401, 402, and 403 so that you know about the server internals on crashes, dumps, transactions, block handling, and recovery systems
- Taken the Real Application Clusters (RAC) administration course so that you know about the external view of RAC
- Performed at least one RAC installation and assisted in at least one RAC debugging case

    Prerequisites

    The prerequisites ensure that the course is useful to you, instead of being too hard, and that

    the instructor need not cover basic material.

    You must have your TAO account ready for examining source code.


    Course Overview

The course includes the following four sections:
- Introduction
- Architecture
- Platforms
- Debug

Subjects that are not covered include:
- Utilities (srvctl, OCFS, HA)
- Performance tuning
- Pre-Oracle9i versions (OPS)

    Course Overview

    This course contains four sections. It is scheduled to take four days but does not require

    one day per section. Most of the time is spent on the Architecture section.

    Introduction

    The Introduction section provides a summary of the public RAC architecture and its

    accurate terminology. An overview of architecture changes between versions is also given.

    Architecture

    The Architecture section covers the theory of operation of RAC. The RAC code stack is

    examined from the bottom up. There are many references to the source code.

    Platforms

    The Platforms section covers the differences and architectural details of RAC

    implementation on different platforms. Installation issues and known gotchas are

    included.


    Course Overview (continued)

    Debug

    The Debug section provides a detailed explanation of the trace and dump mechanisms that

    are placed inside RAC for fault location. A number of practical exercises use these

    mechanisms.

    Subjects not Covered

    This course does not cover utility modules that are not part of the primary core RAC

    functionality. It also does not cover some of the external programs that RAC depends on.

    Performance is not covered as a separate topic. The knowledge from this course should be

    sufficient to identify performance bottlenecks that are purely relevant to RAC; otherwise,

    tuning is the same as for a single instance.

    For versions of Oracle Parallel Server, you should review earlier courses. In earlier courses,

    the differences between RAC and OPS are pointed out, whereas the RAC knowledge in

    this course is not applicable to OPS.


    Practical Exercises

    The course includes practical exercises. Exercises run on a shared Solaris cluster.

    Practical Exercises

The cluster hardware is shared between students and other classes; this prevents practices that involve node shutdown or breaking the interconnect.


Section I: Introduction

[Diagram: two RAC instances, each with a SQL layer, buffer cache, GES/GCS, CGS, and Node Monitor stacked on the Cluster Manager, communicating with each other over IPC.]


    Introduction to RAC


    Objectives

After completing this lesson, you should be able to do the following:

Review the design objectives of Real Application Clusters (RAC)

    Relate Oracle9i RAC to its predecessors


    Why Use Parallel Processing?

    Scaleup: Increased throughput

Speedup: Increased performance or faster response

    Higher availability

    Support for a greater number of users

    Why Use Parallel Processing?

    Scaleup: Increased Throughput

    Parallel processing breaks a large task into smaller subtasks that can be performed

    concurrently. With tasks that grow larger over time, a parallel system that also grows (or

    scales up) can maintain a constant time for completing the same task.

    Speedup: Increased Performance

    For a given task, a parallel system that can scale up improves the response time for

    completing the same task.

For decision support system (DSS) applications and parallel queries, parallel processing decreases the response time.

    For online transaction processing (OLTP) applications, speedup cannot be expected

    due to the overhead of synchronization. Depending on the precise circumstances, a

    decrease in performance can occur.


    Why Use Parallel Processing? (continued)

    Higher Availability

    Because each node running in the parallel system is isolated from other nodes, a single node

    failure or crash should not cause other nodes to fail. Other instances in the parallel server

    environment remain up and running.

The operating system's failover capabilities and the fault tolerance of the distributed cluster software are important infrastructure components.

    Support for a Greater Number of Users

    Each node can support several users because each node has its own set of resources, such as

    memory, CPU, and so on. As nodes are added to the system, more users can also be added,

    allowing the system to continue to scale up.


    Scaleup and Speedup

[Diagram: the original system completes 100% of a task in a given time on one set of hardware. With scaleup, adding hardware lets the cluster complete up to 200% or 300% of the task in the same time. With speedup, two sets of hardware each handle 50% of the task, completing it in less time.]

    Scaleup and Speedup

    Scaleup

    Scaleup is the capability of providing continued increases in throughput in the presence of

    limited increases in processing capability while keeping the time constant:

Scaleup = (volume parallel) / (volume original) - time for interprocess communication

    For example, if 30 users consume close to 100% of the CPU during their normal

    processing, adding more users would cause the system to slow down due to contention for

    limited CPU cycles. By adding CPUs, however, extra users can be supported without

degrading performance.

Speedup

    Speedup is the capability of providing continued increases in speed in the presence of

    limited increases in processing capability while keeping the task constant:

Speedup = (time original) / (time parallel) - time for interprocess communication

    Speedup results in resource availability for other tasks. For example, if queries normally

    take 10 minutes to process, and running in parallel reduces the time to 5 minutes, then

    additional queries can run without introducing the contention that might occur if they were

    to run concurrently.
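As a quick check of the formula with the numbers from this example (and ignoring the interprocess communication term):

Speedup = (time original) / (time parallel) = 10 minutes / 5 minutes = 2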


Scaleup and Speedup (continued)

Speedup (continued)

Example 1: A particular application might take N seconds to fully scan and produce a summary of a 1 GB table.

With scaleup, if the table doubles in size, then doubling hardware resources should allow the query to still complete in N seconds.

With speedup, if the table does not grow in size, doubling the hardware resources should allow the query to complete in N/2 seconds.

    Example 2: A particular application might have 100 users, each getting a three-second

    response on queries.

    With scaleup, if the number of users doubles in size, then doubling hardware resources

    should allow response time to remain at three seconds.

    With speedup, if the number of users remains the same, doubling the hardware resources

    should reduce the response time. This occurs only if the three-second activity can be

    broken down into two separate activities that can run independently of each other.

    A Success Example of Scaleup

    The following testimonial is from the internal RAC mailing list. This was a response to

    a question about the ease of changing a single instance to an RAC system.

    Just yesterday, we tested with a customer a migration from single instance to two-node

    RAC on Solaris. They were using Veritas DBE/AC for the cluster system.

    These are the steps we took:

    1. Node 1 Server running 9i single instance at approx 80% CPU load.

2. Connection through Transparent Application Failover with 40 retries and a delay of five seconds (a sample tnsnames.ora entry with these settings appears after these steps).

    3. Alter shared initialization file to set Cluster Database = true and add extra

    parameters for the second node (bdump location and so on).

    4. Shut down Database on Node 1.

    5. Start up Database on Node 2 using new initialization file.

    6. Start up Database on Node 1 using new initialization file.

    At this point we had 85% of users on Node 1 and 15% on Node 2.

    7. Run a script to disconnect sessions on Node 1 to allow them to load balance across

to Node 2.

At this point we had 50% of users on Node 1 and 50% on Node 2. The database was no

    longer highly loaded and we were able to add more (now load-balanced) users.

    The application was written in Java and was TAF-aware (i.e., it knew to retry transactions

    with certain warning messages). Once we added the second node, the TPMs per Node

    remained approximately the same so we had over 1.9 x improvement in TPMs, which was

    pretty good scaling.
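For reference, the TAF behavior mentioned in step 2 (40 retries, five-second delay) is normally configured in the client's tnsnames.ora entry. The following is only an illustrative sketch; the alias, host names, and service name are invented and were not part of the test described above:

RAC_TAF =
  (DESCRIPTION =
    (LOAD_BALANCE = ON)
    (FAILOVER = ON)
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = node2)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = racdb)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 40)(DELAY = 5))
    )
  )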


Scalability Considerations

Hardware: Disk I/O

Internode communication: High bandwidth and low latency

Operating system: Number of CPUs (for example, SMP)

    Cache Coherency and the Global Cache Service

    Database: Design

    Application: Design

    Scalability Considerations

    It is important to remember that if any of these six areas are not scalable (no matter how

    scalable the other areas are), parallel cluster processing may not be successful.

    Hardware scalability: High bandwidth and low latency offer the maximum scalability.

    A high amount of remote I/O may prevent system scalability, because remote I/O is

    much slower than local I/O.

    Bandwidth of the communication interface is the total size of messages that can be

    sent per second. Latency of the communication interface is the time required to place

    a message on the interconnect. It indicates the number of messages that can be put on

    the interconnect per unit of time.

    Operating system: Nodes with multiple CPUs and methods of synchronization in the

    OS can determine how well the system scales. Symmetric multiprocessing can

    process multiple requests to resources concurrently.


Scalability Considerations (continued)

"The processes that manage local resource coordination in a cluster database are

    identical to the local resource coordination processes in single instance Oracle. This

    means that row and block level access, space management, system change number

    (SCN) creation, and data dictionary cache and library cache management are the

    same in Real Application Clusters as in single instance Oracle. If the resource is

modified by more than one instance, then RAC performs further synchronization on a global level to permit shared access to this block across the cluster. Synchronization

    in this case requires intranode messaging as well as the preparation of consistent read

    versions of the block and the transmission of copies of the block between memory

    caches within the cluster database." (See Oracle9i Real Application Clusters

Concepts Release 2 (9.2), Part Number A96597-01, Chapter 5, Real Application

    Clusters Resource Coordination.)

    Database scalability: Database scalability depends on how well the database is

    designed (for example, how the data files are arranged, how well the locks are

    allocated, and how well the objects are partitioned).

    Scalability of the application: Application design is one of the keys to taking

    advantage of the other elements of scalability. Regardless of how well the hardware

    and database scale, parallel processing does not work as desired if the application

    does not scale.

    A typical cause for the lack of scalability is one common shared resource that must be

    accessed often. This causes the otherwise parallel operations to serialize on this bottleneck.

    A high latency in the synchronization increases the cost of synchronization, counteracting

    the benefits of parallelization. This is a general limitation and not a RAC-specific

    limitation.


    RAC Costs: Synchronization

To scale, there is a cost in synchronization: Scalability = Synchronization

Less synchronization = Speedup and scaleup

Synchronization is necessary to maintain cache coherency in RAC.

    RAC Costs: Synchronization

    Synchronization is a necessary part of parallel processing, but for parallel processing to be

    advantageous, the cost of synchronization must be determined.

    Synchronization provides the coordination of concurrent tasks and is essential for parallel

    processing to maintain data integrity or correctness. Proper locking between disjoint SGAs

    (Oracle instances) must be maintained to ensure correct data. This is cache coherency.

    Partitioning can help reduce synchronization costs because there are fewer

    concurrent tasks (that is, fewer concurrent users modifying the same set of data).

An application that modifies a small set of data can cause a high overhead for synchronization if performed in disjoint SGAs.

    Contention occurs between instances using a single block or row, such as a table with

    one row that is used to generate sequence numbers.

    Two ways to synchronize:

    Locks: Latches, enqueues, locks

    Messages: Send/wait for messages

Synchronization = Amount x Cost

Amount: How often do you need to synchronize?

    Cost: How expensive is it to synchronize?


Levels of Synchronization

- Row level (database): the Oracle row-locking feature
  - Maximizes concurrency
  - SCN coherency
- Local cache level (intra-instance): every buffer in the cache is protected by logical semaphores (spin latches)
  - Access to buffers is synchronized (CACHE BUFFERS CHAINS, CACHE BUFFER HANDLES latches)
- Global Cache Fusion (inter-instance DLM): every buffer in every cache is tracked by the GCS
  - Cache coherency / cache consistency
  - Global Resource Directory managed by the Global Cache Service (GCS) (the DLM in pre-9i)

Cache coherency: the synchronization of data in multiple caches so that reading a memory location by way of any cache will return the most recent data written to that location by way of any other cache. Sometimes called cache consistency.


Levels of Synchronization: Row Level

[Diagram: two foreground processes (fg1, fg2) in one instance update row1 and row2 in blocks 100 and 101 of the database; the database, instance, and global cache (iDLM) layers are shown.]

Enqueues are local locks that serialize access to various resources. This wait event indicates a wait for a lock that is held by another session (or sessions) in a mode incompatible with the requested mode. See the V$LOCK reference for details of which lock modes are compatible with which. Enqueues are usually represented in the format "TYPE-ID1-ID2", where:

"TYPE" is a 2-character text string

"ID1" is a 4-byte hexadecimal number

"ID2" is a 4-byte hexadecimal number

A query sketch showing these fields follows.


Levels of Synchronization: Local Cache

[Diagram: two foreground processes (fg1, fg2) in the same instance update row1 and row2 through the buffer cache; access to the cached copies of blocks 100 and 101 is synchronized within the local cache, below the global cache (iDLM) layer.]


Levels of Synchronization: Global Cache

[Diagram: foreground processes on two instances update row1 and row2 in their own buffer caches; the Global Resource Directory in the global cache (iDLM) coordinates access to blocks 100 and 101.]

Global resources: inter-instance synchronization mechanisms that provide cache coherency for Real Application Clusters. The term can refer to both Global Cache Service (GCS) resources and Global Enqueue Service (GES) resources.


    We need a cache

[Diagram: foreground processes serialize their access to blocks directly on the database disks.]

Sequencing operations guarantees consistency of the data, but it minimizes the level of concurrency of the system, and the time to complete a sequence of operations depends on the slowest element: the disks. Serialization is the easiest method to manage concurrency but, conversely, it costs in terms of system throughput. Evolutions of Oracle minimize the set of tasks that are serialized. Given a set of tasks [T1, T2, ..., Tn] that arrive at times [t1, ...]


Coherency

The systems reach a maximum level of concurrency.

[Diagram: block 100 is on disk at SCN 800, and the resource (Res: 1, 0x100) is held SS while the block is cached in two buffer caches; fg1 selects row1 with a snapshot starting at SCN 900, and fg2 selects row2 with a snapshot starting at SCN 1010.]

Example: ALTER SYSTEM DUMP DATAFILE 5 BLOCK 4690;

Syntax:
ALTER SYSTEM DUMP DATAFILE {'filename' | filenumber}
  { BLOCK {blockno} | BLOCK MIN {blockno} BLOCK MAX {blockno} }

Note: the block dump reports the buffer cache copy of the block if the block is CURRENT/dirty in the current instance.

alter session set events 'immediate trace name BUFFER level <level>';


Coherency costs of locks


    Fixed*/Releasable 1:M lock model (static)

[Diagram: a single global cache (iDLM) lock covers several database blocks (100 through 104) for the instance.]

(*) Starting with Oracle9i, the fixed locking mode was removed.

    GC_FILES_TO_LOCKS = 1=100:2=0:3=1000:4-5=0EACH

GC_FILES_TO_LOCKS = {file_list = lock_count [!blocks] [EACH] [:...]}
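Reading the example value above against this syntax (an illustrative interpretation; it assumes the documented convention that a lock count of 0 means releasable locks and that EACH applies the count to each file in the list):

# init.ora fragment (illustrative only)
# 1=100      -> datafile 1 is covered by 100 PCM locks hashed across its blocks
# 2=0        -> datafile 2 uses releasable locks, allocated as needed
# 3=1000     -> datafile 3 is covered by 1000 PCM locks
# 4-5=0EACH  -> datafiles 4 and 5 each use releasable locks
GC_FILES_TO_LOCKS = "1=100:2=0:3=1000:4-5=0EACH"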

PCM lock names

Type is always BL (because PCM locks are buffer locks).

ID1 is the block class (described in Classes of Blocks).

ID2: For fixed locks, ID2 is the lock element (LE) index number obtained by hashing the block address (see the GV$LOCK_ELEMENT/GV$GC_ELEMENT fixed view). For releasable locks, ID2 is the database address of the block.

Non-PCM locks

CF  Controlfile Transaction
CI  Cross-Instance Call Invocation
DF  Datafile
DL  Direct Loader Index Creation
DM  Database Mount
DX  Distributed Recovery
FS  File Set
KK  Redo Log Kick
IN  Instance Number
IR  Instance Recovery
IS  Instance State
MM  Mount Definition
MR  Media Recovery
ST  Space Management Transaction
IV  Library Cache Invalidation
L[A-P]  Library Cache Lock
N[A-Z]  Library Cache Pin
Q[A-Z]  Row Cache
PF  Password File
PR  Process Startup
PS  Parallel Slave Synchronization
RT  Redo Thread
SC  System Commit Number
SM  SMON
SN  Sequence Number
SQ  Sequence Number Enqueue
SV  Sequence Number Value
TT  Temporary Table


    False Pinging

[Diagram: under the 1:M model, one lock element (LE 23) covers several blocks (dba 101, 103, 105) that are dirty in the instance's buffer cache while fg1 is updating.]

If another instance needs access to dba 100, the owning instance must ping (write out) all the dirty blocks that are covered by the same LE.


    Releasable 1:1 lock model (dynamic)

[Diagram: under the 1:1 releasable model, each cached block (dba 101, 103, 105) is covered by its own lock element (for example, LE 100 and LE 105) while fg1 is updating.]

break on GC_ELEMENT_NAME

select inst_id, gc_element_name, class, mode_held
  from gv$gc_element
 where gc_element_name > 20970000
 order by gc_element_name;

   INST_ID GC_ELEMENT_NAME      CLASS  MODE_HELD
---------- --------------- ---------- ----------
         1        20971522          0          5
         1        20971523          0          5
         1        20971913          0          3
         1        20971914          0          3
         1        20976209          0          3
         2                          0          3
         1        20976210          0          0
         2                          0          5

(GC_ELEMENT_NAME is the data block address of the block; written in hex it splits into File#,Block#.)
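To translate a GC_ELEMENT_NAME (a data block address) into a file and block number, the standard DBMS_UTILITY functions can be used; an illustrative query that takes one of the values from the output above:

select dbms_utility.data_block_address_file(20971522)  as file#,
       dbms_utility.data_block_address_block(20971522) as block#
  from dual;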


    Scalability

Scaleup

Scaleup is the capability to provide continued increases in throughput in the presence of limited increases in processing capability while keeping time constant:

Scaleup = (volume parallel) / (volume original)

Speedup

Speedup is the capability to provide continued increases in speed in the presence of limited increases in processing capability, while keeping the task constant:

Speedup = (time original) / (time parallel)


    RAC Costs: Global Resource Directory

Single instance: Synchronization of concurrent tasks and access to shared resources

Global Resource Directory (GRD) to record information about how resources are used within a cluster database. The Global Cache Service (GCS) and Global Enqueue Service (GES) manage the information in this directory. Each instance maintains part of the global resource directory in the System Global Area (SGA).

    RAC Costs: Global Resource Directory

    In single-instance environments, locking coordinates access to a common resource, such as

    a row in a table. Locking prevents two processes from changing the same resource (or row)

    at the same time.

    In RAC environments, internode synchronization is critical because it maintains proper

    coordination between processes on different nodes, preventing them from changing the

    same resource at the same time. Internode synchronization guarantees that each instance

    sees the most recent version of a block in its buffer cache.


RAC Costs: Global Resource Directory (continued)

    Resource coordination within Real Application Clusters occurs at both an instance level

    and at a cluster database level. Instance level resource coordination within Real

    Application Clusters is referred to as local resource coordination. Cluster level

    coordination is referred to as global resource coordination.

    The processes that manage local resource coordination in a cluster database are identical to

the local resource coordination processes in single instance Oracle. This means that row and block level access, space management, system change number (SCN) creation, and

    data dictionary cache and library cache management are the same in Real Application

    Clusters as in single instance Oracle.

    If the resource is modified by more than one instance, then RAC performs further

    synchronization on a global level to permit shared access to this block across the cluster.

    Synchronization in this case requires intranode messaging as well as the preparation of

    consistent read versions of the block and the transmission of copies of the block between

    memory caches within the cluster database." (See Oracle9i Real Application Clusters

Concepts Release 2 (9.2), Part Number A96597-01, Chapter 5, Real Application Clusters Resource Coordination.)

    Note: Global Cache Service (GCS) and Global Enqueue Service (GES) do not interfere

    with row-level locking and vice versa. Row-level locking is a transaction feature.


    RAC Costs: Cache Coherency

Cache coherency is the technique of keeping multiple copies of an object consistent between different Oracle instances.

    RAC Costs: Cache Coherency

    Maintaining cache coherency is an important part of a cluster. Cache coherency is the

    technique of keeping multiple copies of an object consistent between different Oracle

    instances (or disjoint caches) on different nodes.

    Global cache management ensures that access to a master copy of a data block in an SGA

    is coordinated with the copy of the block in other SGAs.

    Therefore, the most recent copy of a block in all SGAs contains all changes that are made

    to that block by any instance in the system, regardless of whether those changes have been

    committed on the transaction level. Full redo protection of the block changes is maintained.


    RAC Costs: Cache Coherency

[Diagram: three nodes, each running an instance (A, B, C) with its own SGA; the GES/GCS processes on each node coordinate the caches across the cluster.]

    RAC Costs: Cache Coherency (continued)

    The cost (or overhead) of cache coherency is the need before any access to a specific

    shared resource to first check with the other instances whether this particular access is

    permitted. The algorithms optimize the need to coordinate on each and every access, but

    some overhead is incurred.

    The GCS tracks the locations, modes, and roles of data blocks. The GCS therefore also

    manages the access privileges of various instances in relation to resources. Oracle uses the

    GCS for cache coherency when the current version of a data block is in one instance's

buffer cache and another instance requests that block for modification. If an instance reads a block in exclusive mode, then in subsequent operations multiple transactions within the

    instance can share access to a set of data blocks without using the GCS. This is true,

    however, only if the block is not transferred out of the local cache. If the block is

    transferred out of the local cache, then the GCS updates the Global Resource Directory

that the resource has a global role; whether the resource's mode converts from exclusive to

    another mode depends on how other instances use the resource.
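One way to observe these block copies from SQL is the standard V$BH fixed view, whose STATUS column distinguishes current, consistent-read, and past-image buffers; an illustrative query:

select status, count(*)
  from v$bh
 group by status;
-- typical status values: xcur (exclusive current), scur (shared current),
-- cr (consistent read copy), pi (past image), free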


    RAC Terminology (continued)

    Data buffer cache blocks are the most obvious and most heavily used global resource.

    There are other data item resources that are global in the cluster, such as transaction

    enqueues and database data structures. The data buffer cache blocks are handled by the

    Global Cache Service (GCS), and Parallel Cache Management (PCM). The nondata

    block resources are handled by Global Enqueue Services (GES), also called Non-Parallel Cache Management (non-PCM).

    The Global Resource Manager (GRM) keeps the lock information valid and correct

    across the cluster.

From the module skgxn.h:

Node: An individual computer with one or more CPUs, some memory, and access to disk storage (generally capable of running an instance of OPS).

Cluster: A collection of loosely coupled nodes that support a parallel Oracle database.

Cluster Membership: The set of active nodes in a cluster. These are the nodes that are "alive" and have access to shared resources (that is, shared disk). Nodes that are not in the current cluster membership must not have access to shared resources.

Instance: Distributed services typically are made up of several identical components, one on each node of a cluster. One of these components will be called an "instance." For example, an OPS database will have an Oracle instance running on each node.

Process: For the purposes of this interface, a process is a unit of execution. On some operating systems, this may be equivalent to an OS process. On others, it may be equivalent to an OS thread. A process is considered terminated when it can no longer execute, pending OS requests are completed/canceled, and any process-local resources are released.

    Note that the older OPS terms are used in the code, but the terms are also valid for RAC.


    Terminology Translations

Terminology depends on the speaker:
- Product managers to sales or marketing
- Support, technical teams, development

Terminology depends on the version:
- Older terms tend to stay in code
- Variable names and prefixes reflect the older name
- Newer names reflect newer application or functionality

    Terminology Translations

    RAC = OPS. OPS is the older term. See the History slide (#19) in this lesson.

    Row Cache = Dictionary Cache. Row Cache is the older term. It is the SGA area to cache

    database dictionary information. It is a global resource.

    Distributed Lock Manager (DLM) = Global Resource Manager (GRM). DLM is the older

    term; GRM has slightly more functionality. The terms are used for any locking system that

    can handle several processes, typically (but not necessarily) on several nodes.

    DLM = IDLM = UDLM. The DLM term is a very general term, but also refers to the

external operating system-supplied DLM used by Oracle7. IDLM refers to the Integrated DLM introduced in Oracle8. UDLM is the Universal DLM, that is, the reference implementation of a DLM made on the Solaris platform. It is often called by its code reference skgxn-v2.

    Some of the RAC processes have retained their old names but are described with a

    different purpose:

    LMON: Global Enqueue Service Monitor, previously Lock Monitor

    LMD: Global Enqueue Service Daemon, previously Lock Monitor Daemon

    LMS: Global Cache Service Processes, previously Lock Manager Services


    Terminology Translations (continued)

    Terminology in This Course

    This course reflects the mixed usage of similar terms and aligns more with the terminology

    of code than with the externalized names.


    Programmer Terminology

    Client or user: calling code

Callback: routine to execute when the called program has new information

    Programmer Terminology

Inside the code, comments often refer to the programmer's point of view.

    Client and User are used interchangeably, and refer to the calling code.

    Client code can register interest in a service by giving a pointer to a data structure that is to

    be updated or a routine that is to be called, when the service has completed the required

    action.


    History

Real Application Clusters (RAC) is the current product.

RAC has some similarity to Oracle Parallel Server (OPS)

Has same end-user capability; a clustered database

Scales better because of better internal handling of cache coherency

Has some internal, fundamental changes in the global cache

    History

    Oracle Parallel Server (OPS) historically had a bad reputation; it was not scalable. Most

    applications ran slower on an OPS system than on a single instance. There was a need to

    carefully determine which instance performed DML on which tables or (more accurately)

    on which blocks. With RAC this need has been eliminated, resulting in true scalability.

    Although RAC borrows much code from OPS, the official policy is not to mention that

    RAC is an evolved version of OPS. Oracle does not want the bad reputation of OPS to

    adversely affect the reputation of RAC in the market. Internally (in the code), the OPS

    heritage in RAC is evident.


    History Overview

OPS 6 was not in production and was available only on limited platforms.

OPS 7 was platform generic, relying on external DLM.

OPS 8 had Integrated Distributed Lock Manager.

OPS 8i had Cache Fusion Stage 1.

RAC 9i has Cache Fusion Stage 2.

The database layout for different versions has not changed.

    History Overview

    Some components have undergone changes in scope and name. The system that ensures

    that access to a block is coherent is the Global Cache Manager in Oracle9i. In Oracle8i and

    Oracle8, this was the Integrated Distributed Lock Manager. Earlier it was an external

    operating systemsupplied service that the Oracle processes called. The Cluster Group

    Service of Oracle9i and Oracle8i was the Group Membership Services module in Oracle8

    and (before that) part of the external Distributed Lock Manager.

    Although there have been many changes to the architecture in the instance, the database

structure has changed only marginally. Separate redo threads and undo spaces are still used.


    Internalizing Components

[Diagram: Oracle7 versus Oracle8. In Oracle7, the RDBMS calls the external DLM, CM, and operating system through the DLM API; there is no local state in the instance, and callbacks and enqueues go through a simulated callback/enqueue translation. In Oracle8, the IDLM is inside the RDBMS, with callbacks and enqueues working against local state in SGA memory; only the CM and operating system remain external.]

    Internalizing Components

    The development of RAC has internalized more operating system components for each

    version. As an example, the diagram on the slide shows the internalization of the

    Distributed Lock Manager (DLM) in the development of Oracle7 to Oracle8. Instead of

    calling the external operating system whenever any lock status needed checking by the

    DLM API module, the IDLM module in the Oracle server only needs to examine its SGA.

    The RDBMS routines did not in principle need to reflect the change.

    The earlier versions had the DLM external, which limited the functionality (lowest

common denominator effect) that the Oracle server could rely on, and required data to be passed to external services. Data transfer used pipes or network communication to the

    external processes; control for I/O completion used Asynchronous Trap (AST)

    mechanisms, polling mechanisms, or blocked waits. Internal communication inside the

Oracle server, even between the various background processes, can use the common

    SGA memory area that includes latches and enqueues.

    This is merely illustrative and is not an accurate summary of the changes made.

    The Oracle8 to Oracle9i development similarly internalized the GMS interface (that is, the

    Node Monitor (NM) functionality), relying on only the Cluster Manager (CM) interface

    routines.


    Oracle7

The differences between a non-OPS server and an OPS-enabled Oracle server were few:

    Database structure changes

    Separate redo per instance

    Separate undo per instance

    Addition of LCK process in instance

    Oracle7

    OPS in Oracle7 consisted of the database structural changes for cluster operation (as in all

    versions) and the addition of the LCK process that communicated with the external DLM.

    The instances not only coordinated global cache coherency through the DLM but also used

    the DLM as the communication channel for registering into the OPS cluster.

    The method for sending the SCN or other messages was platform specific.

    External DLM

    The external DLM usage had the following characteristics:

    It had to be running before any instance started.

    Resources and locks had to be adequately configured.

    Death of the DLM on a node implied death of all its clients on the node.

    OPS/DLM diagnostics had to have port-specific lock dumps.

    Internode parallel query code had to be port specific.


    Oracle8

    First stage in internalizing cluster communications:

Oracle's own lock manager in the Oracle server

New communication path for clusterwide messages

New background processes LMD and LMON

Cluster state communication through external Group Membership Service (GMS)

    Oracle8

    The internal DLM meant that resource allocation was inside the Oracle server. Diagnostic

    lock dumps no longer needed to be port specific. The Oracle server, version 8 (and later),

    started communicating with the cluster services of the operating system. The interface

    consisted of the GMS that was an Oracle-specified API. The GMS functionality included:

    Supplying each instance with the current set of registered members, clusterwide

    Notifying other members when a member joins or leaves

    Automatically deregistering dead processes/instances from their groups

    Interfacing with the node monitor for cluster events


    Oracle8i

Cache Fusion Stage 1

Read/write blocks sent via interconnect and not through the disk

CR server process BSP

More cluster communication functions as part of Oracle server code

GMS functionality split into Cluster Group Services (CGS) and Node Monitor (NM) in the skgxn v2

Lock Manager structures in shared pool

    Oracle8i

    The Cache Fusion Stage 1 satisfied some types of block requests across the cluster

    communication paths (rather than via disk) and made use of the messaging services.

The Oracle8 GMS has been split into OSD and Oracle kernel components. Node monitor OSD skgxn is extended from monitoring a single client per node to arbitrarily named process groups. The rest of the GMS functionality is moved into Oracle as CGS. A

    distributed name service is added to CGS.

    LMON executes most of the CGS functionality:

Joins the skgxn process group representing the instances of the specified group

Connects to other members and performs synchronization to ensure that all of them have the same view of group membership


    Oracle9i

Cache Fusion Stage 2

Write/write blocks handled concurrently

GCS and GES instead of IDLM

Enhanced instance availability

Instance Member Reconfiguration (IMR)

New recovery features

Enhanced messaging for inter-instance communication

    Oracle9i

    The remainder of this course is based on Oracle9i.


    Summary

In this lesson, you should have learned how to:

Determine whether to use RAC in application design

Describe RAC improvements over its predecessor


    Introduction to RAC Internals


    Objectives

After completing this lesson, you should be able to do the following:

Outline the RAC architecture with internal references

Relate the RAC-related modules to the Oracle code stack


    Simple RAC Diagram

[Diagram: three nodes, each running an instance (SGA and processes), connected to a shared cluster disk/file system and to one another by a high-speed interconnect.]

    Simple RAC Diagram

    The node contains more than just the instance. It includes the operating system, network

    stacks for various protocols, disk software, and a number of Oracle noninstance processes:

    Listener, Intelligent Agent, and the foreground/shadow server processes.

    The instance has its usual complement of background processes (more so with the RAC

    configuration). They connect to the disk system, the network, and the high-speed

    interconnect.

    The cluster disk or file system may be mirrored, RAID-based, SAN/Fiber-based, or JBOD

(just a bunch of disks). If it is a clusterwide file system, it can contain the Oracle home code. The clusterwide disks can be host-managed (that is, the controller is part of the node)

    but are serviced to the cluster and equivalent to clusterwide disks. Local disks are of little

    interest to RAC but are used for noncommon files where the common disks are raw disks.

Note: There are some issues with node-specific files of the Intelligent Agent or password file orapw when using a cluster file system. The solution varies with the platform and the CFS that are used.


    One RAC Instance

SGA contains (but is not limited to):

Library, row, and buffer caches

Global Resource Directory

Other background processes are:

LGWR, SMON, and so on

PQ, Jobs, and so on

Dispatchers and servers

Foreground processes not shown

[Diagram: a node running one instance with its SGA and the background processes LMON, DIAG, LMD, LMS, LCK, DBW0, and PMON; the CM runs on the node outside the instance.]

    One RAC Instance

This is the traditional view of an instance and its background processes. All processes are, however, the same program (oracle.exe or oracle), just instantiated with different startup parameters (see source opirip and WebIV Note:33174.1). On Windows, this is more apparent; there is clearly only one Oracle process showing in the Task Manager, but

    with a number of threads.

    All caches in the SGA are either global and must be coherent across all instances, or they

    are local. The library, row (also called dictionary), and buffer caches are global. The large

and Java pool buffers are local. For RAC, the Global Resource Directory is global in itself and also used to control the coherency.

    The LMON process communicates with its partner process on the remote nodes. Other

    processes may have message exchanges with peer processes on the other nodes (for

    example, PQ). The LMS and LMD processes, for example, may directly receive requests

    from remote processes.

    The Cluster Monitor (CM) system communicates with the other CMs on other nodes and is

    not part of the Oracle RAC instance. But it is a necessary component.


    Internal RAC Instance

kqlm: Library cache (fusion)

kqr: Dictionary/row cache

kcl: Buffer cache

ksi: Instance locks

kjb: Global Cache Service

kju: Global Enqueue Service

CGS: Cluster Group Services

NM: Node Monitor

IPC: Interprocess Communication

[Diagram: inside the instance, the cache layers kql/kqlm, kqr, and kcl sit above ksi, the GCS (kjb) and GES (kju), CGS (kjxg), the NM (skgxn v2), and the IPC layer (skgxp); the NM communicates with the CM on the node.]

    Internal RAC Instance

    This is an internal view of some of the instance code stack and the RAC-relevant sections

    and modules.

    The NM layer is the communication layer to the CM. The IPC services facilitate other

    process to process communication on different instances.

The CGS maintains the state of the RAC cluster, knowing which instances are in the

    cluster and which are not. Contrast this with the node availability.

    The GRD is the data structure that stores Global Enqueue and Global Cache objects; it is

aware of every clusterwide resource. Resources are typically a buffer element, like a data buffer, or a data file, but can also be abstract entities, such as an enqueue or NM resource.

The three buffer caches are used by the various user foreground processes by calling handling routines (kqlm, kqr, kcl) for allocation, deallocation, and locking. The handling routines maintain coherency by using kcl. The data buffer cache is the sole user

    of the GCS.

Note: Other skg-interfaces, such as skgfr (disk I/O), are not shown.


    Oracle Code Stack

OCI: Oracle Call Interface

UPI: User Program Interface

OPI: Oracle Program Interface

KK: Kernel Compilation Layer

KX: Kernel Execution Layer

K2: Kernel Distributed Execution Layer

NPI: Network Program Interface

KZ: Kernel Security Layer

KQ: Kernel Query Layer

RPI: Recursive Program Interface

KA: Kernel Access Layer

KD: Kernel Data Layer

KT: Kernel Transaction Layer

KC: Kernel Cache Layer

KS: Kernel Services Layer

KJ: Kernel Lock Management Layer

KG: Kernel Generic Layer

S: Operating System Dependencies

    Oracle Code Stack

The first few characters of the routine and structure names indicate which layer in the code

stack they come from. For example, a name beginning with kcl belongs to the kernel cache (KC) layer, and one beginning with kjb belongs to the kernel lock management (KJ) layer.


    RAC Component List

    This course examines the following RAC componentlist:

    Cluster Layer and Cluster Manager (CM)

    Node Monitor (NM)

    Cluster Group Services (CGS)

Global Cache Service and Global Enqueue Service (GCS and GES)

    Interprocess Communication (IPC)

    Cache Fusion in the GCS

    Cache Fusion Recovery

RAC Component List

    This course examines the components listed in the slide. This is the stack, with the most

    fundamental module listed first (with some exceptions).


    Module Relation View

[Diagram: module relationships among ORACLE, the DLM (GRD) with its GCS, GES, and DRM/FR components, CGS/IMR, NM, IPC, KSXP, SKGXP, and SKGXN.]

    Module Relation View

    GCS: Global Cache Service, or PCM locks

    GES: Global Enqueue Service, or non-PCM locks

    DRM/FR: Dynamic Resource Mastering/Fast Reconfiguration. Only partially activated in

    a standard Oracle9i Release 2 installation.

    IMR: Instance Membership Recovery. LMON handles instance death and split brain (two

    networks).

KSXP: Multiplexing service (multithreaded layer). Allows DLM to do a lazy send; ksxp informs the client after the send is completed.

NM: Node Monitor. Instances joining and leaving the cluster

    IPC: Interprocess Communication. There is usually a choice of underlying protocols to

    use, depending on the platform and hardware. The default is UDP (light; consumes no

resources/connections); alternatives include memory-mapped I/O (enhancements to the IPC interface used by Cache

Fusion) and port-based communication.

CGS: Cluster Group Service. Handles synchronization of the membership bitmap. Also a name service for

    publishing and querying configuration data. CGS in Oracle9i is changed from earlier

    versions to speed up the reconfiguration.


    Alternate Module Relation View

[Diagram: alternate view relating client code (kcl, ksq, ksi) and PQ to the DLM and CGS, which use KSXP over SKGXP.]


    Module, Code Stack, Process

The same code is present in all foreground and background processes.

Modules may be constrained to run in a specific process.

    Module, Code Stack, Process

    Although the running Oracle server consists of several processes (both foreground and

    background), remember that this is the same program that runs in all processes. Processes

    are limited to performing a set of functions, and thus some code is active in only some

    processes. Thus there is no LMON program module, but some routines in the KJB source

    modules have a comment stating that the function runs only in the LMON process. This is

confusing when examining code in which one process calls another.

Cross-process calls require a message or posting, and execution may have to wait until the

    called process starts executing; in other words, a context switch must occur.
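
The following is a generic illustration, not Oracle code, of why a cross-process call implies a post and a wait: the caller hands work to another execution context and cannot proceed until that context is scheduled and signals completion. POSIX threads stand in for separate processes here.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  posted = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  done   = PTHREAD_COND_INITIALIZER;
static int work_posted = 0, work_done = 0;

/* Stand-in for a background process such as LMON: it sleeps until posted. */
static void *background(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!work_posted)
        pthread_cond_wait(&posted, &lock);   /* wait to be scheduled */
    printf("background: doing the requested work\n");
    work_done = 1;
    pthread_cond_signal(&done);              /* post the caller back */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t bg;
    pthread_create(&bg, NULL, background, NULL);

    /* "Calling" the other process: post it, then wait for completion. */
    pthread_mutex_lock(&lock);
    work_posted = 1;
    pthread_cond_signal(&posted);
    while (!work_done)
        pthread_cond_wait(&done, &lock);     /* context switch happens here */
    pthread_mutex_unlock(&lock);

    pthread_join(bg, NULL);
    printf("caller: work completed by the other context\n");
    return 0;
}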

    On the Windows platform, there is only one process. The various Oracle server processes

    are implemented as threads inside this program.


Operating System Dependencies (OSD)

Code that must be separate for each platform is typically collected in OSD modules.

Generic version: Runs on development system

Reference version: Classic version ported to all platforms

Platform version: Optimized and specialized; several versions may exist.

OSD code is bracketed with #ifdef ... #endif in some modules.

    Operating System Dependencies (OSD)

    This applies to many other Oracle server products or functions but is much more visible

    with RAC.

If the platform dependency is small, it may be bracketed by the #ifdef ... #endif construction; otherwise, a common routine is called in an OSD module, which is

appropriately rewritten for each platform. Such modules are generic. For example, refer to the skgxnr.c module.
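
A hedged sketch of the #ifdef ... #endif style for a small platform difference; the routine name sxyz_gettime is invented and does not correspond to any real OSD symbol.

#include <stdio.h>

#if defined(_WIN32)
#include <windows.h>
/* Windows branch of the platform-dependent routine. */
static double sxyz_gettime(void)
{
    return (double)GetTickCount64() / 1000.0;   /* seconds since boot */
}
#else
#include <sys/time.h>
/* Unix branch of the same routine. */
static double sxyz_gettime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);                    /* seconds since the epoch */
    return tv.tv_sec + tv.tv_usec / 1e6;
}
#endif

int main(void)
{
    printf("time = %f\n", sxyz_gettime());
    return 0;
}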

    For some OSD modules, there may be more than one version. For example, the IPC

implementation has a number of protocols to be used. One OSD module with the same interface is written for each protocol. Only one module is linked to the Oracle server, thus

    deciding the IPC protocol to be used.

    Where several implementations are possible, a reference module is constructed. This is

runnable on all platforms and is the lowest common denominator. It proves functionality

    and is used to verify the correct functionality of the other specialized version of the

    module. However, it may not be used.


Platform-Specific RAC

These are kernel routines, so the names start with K.

Service routines start with KS.

OSD routines start with S or SS.

OSD code is written by the porting groups.

[Diagram: the code stack from the higher layers (SQL, transaction, data) down through the cache (KC*), GES and GCS (KJ*), services (KS*), the generic layer (KG*, common functions), and platform-specific OSD code (S*) to the operating system routines.]

    Platform-Specific RAC

    Many RAC problems are platform specific. The Operating System Dependency (OSD)

layer therefore must be examined for the platform concerned. The subdirectory is called sosd or osds.

    This cannot be examined in TAO with cscope; you need the vobs access.

OSD code is partially available at /export/home/ssupport/920/rdbms/src/server/osds.



    OSD Module: Example

[Diagram: the SKGXP module in three alternative versions. skgxp.h defines the generic interface (1); skgxp.c is the reference implementation; sskgxpu.c is the port-specific UDP implementation; sskgxph.c is the port-specific HMP implementation (HP-UX). Each version calls the OS routines for its protocol (UDP, TCP, or HMP) through the OS API.]

    OSD Module: Example

    A module that needs to call the operating system must be port specific. Calling an I/O

    routine may vary in name, arguments, and other particulars between platforms, even

    though they give the same functionality.

The skgxp module has an official upward API (1). Internally, there are some common functions and one way of achieving the necessary communication function of the SKGXP.

    The UDP option, for example, performs the required OS-related calls through the OS API

    (3) that send, receive, check status, and so on, by using UDP packets. It also possibly has

some code to hide or simulate functions so that the common set (2) is maintained. The functions are similar for the other protocol options.

    The reference implementation is made to compile and work on all platforms, but the whole

    module is additionally rewritten by most platform groups. As explained previously, a

    platform group makes several versions by using different protocols. This is selected at link

    time by using the appropriate library. The HMP module, shown in this example, is only

available on the HP platform.
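
A hedged sketch of the one-interface, several-implementations pattern, compressed into a single file so that it compiles; the names (xport_ops_t, udp_send, and so on) are invented. In the real source tree each variant lives in its own OSD file, and the platform group selects exactly one at link time.

#include <stdio.h>

/* The generic, upward-facing interface (compare the role of skgxp.h). */
typedef struct xport_ops {
    const char *name;
    int (*send)(const void *buf, int len);
    int (*wait)(int handle);
} xport_ops_t;

/* One set of routines per protocol; a real build would link exactly one
 * implementation file, selected by the platform group at link time. */
static int udp_send(const void *buf, int len) { (void)buf; return len; }
static int udp_wait(int handle)               { (void)handle; return 0; }
static const xport_ops_t xport_udp_ops = { "udp", udp_send, udp_wait };

static int ref_send(const void *buf, int len) { (void)buf; return len; }
static int ref_wait(int handle)               { (void)handle; return 0; }
static const xport_ops_t xport_ref_ops = { "reference", ref_send, ref_wait };

int main(void)
{
    /* Stand-in for the link-time choice of library. */
    const xport_ops_t *ops = &xport_udp_ops;
    (void)xport_ref_ops;

    printf("using %s transport, sent %d bytes\n",
           ops->name, ops->send("hi", 2));
    return ops->wait(0);
}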


    OSD Module: Example (continued)

    Dependencies on the OSD Module

For the skgxp module, some OSD variants have additional interfaces callable from higher modules. The kcl module, for example, can call for a special memory map pointer for the HMP protocol. Higher levels in the stack have #ifdef ... #endif bracketed calls to the extended sskgxph.


    Summary

In this lesson, you should have learned about the:

RAC architecture outline with internal references

Relationship between the RAC-related modules and the Oracle code stack


    References

    Main sources for general RAC information:

    RAC Web site

    http://rac.us.oracle.com:7778

    RAC Pack repository on OFO

    http://files.oraclecorp.com/content/AllPublic/Workspaces/RAC%20Pack-Public/

    WebIV

    Check folder Server.HA.RAC


    Cluster Layer

    Cluster Monitor


    Objectives

After completing this lesson, you should be able to:

Describe the generic Cluster Manager (CM) functionality

Outline the interaction between CM and RAC cluster layers


    RAC and Cluster Software

[Diagram: a node running one instance. Inside the instance, the caches and ksi/ksq/kcl sit above the GRD, CGS, NM, and IPC layers; the NM communicates with the CM on the node, and the IPC layer connects to the other nodes (not shown).]

    Cluster Layer in RAC

    The cluster layer is not part of the RAC instance. The Cluster Manager (CM) is part of the

    cluster layer.

    It has its own communication path with the peer cluster software on other nodes. It can

    determine the status of other nodes in the cluster but does not maintain any consistent view.

    Most of the synchronization and consistency is handled in the Node Monitor (NM).


Generic CM Functionality: Distributed Architecture

Local cluster manager daemons

All daemons make up the Cluster Manager

    One daemon elected as master node

    Generic CM Functionality: Distributed Architecture

    Every node in the cluster must have a local CM daemon(s) running. The set of all CM

    daemons makes up the Cluster Manager. The CM daemons on all nodes communicate with

    one another. The CM daemons on all nodes may elect a master node, which is responsible

    for managing cluster state transitions.

Upon communication failure, the remaining CM daemons form a new cluster using an

    established protocol and re-elect a new master if necessary.

    The CM and the RAC cluster are distinct entities acting as physically distinct services. The

CM is responsible for cluster consistency. The CM detects and manages cluster state transitions. The CM coordinates RAC cluster recovery brought about by cluster state

    transitions.
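
A toy illustration, not the actual CM protocol, of re-electing a master after a membership change; choosing the lowest-numbered surviving node is only an example policy.

#include <stdio.h>

#define MAX_NODES 4

/* 1 = daemon alive on that node, 0 = node (or its daemon) has failed. */
static int alive[MAX_NODES] = { 1, 1, 1, 1 };

/* Example policy only: the surviving daemon with the lowest node number
 * becomes master.  The real election protocol is CM-implementation specific. */
static int elect_master(void)
{
    for (int node = 0; node < MAX_NODES; node++)
        if (alive[node])
            return node;
    return -1;   /* no surviving members: no cluster */
}

int main(void)
{
    printf("initial master: node %d\n", elect_master());

    alive[0] = 0;   /* node 0 fails: the remaining daemons re-elect */
    printf("master after failure of node 0: node %d\n", elect_master());
    return 0;
}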


Generic CM Functionality: Cluster State

State change

Cluster Incarnation Number

    Cluster Membership List

    IDLM Membership List

    Generic CM Functionality: Cluster State

    A cluster is said to change state when one or more nodes join or leave the cluster. This

    transition is complete when the cluster moves from a previous stable configuration to a

    new one. Each stable configuration is identified by a number called the cluster incarnation

    number. Every state change in the cluster monotonically increases the cluster incarnation

number.

    The set of all nodes in a cluster form a cluster membership list. The set of all nodes in the

cluster where the RAC IDLM is running form an IDLM membership list. Every node in a

cluster is identified by a node-ID provided by the CM, which remains unchanged during the lifetime of a cluster. The IDLM uses this node-ID to identify and distinguish between

members in the IDLM membership list.
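
A small C sketch of the state described above; the structure and routine names are invented, but the example shows the incarnation number increasing monotonically with each membership change while node-IDs stay fixed.

#include <stdio.h>

#define MAX_NODES 8

typedef struct cluster_state {
    unsigned long incarnation;   /* cluster incarnation number        */
    int member[MAX_NODES];       /* 1 if node-ID is in the membership */
} cluster_state_t;

/* Any join or leave produces a new stable configuration, so the
 * incarnation number is bumped; the node-ID itself never changes. */
static void membership_change(cluster_state_t *cs, int node_id, int joining)
{
    cs->member[node_id] = joining;
    cs->incarnation++;
}

int main(void)
{
    cluster_state_t cs = { 0, { 0 } };

    membership_change(&cs, 0, 1);   /* node 0 joins  */
    membership_change(&cs, 1, 1);   /* node 1 joins  */
    membership_change(&cs, 1, 0);   /* node 1 leaves */

    printf("incarnation %lu, node 0 member=%d, node 1 member=%d\n",
           cs.incarnation, cs.member[0], cs.member[1]);
    return 0;
}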


Generic CM Functionality: Node Failure Detection

Node failure detection

Communication failure detection

Generic CM Functionality: Node Failure Detection

To ensure the integrity of the cluster, the CM must detect node failures. The RAC cluster may

suspect node failure (for example, a communication failure with a node), in which case it may:

    Freeze activity and expect a message from the CM to start reconfiguration

    Inform the CM of an error condition and await reconfiguration notification after a

    new stable cluster state is established

    If the CM and RAC cluster are to detect the same communication failures, CM should

    monitor cluster health on the same physical circuit used by the RAC cluster (for example,

on HP, use of HMP). Performance considerations may require the CM and RAC cluster to use separate virtual circuits.

    If the CM and RAC cluster are using separate physical circuits, the CM should be aware of

the RAC cluster's physical circuit and monitor for cluster health via the same circuit. The

    CM may provide for physical circuit redundancy for failover and performance.

    RAC Cluster reconfiguration is begun after a cluster has reached a new stable state.

    CM must be able to handle nested state transitions and communicate these state

    changes to the RAC cluster.

    Nested cluster transitions interrupt any in-process RAC cluster reconfiguration.


    Cluster Layer and Cluster Manager

RAC cluster registers the instance in the CM.

Primarily the LMON process

Secondarily other I/O-capable processes (DBWR, PQ-slaves, ...)

Obtains Node-ID from cluster

[Diagram: the instance's NM layer communicates with the CM on the node.]

    Cluster Layer and Cluster Manager

    The Cluster Manager is a vendor- or Oracle-provided facility to communicate between all

    the nodes in the cluster about node state. The CM uses a different protocol or channel. It

    uses heartbeat and sanity checks to validate node status. The RAC proces