DSI408 Real Application Clusters Internals



DSI408: Real Application Clusters Internals

    Electronic Presentation

    D16333GC10

    Production 1.0

    April 2003

    D37990


    Copyright 2003, Oracle. All rights reserved.

This documentation contains proprietary information of Oracle Corporation. It is provided under a license agreement containing restrictions on use and disclosure and is also protected by copyright law. Reverse engineering of the software is prohibited. If this documentation is delivered to a U.S. Government Agency of the Department of Defense, then it is delivered with Restricted Rights and the following legend is applicable:

Restricted Rights Legend

Use, duplication or disclosure by the Government is subject to restrictions for commercial computer software and shall be deemed to be Restricted Rights software under Federal law, as set forth in subparagraph (c)(1)(ii) of DFARS 252.227-7013, Rights in Technical Data and Computer Software (October 1988).

This material or any portion of it may not be copied in any form or by any means without the express prior written permission of the Education Products group of Oracle Corporation. Any other copying is a violation of copyright law and may result in civil and/or criminal penalties.

If this documentation is delivered to a U.S. Government Agency not within the Department of Defense, then it is delivered with Restricted Rights, as defined in FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).

The information in this document is subject to change without notice. If you find any problems in the documentation, please report them in writing to Worldwide Education Services, Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free.

Oracle and all references to Oracle Products are trademarks or registered trademarks of Oracle Corporation.

All other products or company names are used for identification purposes only, and may be trademarks of their respective owners.

    Authors

    Xuan Cong-Bui

    John P. McHugh

    Michael Mller

Technical Contributors and Reviewers

    Michael Cebulla

    Lex de Haan

    Bill Kehoe

    Frank Kobylanski

    Roderick Manalac

    Sundar Matpadi

    Sri Subramaniam

    Harald van Breederode

    Jim Womack

    Publisher

    Glenn Austin


    DSI408: Real Application

    Clusters Internals

    Volume 1 - Student Guide

    D16333GC10

    Edition 1.0

    April 2003

    37988


    Copyright 2003, Oracle. All rights reserved.

    This documentation contains proprietary information of Oracle Corporation. It is

    provided under a license agreement containing restrictions on use and disclosure and

    is also protected by copyright law. Reverse engineering of the software is prohibited.

    If this documentation is delivered to a U.S. Government Agency of the Department of

    Defense, then it is delivered with Restricted Rights and the following legend is

    applicable:

    Restricted Rights Legend

    Use, duplication or disclosure by the Government is subject to restrictions for

    commercial computer software and shall be deemed to be Restricted Rights software

    under Federal law, as set forth in subparagraph (c)(1)(ii) of DFARS 252.227-7013,

    Rights in Technical Data and Computer Software (October 1988).

    This material or any portion of it may not be copied in any form or by any means

    without the express prior written permission of Oracle Corporation. Any other copying

    is a violation of copyright law and may result in civil and/or criminal penalties.

    If this documentation is delivered to a U.S. Government Agency not within the

    Department of Defense, then it is delivered with Restricted Rights, as defined in

    FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).

    The information in this document is subject to change without notice. If you find any

    problems in the documentation, please report them in writing to Education Products,

Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free.

    Oracle and all references to Oracle Products are trademarks or registered trademarks

    of Oracle Corporation.

    All other products or company names are used for identification purposes only, and

    may be trademarks of their respective owners.

    Authors

    Xuan Cong-Bui

    John P. McHugh

    Michael Mller

    Technical Contributors

    and Reviewers

    Michael Cebulla

    Lex de Haan

    Bill Kehoe

Frank Kobylanski

Roderick Manalac

    Sundar Matpadi

    Sri Subramaniam

    Harald van Breederode

    Jim Womack

    Publisher

    Glenn Austin


    DSI408: Real Application

    Clusters Internals

    Volume 2 - Student Guide

    D16333GC10

    Edition 1.0

    April 2003

    D37989


    Copyright 2003, Oracle. All rights reserved.

    This documentation contains proprietary information of Oracle Corporation. It is

    provided under a license agreement containing restrictions on use and disclosure and

    is also protected by copyright law. Reverse engineering of the software is prohibited.

    If this documentation is delivered to a U.S. Government Agency of the Department of

    Defense, then it is delivered with Restricted Rights and the following legend is

    applicable:

    Restricted Rights Legend

    Use, duplication or disclosure by the Government is subject to restrictions for

    commercial computer software and shall be deemed to be Restricted Rights software

    under Federal law, as set forth in subparagraph (c)(1)(ii) of DFARS 252.227-7013,

    Rights in Technical Data and Computer Software (October 1988).

    This material or any portion of it may not be copied in any form or by any means

    without the express prior written permission of Oracle Corporation. Any other copying

    is a violation of copyright law and may result in civil and/or criminal penalties.

    If this documentation is delivered to a U.S. Government Agency not within the

    Department of Defense, then it is delivered with Restricted Rights, as defined in

    FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).

    The information in this document is subject to change without notice. If you find any

    problems in the documentation, please report them in writing to Education Products,

Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free.

    Oracle and all references to Oracle Products are trademarks or registered trademarks

    of Oracle Corporation.

    All other products or company names are used for identification purposes only, and

    may be trademarks of their respective owners.

    Authors

    Xuan Cong-Bui

    John P. McHugh

    Michael Mller

    Technical Contributors

    and Reviewers

    Michael Cebulla

    Lex de Haan

    Bill Kehoe

Frank Kobylanski

Roderick Manalac

    Sundar Matpadi

    Sri Subramaniam

    Harald van Breederode

    Jim Womack

    Publisher

    Glenn Austin


Contents

Preface

    I Course Overview DSI 408: RAC Internals

    Prerequisites I-2

    Course Overview I-3

    Practical Exercises I-5

    Section I: Introduction

    1 Introduction to RAC

    Objectives 1-2

    Why Use Parallel Processing? 1-3

    Scaleup and Speedup 1-5

    Scalability Considerations 1-7

    RAC Costs: Synchronization 1-9

    RAC Costs: Global Resource Directory 1-10

RAC Costs: Cache Coherency 1-12

RAC Terminology 1-14

    Terminology Translations 1-16

    Programmer Terminology 1-18

    History 1-19

    History Overview 1-20

    Internalizing Components 1-21

    Oracle7 1-22

    Oracle8 1-23

    Oracle8i 1-24

    Oracle9i 1-25

    Summary 1-26

2 Introduction to RAC Internals

Objectives 2-2

    Simple RAC Diagram 2-3

    One RAC Instance 2-4

    Internal RAC Instance 2-5

    Oracle Code Stack 2-6

    RAC Component List 2-7

    Module Relation View 2-8

    Alternate Module Relation View 2-9

    Module, Code Stack, Process 2-10

Operating System Dependencies (OSD) 2-11

Platform-Specific RAC 2-12

    OSD Module: Example 2-13

    Summary 2-15

    References 2-16


    Section II: Architecture

    3 Cluster Layer: Cluster Monitor

    Objectives 3-2

    RAC and Cluster Software 3-3

    Generic CM Functionality: Distributed Architecture 3-4

    Generic CM Functionality: Cluster State 3-5

    Generic CM Functionality: Node Failure Detection 3-6

    Cluster Layer and Cluster Manager 3-7

    Oracle-Supplied CM 3-8

    Summary 3-9

    4 Cluster Group Services and Node Monitor

    Objectives 4-2

    RAC and CGS/GMS and NM 4-3

Node Monitor (NM) 4-4

RDBMS SKGXN Membership 4-5

    NM Groups 4-6

    NM Internals 4-7

    Node Membership 4-8

    Instance Membership Changes 4-10

    NM Membership Death 4-12

    Starting an Instance: Traditional 4-13

    Starting an Instance: Internal 4-14

    Stopping an Instance: Traditional 4-15

    Stopping an Instance: Internal 4-16

    NM Trace and Debug 4-17

Cluster Group Services (CGS) 4-18

Configuration Control 4-19

    Valid Members 4-20

    Membership Validation 4-23

    Membership Invalidation 4-24

    CGS Reconfiguration Types 4-26

    CGS Reconfiguration Protocol 4-27

    Reconfiguration Steps 4-28

    IMR-Initiated Reconfiguration: Example 4-30

    Code References 4-32

    Summary 4-33

5 RAC Messaging System

Objectives 5-2

    RAC and Messaging 5-3

    Typical Three-Way Lock Messages 5-4

    Asynchronous Traps 5-5

    AST and BAST 5-6

    Message Buffers 5-7

    Message Buffer Queues 5-8


    Messaging Deadlocks 5-9

Message Traffic Controller (TRFC) 5-10

TRFC Tickets 5-11

TRFC Flow 5-13

Message Traffic Statistics 5-15

    IPC 5-18

    IPC Code Stack 5-19

    Reference Implementation 5-20

    KSXP Wait Interface to KSL 5-21

    KSXP Tracing 5-22

    KSXP Trace Records 5-23

    SKGXP Interface 5-24

    Choosing an SKGXP Implementation 5-25

    SKGXP Tracing 5-26

    Possible Hang Scenarios 5-27

Other Events for IPC Tracing 5-28

Code References 5-29

    Summary 5-30

    6 System Commit Number

    Objectives 6-2

    System Commit Number 6-3

    Logical Clock and Causality Propagation 6-4

    Basics of SCN 6-5

    SCN Latching 6-7

    Lamport Implementation 6-8

    Lamport SCN 6-9

Limitations on SCN Propagation 6-10

max_commit_propagation_delay 6-11

    Piggybacking SCN in Messages 6-12

    Periodic Synchronization 6-13

    SCN Generation in Earlier Versions of Oracle 6-14

    Code References 6-15

    Summary 6-16

    7 Global Resource Directory: Formerly the Distributed Lock Manager

    Objectives 7-2

    RAC and Global Resource Directory (GRD) 7-3

    DLM History 7-4

    DLM Concepts: Terminology 7-5

    DLM Concepts: Resources 7-6

    DLM Concepts: Locks 7-7

    DLM Concepts: Processes 7-8

    DLM Concepts: Shadow Resources 7-9

    DLM Concepts: Copy Locks 7-10

    Resource or Lock Mastering 7-11

    Basic Resource Structures 7-12


    DLM Structures 7-13

    Lock Mode Changes 7-16

    Simple Lock Changes on a Resource 7-17

Changes on a Resource with Deadlock 7-18

DLM Functions 7-19

    DLM Functionality in Global Enqueue Service Daemon (LMD0) 7-20

    DLM Functionality in Global Enqueue Service Monitor (LMON) 7-22

    DLM Functionality in Global Cache Service Process (LMS) 7-23

    DLM Functionality in Other Processes 7-24

    Configuring GES Resources 7-25

    Configuring GES Locks 7-26

    Configuring GCS Resources 7-27

    Configuring GCS Locks 7-28

    Configuring DLM processes 7-29

    Logical to Physical Nodes Mapping 7-30

    Buckets to Logical Nodes Mapping 7-31

    Mapping for a New Node Joining the Cluster 7-32

    Remapping When Node Joins 7-34

    Mapping Broadcast by Master Node 7-35

    Master Node Determination for GES 7-36

    Master Node Determination for GCS 7-37

    Dump and Trace of Remastering 7-38

DLM Functions 7-39

kjual Connection to DLM 7-40

kjual Flow 7-42

kjpsod Flow 7-43

DML Enqueue Handling Flow: Example 7-44

Step 1: P1 Locks Table in Share Mode 7-45

    Step 2: P2 Locks Table in Share Mode 7-46

    Step 3: P2 Does Rollback 7-47

    Step 4: P1 Locks Table in Exclusive Mode 7-48

    Step 5: P3 Locks Table in Share Mode 7-49

    Step 6: P1 Does Rollback 7-50

Steps 1 and 2: Code Flow 7-51

Step 1: kjusuc Flow Detail 7-52

Step 2: kjusuc Flow Detail 7-54

Step 3: Code Flow 7-55

Step 3: kjuscl Flow Detail 7-56

Step 4: Code Flow 7-57

Step 4: kjuscv Flow Detail 7-58

Step 5: kjuscv Flow Detail 7-60

Step 6: kjuscl Flow Detail 7-61

    Code References 7-63

    Summary 7-64

    References and Further Reading 7-65


    8 Cache Coherency (Part One): Enqueues/Non-PCM

    Objectives 8-2

    Cache Coherency: Enqueues 8-3

Enqueue Types 8-6

Enqueue Structure 8-7

    Examining Enqueues 8-8

    Enqueues and DLM 8-9

    Source Tree for Non-PCM Lock Flow 8-10

    Lock Modes 8-11

    Lock Compatibility 8-12

    Deadlock Detection: The Classic Deadlock 8-13

    Deadlock Detection: A More General Example 8-15

    Deadlock Detection and Resolution 8-16

    Timeout-Based Deadlock Detection 8-17

    Deadlock Graph Printout 8-18

    Deadlock Flow 8-19

    Deadlock Flow: One Node 8-21

    Deadlock Flow: Two Nodes 8-22

    Parallel DML (PDML) Deadlocks 8-23

    Deadlock Detection Algorithm 8-24

    Deadlock Validation Steps 8-27

    Code References 8-28

    Summary 8-29

    9 Cache Coherency (Part Two): Blocks/PCM Locks

    Objectives 9-2

    Cache Coherency: Blocks 9-3

    Block Cache Contention 9-4

    Earlier Cache Coherency: Oracle8 Ping Protocol 9-5

    Earlier Cache Coherency: Oracle8i CR Server 9-6

    Earlier Cache Coherency: Oracle8i CR Server 9-7

    Oracle9i Cache Fusion Protocol 9-8

    GCS (PCM) Locks 9-9

    PCM Lock Attributes 9-10

    Lock Modes 9-11

    Lock Roles 9-12

    Past Image 9-13

    Local Lock Role 9-14

Global Lock Role 9-15

Block Classes 9-16

    Lock Elements (LE) 9-17

    Allocation of New LE 9-18

    Hash Chain of LE 9-19

    Block to LE Mapping 9-20

    Queues of LE for LMS 9-21

    LMSn Free of LE 9-22

    Cache Fusion Examples: Overview 9-23


    Cache Fusion: Example 1 9-25

    Cache Fusion: Example 2 9-26

    Cache Fusion: Example 3 9-27

Cache Fusion: Example 4 9-28

Cache Fusion: Example 5 9-29

    Cache Fusion: Example 6 9-30

    Cache Fusion: Example 7 9-31

    Cache Fusion: Example 8 9-32

    Cache Fusion: Example 9 9-33

    Cache Fusion: Example 10 9-34

    Cache Fusion: Example 11 9-35

    Views 9-36

    Parameters 9-39

    Summary 9-40

10 Cache Fusion 1: CR Server

Objectives 10-2

    Cache Fusion: Consistent Read Blocks 10-3

    Consistent Read Review 10-4

    Getting a CR Buffer 10-5

    Getting a CR Buffer in Oracle9i Release 2 10-7

    CR Server in Oracle9i Release 2 10-8

    CR Requests 10-9

    Light Work Rule 10-11

    Fairness 10-12

    Statistics 10-13

Wait Events 10-14

Fixed Table X$KCLCRST Statistics 10-15

    CR Requestor-Side Algorithm 10-16

    CR Requestor-Side AST Delivery 10-21

    CR Requestor-Side CR Buffer Delivery 10-22

    CR Server-Side Algorithm 10-23

    Summary 10-27

    11 Cache Fusion 2: Current Block: XCUR

    Objectives 11-2

    Cache Fusion: Current Blocks 11-3

    PCM Locks and Resources 11-4

Fusion: Long Example 11-5

Initial State 11-7

    Step 1: Instance 3 Performs SELECT 11-8

    Lock Changes in Instance 3 11-9

    Lock Changes in Instance 2 11-10

    Step 2: Instance 2 Performs SELECT 11-11

    Lock Changes in Instance 2 11-12

    Step 3: Instance 2 Performs UPDATE 11-13

    Lock Changes in Instance 2 11-14


    Lock Changes in Instance 3 11-15

    Step 4: Instance 1 Performs UPDATE 11-16

    Lock Changes in Instance 2 11-17

Lock Changes in Instance 1 11-18

Step 5: Instance 3 Performs SELECT 11-19

    Lock Changes in Instance 3 11-20

    Step 6: Instance 1 Performs WRITE 11-21

    Lock Changes in Instance 2 11-22

    Lock Changes in Instance 1 11-23

    Tables and Views 11-24

    Summary 11-26

    12 Cache Fusion Recovery

    Objectives 12-2

Non-Cache Fusion OPS and Database Recovery 12-3

Cache Fusion RAC and Database Recovery 12-4

Overview of Fusion Lock States 12-5

    Instance or Crash Recovery 12-6

    SMON Process 12-7

    First-Pass Log Read 12-8

    Block Written Record (BWR) 12-9

    BWR Dump 12-10

    Recovery Set 12-11

Recovery Claim Locks 12-12

IDLM Response to RecoveryClaimLock Message on PCM Resource 12-13

    No Lock Held by Recovering Instance on the PCM Resource 12-14

    Recovery Claim Locks 12-15

    Second-Pass Log Read 12-17

    Large Recovery Set and Partial IR Lock Mode 12-19

    Lock Database Availability During Recovery 12-22

    Handling BASTs on Recovery Buffers 12-23

    IR of Nonfusion Blocks 12-24

    Failures During Instance Recovery 12-26

    Memory Contingencies 12-28

    Code References 12-29

    Summary 12-31

    Section III: Platforms

13 Linux Platform

Objectives 13-2

    Linux RAC Architecture 13-3

    Storage: Raw Devices 13-4

    Extended Storage 13-5

    Linux Cluster Software 13-6

    OCMS 13-7

    OCMS Components 13-8


    WDD, NM, and CM Flow (Up to version 9.2.0.1) 13-9

    Watchdog Daemon 13-10

    Hangcheck, NM, and CM Flow (After version 9.2.0.2) 13-11

    Hangcheck Module 13-12

    Node Monitor (NM) 13-13

    Cluster Manager 13-14

    Linux Port-Specific Code 13-15

    Cluster Manager 13-16

    skgxpt and skgxpu 13-17

    Installing RAC on Linux 13-18

    Running RAC on Linux 13-21

    Starting CM 13-22

    Starting WDD 13-23

Starting NM 13-24

Starting CM 13-25

    Debugging 13-26

    Summary 13-27

    References 13-28

    14 HP-UX Platform

    Objectives 14-2

    HP-UX RAC Architecture 14-3

    HP-UX Cluster Software 14-4

    HP-UX Port-Specific Code 14-5

SKGXP (UDP Implementation) 14-6

SKGXP: Lowfat 14-7

    Installing RAC on HP-UX 14-8

    Running RAC on HP-UX 14-9

    Debugging on HP-UX 14-10

    Summary 14-11

    15 Tru64 Platform

    Objectives 15-2

    Tru64 RAC Architecture 15-3

    Shared Disk Systems 15-4

Tru64 Cluster Software 15-5

Tru64 Port-Specific Code 15-6

    Node Monitor: SKGXN 15-7

    IPC: SKGXP 15-8

    SKGXPM: RDG 15-9

    Installing RAC on Tru64 15-11

    Debugging on Tru64 15-12


    Useful Tru64 Commands 15-13

    Summary 15-15

16 AIX Platform

Objectives 16-2

    AIX RAC Architecture 16-3

    AIX SP Clusters 16-4

    AIX HACMP Clusters 16-5

    AIX Cluster Software 16-6

    AIX Cluster Layer 16-7

    AIX Port-Specific Code 16-8

    RAC on AIX Stack 16-9

    Node Monitor (NM) 16-10

    Installing RAC on AIX 16-12

    Debugging on AIX 16-14

    Summary 16-15

    References 16-16

    17 Other Platforms

    Objectives 17-2

    RAC Architecture: Solaris 17-3

    RAC Architecture: Windows 17-4

    RAC Architecture: OpenVMS 17-5

    Port-Specific Code 17-6

    Installing RAC 17-7

    Summary 17-8

    Section IV: Debug

    18 V$ and X$ Views and Events

    Objectives 18-2

    V$ and GV$ Views 18-3

    List of Views 18-4

    Old and New Views 18-5

    V$ Views for Lock Information 18-6

    X$ Tables 18-7

    Events 18-8

    19 KST and X$TRACE

    Objectives 19-2

    KST: X$TRACE 19-3

    KST Concepts 19-4

    KST Concepts 19-6

    Circular Buffer 19-7


Data Structure kstrc 19-8

    Trace Control Interfaces 19-9

    KST Initialization Parameters 19-10

    KST Trace Control Interfaces 19-12

    KST Fixed Table Views 19-14

    KST Trace Output 19-15

    KST Current Instrumentation 19-18

    KST Performance 19-19

    KST: Examples 19-20

    KST Sample Trace File 19-24

    KST Demonstration 19-25

    DIAG Daemon 19-26

    DIAG Daemon: Features 19-27

DIAG Daemon: Design 19-29

DIAG Daemon: Startup and Shutdown 19-33

    DIAG Daemon: Crash Dumping 19-34

    Summary 19-36

    20 ORADEBUG and Other Debugging Tools

    Objectives 20-2

    ORADEBUG 20-3

    Flash Freeze 20-5

    LKDEBUG 20-6

    NSDBX 20-7

HANGANALYZE 20-8

Summary 20-9

    References 20-10

    Appendix A: Practices

    Appendix B: Solutions



    Course Overview

    DSI 408: RAC Internals


    Prerequisites

Before taking this course, you should have:
- Taken DSI 401, 402, and 403 so that you know about the server internals on crashes, dumps, transactions, block handling, and recovery systems
- Taken the Real Application Clusters (RAC) administration course so that you know about the external view of RAC
- Performed at least one RAC installation and assisted in at least one RAC debugging case

    Prerequisites

    The prerequisites ensure that the course is useful to you, instead of being too hard, and that

    the instructor need not cover basic material.

    You must have your TAO account ready for examining source code.


    Course Overview

The course includes the following four sections:
- Introduction
- Architecture
- Platforms
- Debug

Subjects that are not covered include:
- Utilities (srvctl, OCFS, HA)
- Performance tuning
- Pre-Oracle9i versions (OPS)

    Course Overview

    This course contains four sections. It is scheduled to take four days but does not require

    one day per section. Most of the time is spent on the Architecture section.

    Introduction

    The Introduction section provides a summary of the public RAC architecture and its

    accurate terminology. An overview of architecture changes between versions is also given.

    Architecture

    The Architecture section covers the theory of operation of RAC. The RAC code stack is

    examined from the bottom up. There are many references to the source code.

    Platforms

    The Platforms section covers the differences and architectural details of RAC

    implementation on different platforms. Installation issues and known gotchas are

    included.


    Course Overview (continued)

    Debug

    The Debug section provides a detailed explanation of the trace and dump mechanisms that

    are placed inside RAC for fault location. A number of practical exercises use these

    mechanisms.

    Subjects not Covered

    This course does not cover utility modules that are not part of the primary core RAC

    functionality. It also does not cover some of the external programs that RAC depends on.

    Performance is not covered as a separate topic. The knowledge from this course should be

    sufficient to identify performance bottlenecks that are purely relevant to RAC; otherwise,

    tuning is the same as for a single instance.

    For versions of Oracle Parallel Server, you should review earlier courses. In earlier courses,

    the differences between RAC and OPS are pointed out, whereas the RAC knowledge in

    this course is not applicable to OPS.


    Practical Exercises

    The course includes practical exercises. Exercises run on a shared Solaris cluster.

    Practical Exercises

The cluster hardware is shared between students and other classes; this prevents practices that involve node shutdown or breaking the interconnect.


Section I: Introduction

[Diagram: two RAC instances, each with a SQL layer, buffer cache, GES/GCS, CGS, and Node Monitor stacked on the Cluster Manager, communicating with each other over IPC.]


    Introduction to RAC


    Objectives

After completing this lesson, you should be able to do the following:

Review the design objectives of Real Application Clusters (RAC)

    Relate Oracle9i RAC to its predecessors


    Why Use Parallel Processing?

    Scaleup: Increased throughput

Speedup: Increased performance or faster response

    Higher availability

    Support for a greater number of users

    Why Use Parallel Processing?

    Scaleup: Increased Throughput

    Parallel processing breaks a large task into smaller subtasks that can be performed

    concurrently. With tasks that grow larger over time, a parallel system that also grows (or

    scales up) can maintain a constant time for completing the same task.

    Speedup: Increased Performance

    For a given task, a parallel system that can scale up improves the response time for

    completing the same task.

For decision support system (DSS) applications and parallel queries, parallel processing decreases the response time.

    For online transaction processing (OLTP) applications, speedup cannot be expected

    due to the overhead of synchronization. Depending on the precise circumstances, a

    decrease in performance can occur.


    Why Use Parallel Processing? (continued)

    Higher Availability

    Because each node running in the parallel system is isolated from other nodes, a single node

    failure or crash should not cause other nodes to fail. Other instances in the parallel server

    environment remain up and running.

The operating system's failover capabilities and the fault tolerance of the distributed cluster software are important infrastructure components.

    Support for a Greater Number of Users

    Each node can support several users because each node has its own set of resources, such as

    memory, CPU, and so on. As nodes are added to the system, more users can also be added,

    allowing the system to continue to scale up.


    Scaleup and Speedup

[Diagram: the original system completes 100% of a task in a given time on one set of hardware. With scaleup, adding hardware lets the cluster complete up to 200% or 300% of the task in the same time. With speedup, two sets of hardware each handle 50% of the task, completing it in less time.]

    Scaleup and Speedup

    Scaleup

    Scaleup is the capability of providing continued increases in throughput in the presence of

    limited increases in processing capability while keeping the time constant:

Scaleup = (volume parallel) / (volume original) - time for interprocess communication

    For example, if 30 users consume close to 100% of the CPU during their normal

    processing, adding more users would cause the system to slow down due to contention for

    limited CPU cycles. By adding CPUs, however, extra users can be supported without

degrading performance.

Speedup

    Speedup is the capability of providing continued increases in speed in the presence of

    limited increases in processing capability while keeping the task constant:

Speedup = (time original) / (time parallel) - time for interprocess communication

    Speedup results in resource availability for other tasks. For example, if queries normally

    take 10 minutes to process, and running in parallel reduces the time to 5 minutes, then

    additional queries can run without introducing the contention that might occur if they were

    to run concurrently.
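As a quick check of the formula with the numbers from this example (and ignoring the interprocess communication term):

Speedup = (time original) / (time parallel) = 10 minutes / 5 minutes = 2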


Scaleup and Speedup (continued)

Speedup (continued)

Example 1: A particular application might take N seconds to fully scan and produce a summary of a 1 GB table.

With scaleup, if the table doubles in size, then doubling hardware resources should allow the query to still complete in N seconds.

With speedup, if the table does not grow in size, doubling the hardware resources should allow the query to complete in N/2 seconds.

    Example 2: A particular application might have 100 users, each getting a three-second

    response on queries.

    With scaleup, if the number of users doubles in size, then doubling hardware resources

    should allow response time to remain at three seconds.

    With speedup, if the number of users remains the same, doubling the hardware resources

    should reduce the response time. This occurs only if the three-second activity can be

    broken down into two separate activities that can run independently of each other.

    A Success Example of Scaleup

    The following testimonial is from the internal RAC mailing list. This was a response to

    a question about the ease of changing a single instance to an RAC system.

    Just yesterday, we tested with a customer a migration from single instance to two-node

    RAC on Solaris. They were using Veritas DBE/AC for the cluster system.

    These are the steps we took:

    1. Node 1 Server running 9i single instance at approx 80% CPU load.

2. Connection through Transparent Application Failover with 40 retries and a delay of five seconds (a sample tnsnames.ora entry with these settings appears after these steps).

    3. Alter shared initialization file to set Cluster Database = true and add extra

    parameters for the second node (bdump location and so on).

    4. Shut down Database on Node 1.

    5. Start up Database on Node 2 using new initialization file.

    6. Start up Database on Node 1 using new initialization file.

    At this point we had 85% of users on Node 1 and 15% on Node 2.

    7. Run a script to disconnect sessions on Node 1 to allow them to load balance across

to Node 2.

At this point we had 50% of users on Node 1 and 50% on Node 2. The database was no

    longer highly loaded and we were able to add more (now load-balanced) users.

    The application was written in Java and was TAF-aware (i.e., it knew to retry transactions

    with certain warning messages). Once we added the second node, the TPMs per Node

    remained approximately the same so we had over 1.9 x improvement in TPMs, which was

    pretty good scaling.
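For reference, the TAF behavior mentioned in step 2 (40 retries, five-second delay) is normally configured in the client's tnsnames.ora entry. The following is only an illustrative sketch; the alias, host names, and service name are invented and were not part of the test described above:

RAC_TAF =
  (DESCRIPTION =
    (LOAD_BALANCE = ON)
    (FAILOVER = ON)
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = node2)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = racdb)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 40)(DELAY = 5))
    )
  )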


Scalability Considerations

Hardware: Disk I/O

Internode communication: High bandwidth and low latency

Operating system: Number of CPUs (for example, SMP)

    Cache Coherency and the Global Cache Service

    Database: Design

    Application: Design

    Scalability Considerations

    It is important to remember that if any of these six areas are not scalable (no matter how

    scalable the other areas are), parallel cluster processing may not be successful.

    Hardware scalability: High bandwidth and low latency offer the maximum scalability.

    A high amount of remote I/O may prevent system scalability, because remote I/O is

    much slower than local I/O.

    Bandwidth of the communication interface is the total size of messages that can be

    sent per second. Latency of the communication interface is the time required to place

    a message on the interconnect. It indicates the number of messages that can be put on

    the interconnect per unit of time.

    Operating system: Nodes with multiple CPUs and methods of synchronization in the

    OS can determine how well the system scales. Symmetric multiprocessing can

    process multiple requests to resources concurrently.


Scalability Considerations (continued)

"The processes that manage local resource coordination in a cluster database are

    identical to the local resource coordination processes in single instance Oracle. This

    means that row and block level access, space management, system change number

    (SCN) creation, and data dictionary cache and library cache management are the

    same in Real Application Clusters as in single instance Oracle. If the resource is

modified by more than one instance, then RAC performs further synchronization on a global level to permit shared access to this block across the cluster. Synchronization

    in this case requires intranode messaging as well as the preparation of consistent read

    versions of the block and the transmission of copies of the block between memory

    caches within the cluster database." (See Oracle9i Real Application Clusters

Concepts Release 2 (9.2), Part Number A96597-01, Chapter 5, Real Application

    Clusters Resource Coordination.)

    Database scalability: Database scalability depends on how well the database is

    designed (for example, how the data files are arranged, how well the locks are

    allocated, and how well the objects are partitioned).

    Scalability of the application: Application design is one of the keys to taking

    advantage of the other elements of scalability. Regardless of how well the hardware

    and database scale, parallel processing does not work as desired if the application

    does not scale.

    A typical cause for the lack of scalability is one common shared resource that must be

    accessed often. This causes the otherwise parallel operations to serialize on this bottleneck.

    A high latency in the synchronization increases the cost of synchronization, counteracting

    the benefits of parallelization. This is a general limitation and not a RAC-specific

    limitation.


    RAC Costs: Synchronization

To scale, there is a cost in synchronization: Scalability = Synchronization

Less synchronization = Speedup and scaleup

Synchronization is necessary to maintain cache coherency in RAC.

    RAC Costs: Synchronization

    Synchronization is a necessary part of parallel processing, but for parallel processing to be

    advantageous, the cost of synchronization must be determined.

    Synchronization provides the coordination of concurrent tasks and is essential for parallel

    processing to maintain data integrity or correctness. Proper locking between disjoint SGAs

    (Oracle instances) must be maintained to ensure correct data. This is cache coherency.

    Partitioning can help reduce synchronization costs because there are fewer

    concurrent tasks (that is, fewer concurrent users modifying the same set of data).

An application that modifies a small set of data can cause a high overhead for synchronization if performed in disjoint SGAs.

    Contention occurs between instances using a single block or row, such as a table with

    one row that is used to generate sequence numbers.

    Two ways to synchronize:

    Locks: Latches, enqueues, locks

    Messages: Send/wait for messages

Synchronization = Amount x Cost

Amount: How often do you need to synchronize?

    Cost: How expensive is it to synchronize?


Levels of Synchronization

- Row level (database): the Oracle row-locking feature
  - Maximizes concurrency
  - SCN coherency
- Local cache level (intra-instance): every buffer in the cache is protected by logical semaphores (spin latches)
  - Access to buffers is synchronized (CACHE BUFFERS CHAINS, CACHE BUFFER HANDLES latches)
- Global Cache Fusion (inter-instance DLM): every buffer in every cache is tracked by the GCS
  - Cache coherency / cache consistency
  - Global Resource Directory managed by the Global Cache Service (GCS) (the DLM in pre-9i)

Cache coherency: the synchronization of data in multiple caches so that reading a memory location by way of any cache will return the most recent data written to that location by way of any other cache. Sometimes called cache consistency.


Levels of Synchronization: Row Level

[Diagram: two foreground processes (fg1, fg2) in one instance update row1 and row2 in blocks 100 and 101 of the database; the database, instance, and global cache (iDLM) layers are shown.]

Enqueues are local locks that serialize access to various resources. This wait event indicates a wait for a lock that is held by another session (or sessions) in a mode incompatible with the requested mode. See the V$LOCK reference for details of which lock modes are compatible with which. Enqueues are usually represented in the format "TYPE-ID1-ID2", where:

"TYPE" is a 2-character text string

"ID1" is a 4-byte hexadecimal number

"ID2" is a 4-byte hexadecimal number

A query sketch showing these fields follows.


Levels of Synchronization: Local Cache

[Diagram: two foreground processes (fg1, fg2) in the same instance update row1 and row2 through the buffer cache; access to the cached copies of blocks 100 and 101 is synchronized within the local cache, below the global cache (iDLM) layer.]


Levels of Synchronization: Global Cache

[Diagram: foreground processes on two instances update row1 and row2 in their own buffer caches; the Global Resource Directory in the global cache (iDLM) coordinates access to blocks 100 and 101.]

Global resources: inter-instance synchronization mechanisms that provide cache coherency for Real Application Clusters. The term can refer to both Global Cache Service (GCS) resources and Global Enqueue Service (GES) resources.


    We need a cache

[Diagram: foreground processes serialize their access to blocks directly on the database disks.]

Sequencing operations guarantees consistency of the data, but it minimizes the level of concurrency of the system, and the time to complete a sequence of operations depends on the slowest element: the disks. Serialization is the easiest method to manage concurrency but, conversely, it costs in terms of system throughput. Evolutions of Oracle minimize the set of tasks that are serialized. Given a set of tasks [T1, T2, ..., Tn] that arrive at times [t1, ...]


Coherency

The systems reach a maximum level of concurrency.

[Diagram: block 100 is on disk at SCN 800, and the resource (Res: 1, 0x100) is held SS while the block is cached in two buffer caches; fg1 selects row1 with a snapshot starting at SCN 900, and fg2 selects row2 with a snapshot starting at SCN 1010.]

Example: ALTER SYSTEM DUMP DATAFILE 5 BLOCK 4690;

Syntax:
ALTER SYSTEM DUMP DATAFILE {'filename' | filenumber}
  { BLOCK {blockno} | BLOCK MIN {blockno} BLOCK MAX {blockno} }

Note: the block dump reports the buffer cache copy of the block if the block is CURRENT/dirty in the current instance.

alter session set events 'immediate trace name BUFFER level <level>';


Coherency costs of locks


    Fixed*/Releasable 1:M lock model (static)

[Diagram: a single global cache (iDLM) lock covers several database blocks (100 through 104) for the instance.]

(*) Starting with Oracle9i, the fixed locking mode was removed.

    GC_FILES_TO_LOCKS = 1=100:2=0:3=1000:4-5=0EACH

GC_FILES_TO_LOCKS = {file_list = lock_count [!blocks] [EACH] [:...]}
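Reading the example value above against this syntax (an illustrative interpretation; it assumes the documented convention that a lock count of 0 means releasable locks and that EACH applies the count to each file in the list):

# init.ora fragment (illustrative only)
# 1=100      -> datafile 1 is covered by 100 PCM locks hashed across its blocks
# 2=0        -> datafile 2 uses releasable locks, allocated as needed
# 3=1000     -> datafile 3 is covered by 1000 PCM locks
# 4-5=0EACH  -> datafiles 4 and 5 each use releasable locks
GC_FILES_TO_LOCKS = "1=100:2=0:3=1000:4-5=0EACH"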

PCM lock names

Type is always BL (because PCM locks are buffer locks).

ID1 is the block class (described in Classes of Blocks).

ID2: For fixed locks, ID2 is the lock element (LE) index number obtained by hashing the block address (see the GV$LOCK_ELEMENT/GV$GC_ELEMENT fixed view). For releasable locks, ID2 is the database address of the block.

Non-PCM locks

CF  Controlfile Transaction
CI  Cross-Instance Call Invocation
DF  Datafile
DL  Direct Loader Index Creation
DM  Database Mount
DX  Distributed Recovery
FS  File Set
KK  Redo Log Kick
IN  Instance Number
IR  Instance Recovery
IS  Instance State
MM  Mount Definition
MR  Media Recovery
ST  Space Management Transaction
IV  Library Cache Invalidation
L[A-P]  Library Cache Lock
N[A-Z]  Library Cache Pin
Q[A-Z]  Row Cache
PF  Password File
PR  Process Startup
PS  Parallel Slave Synchronization
RT  Redo Thread
SC  System Commit Number
SM  SMON
SN  Sequence Number
SQ  Sequence Number Enqueue
SV  Sequence Number Value
TT  Temporary Table


    False Pinging

[Diagram: under the 1:M model, one lock element (LE 23) covers several blocks (dba 101, 103, 105) that are dirty in the instance's buffer cache while fg1 is updating.]

If another instance needs access to dba 100, the owning instance must ping (write out) all the dirty blocks that are covered by the same LE.


    Releasable 1:1 lock model (dynamic)

[Diagram: under the 1:1 releasable model, each cached block (dba 101, 103, 105) is covered by its own lock element (for example, LE 100 and LE 105) while fg1 is updating.]

break on GC_ELEMENT_NAME

select inst_id, gc_element_name, class, mode_held
  from gv$gc_element
 where gc_element_name > 20970000
 order by gc_element_name;

   INST_ID GC_ELEMENT_NAME      CLASS  MODE_HELD
---------- --------------- ---------- ----------
         1        20971522          0          5
         1        20971523          0          5
         1        20971913          0          3
         1        20971914          0          3
         1        20976209          0          3
         2                          0          3
         1        20976210          0          0
         2                          0          5

(GC_ELEMENT_NAME is the data block address of the block; written in hex it splits into File#,Block#.)
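To translate a GC_ELEMENT_NAME (a data block address) into a file and block number, the standard DBMS_UTILITY functions can be used; an illustrative query that takes one of the values from the output above:

select dbms_utility.data_block_address_file(20971522)  as file#,
       dbms_utility.data_block_address_block(20971522) as block#
  from dual;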


    Scalability

Scaleup

Scaleup is the capability to provide continued increases in throughput in the presence of limited increases in processing capability while keeping time constant:

Scaleup = (volume parallel) / (volume original)

Speedup

Speedup is the capability to provide continued increases in speed in the presence of limited increases in processing capability, while keeping the task constant:

Speedup = (time original) / (time parallel)


    RAC Costs: Global Resource Directory

Single instance: Synchronization of concurrent tasks and access to shared resources

Global Resource Directory (GRD) to record information about how resources are used within a cluster database. The Global Cache Service (GCS) and Global Enqueue Service (GES) manage the information in this directory. Each instance maintains part of the global resource directory in the System Global Area (SGA).

    RAC Costs: Global Resource Directory

    In single-instance environments, locking coordinates access to a common resource, such as

    a row in a table. Locking prevents two processes from changing the same resource (or row)

    at the same time.

    In RAC environments, internode synchronization is critical because it maintains proper

    coordination between processes on different nodes, preventing them from changing the

    same resource at the same time. Internode synchronization guarantees that each instance

    sees the most recent version of a block in its buffer cache.


RAC Costs: Global Resource Directory (continued)

    Resource coordination within Real Application Clusters occurs at both an instance level

    and at a cluster database level. Instance level resource coordination within Real

    Application Clusters is referred to as local resource coordination. Cluster level

    coordination is referred to as global resource coordination.

    The processes that manage local resource coordination in a cluster database are identical to

the local resource coordination processes in single instance Oracle. This means that row and block level access, space management, system change number (SCN) creation, and

    data dictionary cache and library cache management are the same in Real Application

    Clusters as in single instance Oracle.

    If the resource is modified by more than one instance, then RAC performs further

    synchronization on a global level to permit shared access to this block across the cluster.

    Synchronization in this case requires intranode messaging as well as the preparation of

    consistent read versions of the block and the transmission of copies of the block between

    memory caches within the cluster database." (See Oracle9i Real Application Clusters

Concepts Release 2 (9.2), Part Number A96597-01, Chapter 5, Real Application Clusters Resource Coordination.)

    Note: Global Cache Service (GCS) and Global Enqueue Service (GES) do not interfere

    with row-level locking and vice versa. Row-level locking is a transaction feature.


    RAC Costs: Cache Coherency

Cache coherency is the technique of keeping multiple copies of an object consistent between different Oracle instances.

    RAC Costs: Cache Coherency

    Maintaining cache coherency is an important part of a cluster. Cache coherency is the

    technique of keeping multiple copies of an object consistent between different Oracle

    instances (or disjoint caches) on different nodes.

    Global cache management ensures that access to a master copy of a data block in an SGA

    is coordinated with the copy of the block in other SGAs.

    Therefore, the most recent copy of a block in all SGAs contains all changes that are made

    to that block by any instance in the system, regardless of whether those changes have been

    committed on the transaction level. Full redo protection of the block changes is maintained.


    RAC Costs: Cache Coherency

[Diagram: three nodes, each running an instance (A, B, C) with its own SGA; the GES/GCS processes on each node coordinate the caches across the cluster.]

    RAC Costs: Cache Coherency (continued)

    The cost (or overhead) of cache coherency is the need before any access to a specific

    shared resource to first check with the other instances whether this particular access is

    permitted. The algorithms optimize the need to coordinate on each and every access, but

    some overhead is incurred.

    The GCS tracks the locations, modes, and roles of data blocks. The GCS therefore also

    manages the access privileges of various instances in relation to resources. Oracle uses the

    GCS for cache coherency when the current version of a data block is in one instance's

buffer cache and another instance requests that block for modification. If an instance reads a block in exclusive mode, then in subsequent operations multiple transactions within the

    instance can share access to a set of data blocks without using the GCS. This is true,

    however, only if the block is not transferred out of the local cache. If the block is

    transferred out of the local cache, then the GCS updates the Global Resource Directory

that the resource has a global role; whether the resource's mode converts from exclusive to

    another mode depends on how other instances use the resource.
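One way to observe these block copies from SQL is the standard V$BH fixed view, whose STATUS column distinguishes current, consistent-read, and past-image buffers; an illustrative query:

select status, count(*)
  from v$bh
 group by status;
-- typical status values: xcur (exclusive current), scur (shared current),
-- cr (consistent read copy), pi (past image), free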


    RAC Terminology (continued)

    Data buffer cache blocks are the most obvious and most heavily used global resource.

    There are other data item resources that are global in the cluster, such as transaction

    enqueues and database data structures. The data buffer cache blocks are handled by the

    Global Cache Service (GCS), and Parallel Cache Management (PCM). The nondata

    block resources are handled by Global Enqueue Services (GES), also called Non-Parallel Cache Management (non-PCM).

    The Global Resource Manager (GRM) keeps the lock information valid and correct

    across the cluster.

From the module skgxn.h:

Node: An individual computer with one or more CPUs, some memory, and access to disk storage (generally capable of running an instance of OPS).

Cluster: A collection of loosely coupled nodes that support a parallel Oracle database.

Cluster Membership: The set of active nodes in a cluster. These are the nodes that are "alive" and have access to shared resources (that is, shared disk). Nodes that are not in the current cluster membership must not have access to shared resources.

Instance: Distributed services typically are made up of several identical components, one on each node of a cluster. One of these components will be called an "instance." For example, an OPS database will have an Oracle instance running on each node.

Process: For the purposes of this interface, a process is a unit of execution. On some operating systems, this may be equivalent to an OS process. On others, it may be equivalent to an OS thread. A process is considered terminated when it can no longer execute, pending OS requests are completed/canceled, and any process-local resources are released.

    Note that the older OPS terms are used in the code, but the terms are also valid for RAC.


    Terminology Translations

Terminology depends on the speaker:
- Product managers to sales or marketing
- Support, technical teams, development

Terminology depends on the version:
- Older terms tend to stay in code
- Variable names and prefixes reflect the older name
- Newer names reflect newer application or functionality

    Terminology Translations

    RAC = OPS. OPS is the older term. See the History slide (#19) in this lesson.

    Row Cache = Dictionary Cache. Row Cache is the older term. It is the SGA area to cache

    database dictionary information. It is a global resource.

    Distributed Lock Manager (DLM) = Global Resource Manager (GRM). DLM is the older

    term; GRM has slightly more functionality. The terms are used for any locking system that

    can handle several processes, typically (but not necessarily) on several nodes.

    DLM = IDLM = UDLM. The DLM term is a very general term, but also refers to the

external operating system-supplied DLM used by Oracle7. IDLM refers to the Integrated DLM introduced in Oracle8. UDLM is the Universal DLM, that is, the reference implementation of a DLM made on the Solaris platform. It is often called by its code reference skgxn-v2.

    Some of the RAC processes have retained their old names but are described with a

    different purpose:

    LMON: Global Enqueue Service Monitor, previously Lock Monitor

    LMD: Global Enqueue Service Daemon, previously Lock Monitor Daemon

    LMS: Global Cache Service Processes, previously Lock Manager Services


    Terminology Translations (continued)

    Terminology in This Course

    This course reflects the mixed usage of similar terms and aligns more with the terminology

    of code than with the externalized names.


    Programmer Terminology

    Client or user: calling code

Callback: routine to execute when the called program has new information

    Programmer Terminology

Inside the code, comments often refer to the programmer's point of view.

    Client and User are used interchangeably, and refer to the calling code.

    Client code can register interest in a service by giving a pointer to a data structure that is to

    be updated or a routine that is to be called, when the service has completed the required

    action.


    History

Real Application Clusters (RAC) is the current product.

RAC has some similarity to Oracle Parallel Server (OPS)

Has same end-user capability; a clustered database

Scales better because of better internal handling of cache coherency

Has some internal, fundamental changes in the global cache

    History

    Oracle Parallel Server (OPS) historically had a bad reputation; it was not scalable. Most

    applications ran slower on an OPS system than on a single instance. There was a need to

    carefully determine which instance performed DML on which tables or (more accurately)

    on which blocks. With RAC this need has been eliminated, resulting in true scalability.

    Although RAC borrows much code from OPS, the official policy is not to mention that

    RAC is an evolved version of OPS. Oracle does not want the bad reputation of OPS to

    adversely affect the reputation of RAC in the market. Internally (in the code), the OPS

    heritage in RAC is evident.


    History Overview

OPS 6 was not in production and was available only on limited platforms.

OPS 7 was platform generic, relying on external DLM.

OPS 8 had Integrated Distributed Lock Manager.

OPS 8i had Cache Fusion Stage 1.

RAC 9i has Cache Fusion Stage 2.

The database layout for different versions has not changed.

    History Overview

    Some components have undergone changes in scope and name. The system that ensures

    that access to a block is coherent is the Global Cache Manager in Oracle9i. In Oracle8i and

    Oracle8, this was the Integrated Distributed Lock Manager. Earlier it was an external

    operating systemsupplied service that the Oracle processes called. The Cluster Group

    Service of Oracle9i and Oracle8i was the Group Membership Services module in Oracle8

    and (before that) part of the external Distributed Lock Manager.

    Although there have been many changes to the architecture in the instance, the database

structure has changed only marginally. Separate redo threads and undo spaces are still used.


    Internalizing Components

[Diagram: Oracle7 versus Oracle8. In Oracle7, the RDBMS calls the external DLM, CM, and operating system through the DLM API; there is no local state in the instance, and callbacks and enqueues go through a simulated callback/enqueue translation. In Oracle8, the IDLM is inside the RDBMS, with callbacks and enqueues working against local state in SGA memory; only the CM and operating system remain external.]

    Internalizing Components

    The development of RAC has internalized more operating system components for each

    version. As an example, the diagram on the slide shows the internalization of the

    Distributed Lock Manager (DLM) in the development of Oracle7 to Oracle8. Instead of

    calling the external operating system whenever any lock status needed checking by the

    DLM API module, the IDLM module in the Oracle server only needs to examine its SGA.

    The RDBMS routines did not in principle need to reflect the change.

    The earlier versions had the DLM external, which limited the functionality (lowest

common denominator effect) that the Oracle server could rely on, and required data to be passed to external services. Data transfer used pipes or network communication to the

    external processes; control for I/O completion used Asynchronous Trap (AST)

    mechanisms, polling mechanisms, or blocked waits. Internal communication inside the

Oracle server, even between the various background processes, can use the common

    SGA memory area that includes latches and enqueues.

    This is merely illustrative and is not an accurate summary of the changes made.

    The Oracle8 to Oracle9i development similarly internalized the GMS interface (that is, the

    Node Monitor (NM) functionality), relying on only the Cluster Manager (CM) interface

    routines.


    Oracle7

The differences between a non-OPS server and an OPS-enabled Oracle server were few:

    Database structure changes

    Separate redo per instance

    Separate undo per instance

    Addition of LCK process in instance

    Oracle7

    OPS in Oracle7 consisted of the database structural changes for cluster operation (as in all

    versions) and the addition of the LCK process that communicated with the external DLM.

    The instances not only coordinated global cache coherency through the DLM but also used

    the DLM as the communication channel for registering into the OPS cluster.

    The method for sending the SCN or other messages was platform specific.

    External DLM

    The external DLM usage had the following characteristics:

    It had to be running before any instance started.

    Resources and locks had to be adequately configured.

    Death of the DLM on a node implied death of all its clients on the node.

    OPS/DLM diagnostics had to have port-specific lock dumps.

    Internode parallel query code had to be port specific.


    Oracle8

    First stage in internalizing cluster communications:

Oracle's own lock manager in the Oracle server

New communication path for clusterwide messages

New background processes LMD and LMON

Cluster state communication through external Group Membership Service (GMS)

    Oracle8

    The internal DLM meant that resource allocation was inside the Oracle server. Diagnostic

    lock dumps no longer needed to be port specific. The Oracle server, version 8 (and later),

    started communicating with the cluster services of the operating system. The interface

    consisted of the GMS that was an Oracle-specified API. The GMS functionality included:

    Supplying each instance with the current set of registered members, clusterwide

    Notifying other members when a member joins or leaves

    Automatically deregistering dead processes/instances from their groups

    Interfacing with the node monitor for cluster events


    Oracle8i

Cache Fusion Stage 1

Read/write blocks sent via interconnect and not through the disk

CR server process BSP

More cluster communication functions as part of Oracle server code

GMS functionality split into Cluster Group Services (CGS) and Node Monitor (NM) in the skgxn v2

Lock Manager structures in shared pool

    Oracle8i

    The Cache Fusion Stage 1 satisfied some types of block requests across the cluster

    communication paths (rather than via disk) and made use of the messaging services.

The Oracle8 GMS has been split into OSD and Oracle kernel components. Node monitor OSD skgxn is extended from monitoring a single client per node to arbitrarily named process groups. The rest of the GMS functionality is moved into Oracle as CGS. A

    distributed name service is added to CGS.

    LMON executes most of the CGS functionality:

Joins the skgxn process group representing the instances of the specified group

Connects to other members and performs synchronization to ensure that all of them have the same view of group membership


    Oracle9i

Cache Fusion Stage 2

Write/write blocks handled concurrently

GCS and GES instead of IDLM

Enhanced instance availability

Instance Member Reconfiguration (IMR)

New recovery features

Enhanced messaging for inter-instance communication

    Oracle9i

    The remainder of this course is based on Oracle9i.


    Summary

In this lesson, you should have learned how to:

Determine whether to use RAC in application design

Describe RAC improvements over its predecessor


    Introduction to RAC Internals


    Objectives

After completing this lesson, you should be able to do the following:

Outline the RAC architecture with internal references

Relate the RAC-related modules to the Oracle code stack


    Simple RAC Diagram

[Diagram: three nodes, each running an instance (SGA and processes), connected to a shared cluster disk/file system and to one another by a high-speed interconnect.]

    Simple RAC Diagram

    The node contains more than just the instance. It includes the operating system, network

    stacks for various protocols, disk software, and a number of Oracle noninstance processes:

    Listener, Intelligent Agent, and the foreground/shadow server processes.

    The instance has its usual complement of background processes (more so with the RAC

    configuration). They connect to the disk system, the network, and the high-speed

    interconnect.

    The cluster disk or file system may be mirrored, RAID-based, SAN/Fiber-based, or JBOD

(just a bunch of disks). If it is a clusterwide file system, it can contain the Oracle home code. The clusterwide disks can be host-managed (that is, the controller is part of the node)

    but are serviced to the cluster and equivalent to clusterwide disks. Local disks are of little

    interest to RAC but are used for noncommon files where the common disks are raw disks.

Note: There are some issues with node-specific files of the Intelligent Agent or password file orapw when using a cluster file system. The solution varies with the platform and the CFS that are used.


    One RAC Instance

SGA contains (but is not limited to):

Library, row, and buffer caches

Global Resource Directory

Other background processes are:

LGWR, SMON, and so on

PQ, Jobs, and so on

Dispatchers and servers

Foreground processes not shown

[Diagram: a node running one instance with its SGA and the background processes LMON, DIAG, LMD, LMS, LCK, DBW0, and PMON; the CM runs on the node outside the instance.]

    One RAC Instance

This is the traditional view of an instance and its background processes. All processes are, however, the same program (oracle.exe or oracle), just instantiated with different startup parameters (see source opirip and WebIV Note:33174.1). On Windows, this is more apparent; there is clearly only one Oracle process showing in the Task Manager, but

    with a number of threads.

    All caches in the SGA are either global and must be coherent across all instances, or they

    are local. The library, row (also called dictionary), and buffer caches are global. The large

and Java pool buffers are local. For RAC, the Global Resource Directory is global in itself and also used to control the coherency.

    The LMON process communicates with its partner process on the remote nodes. Other

    processes may have message exchanges with peer processes on the other nodes (for

    example, PQ). The LMS and LMD processes, for example, may directly receive requests

    from remote processes.

    The Cluster Monitor (CM) system communicates with the other CMs on other nodes and is

    not part of the Oracle RAC instance. But it is a necessary component.


    Internal RAC Instance

kqlm: Library cache (fusion)

kqr: Dictionary/row cache

kcl: Buffer cache

ksi: Instance locks

kjb: Global Cache Service

kju: Global Enqueue Service

CGS: Cluster Group Services

NM: Node Monitor

IPC: Interprocess Communication

[Diagram: inside the instance, the cache layers kql/kqlm, kqr, and kcl sit above ksi, the GCS (kjb) and GES (kju), CGS (kjxg), the NM (skgxn v2), and the IPC layer (skgxp); the NM communicates with the CM on the node.]

    Internal RAC Instance

    This is an internal view of some of the instance code stack and the RAC-relevant sections

    and modules.

    The NM layer is the communication layer to the CM. The IPC services facilitate other

    process to process communication on different instances.

The CGS maintains the state of the RAC cluster, knowing which instances are in the

    cluster and which are not. Contrast this with the node availability.

    The GRD is the data structure that stores Global Enqueue and Global Cache objects; it is

aware of every clusterwide resource. Resources are typically a buffer element, like a data buffer, or a data file, but can also be abstract entities, such as an enqueue or NM resource.

The three buffer caches are used by the various user foreground processes by calling handling routines (kqlm, kqr, kcl) for allocation, deallocation, and locking. The handling routines maintain coherency by using kcl. The data buffer cache is the sole user

    of the GCS.

Note: Other skg-interfaces, such as skgfr (disk I/O), are not shown.


    Oracle Code Stack

OCI: Oracle Call Interface

UPI: User Program Interface

OPI: Oracle Program Interface

KK: Kernel Compilation Layer

KX: Kernel Execution Layer

K2: Kernel Distributed Execution Layer

NPI: Network Program Interface

KZ: Kernel Security Layer

KQ: Kernel Query Layer

RPI: Recursive Program Interface

KA: Kernel Access Layer

KD: Kernel Data Layer

KT: Kernel Transaction Layer

KC: Kernel Cache Layer

KS: Kernel Services Layer

KJ: Kernel Lock Management Layer

KG: Kernel Generic Layer

S: Operating System Dependencies

    Oracle Code Stack

The first few characters of the routine and structure names indicate which layer in the code

stack they come from. For example, a name beginning with kcl belongs to the kernel cache (KC) layer, and one beginning with kjb belongs to the kernel lock management (KJ) layer.


    RAC Component List

    This course examines the following RAC componentlist:

    Cluster Layer and Cluster Manager (CM)

    Node Monitor (NM)

    Cluster Group Services (CGS)

Global Cache Service and Global Enqueue Service (GCS and GES)

    Interprocess Communication (IPC)

    Cache Fusion in the GCS

    Cache Fusion Recovery

RAC Component List

    This course examines the components listed in the slide. This is the stack, with the most

    fundamental module listed first (with some exceptions).


    Module Relation View

[Diagram: module relationships among ORACLE, the DLM (GRD) with its GCS, GES, and DRM/FR components, CGS/IMR, NM, IPC, KSXP, SKGXP, and SKGXN.]

    Module Relation View

    GCS: Global Cache Service, or PCM locks

    GES: Global Enqueue Service, or non-PCM locks

    DRM/FR: Dynamic Resource Mastering/Fast Reconfiguration. Only partially activated in

    a standard Oracle9i Release 2 installation.

    IMR: Instance Membership Recovery. LMON handles instance death and split brain (two

    networks).

KSXP: Multiplexing service (multithreaded layer). Allows DLM to do a lazy send; ksxp informs the client after the send is completed.

NM: Node Monitor. Instances joining and leaving the cluster

    IPC: Interprocess Communication. There is usually a choice of underlying protocols to

    use, depending on the platform and hardware. The default is UDP (light; consumes no

resources/connections); alternatives include memory-mapped I/O (enhancements to the IPC interface used by Cache

Fusion) and port-based communication.

CGS: Cluster Group Service. Handles synchronization of the membership bitmap. Also a name service for

    publishing and querying configuration data. CGS in Oracle9i is changed from earlier

    versions to speed up the reconfiguration.


    Alternate Module Relation View

[Diagram: alternate view relating client code (kcl, ksq, ksi) and PQ to the DLM and CGS, which use KSXP over SKGXP.]


    Module, Code Stack, Process

The same code is present in all foreground and background processes.

Modules may be constrained to run in a specific process.

    Module, Code Stack, Process

    Although the running Oracle server consists of several processes (both foreground and

    background), remember that this is the same program that runs in all processes. Processes

    are limited to performing a set of functions, and thus some code is active in only some

    processes. Thus there is no LMON program module, but some routines in the KJB source

    modules have a comment stating that the function runs only in the LMON process. This is

confusing when examining code in which one process calls another.

Cross-process calls require a message or posting, and execution may have to wait until the

    called process starts executing; in other words, a context switch must occur.
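
The following is a generic illustration, not Oracle code, of why a cross-process call implies a post and a wait: the caller hands work to another execution context and cannot proceed until that context is scheduled and signals completion. POSIX threads stand in for separate processes here.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  posted = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  done   = PTHREAD_COND_INITIALIZER;
static int work_posted = 0, work_done = 0;

/* Stand-in for a background process such as LMON: it sleeps until posted. */
static void *background(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!work_posted)
        pthread_cond_wait(&posted, &lock);   /* wait to be scheduled */
    printf("background: doing the requested work\n");
    work_done = 1;
    pthread_cond_signal(&done);              /* post the caller back */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t bg;
    pthread_create(&bg, NULL, background, NULL);

    /* "Calling" the other process: post it, then wait for completion. */
    pthread_mutex_lock(&lock);
    work_posted = 1;
    pthread_cond_signal(&posted);
    while (!work_done)
        pthread_cond_wait(&done, &lock);     /* context switch happens here */
    pthread_mutex_unlock(&lock);

    pthread_join(bg, NULL);
    printf("caller: work completed by the other context\n");
    return 0;
}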

    On the Windows platform, there is only one process. The various Oracle server processes

    are implemented as threads inside this program.


Operating System Dependencies (OSD)

Code that must be separate for each platform is typically collected in OSD modules.

Generic version: Runs on development system

Reference version: Classic version ported to all platforms

Platform version: Optimized and specialized; several versions may exist.

OSD code is bracketed with #ifdef ... #endif in some modules.

    Operating System Dependencies (OSD)

    This applies to many other Oracle server products or functions but is much more visible

    with RAC.

If the platform dependency is small, it may be bracketed by the #ifdef ... #endif construction; otherwise, a common routine is called in an OSD module, which is

appropriately rewritten for each platform. Such modules are generic. For example, refer to the skgxnr.c module.
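
A hedged sketch of the #ifdef ... #endif style for a small platform difference; the routine name sxyz_gettime is invented and does not correspond to any real OSD symbol.

#include <stdio.h>

#if defined(_WIN32)
#include <windows.h>
/* Windows branch of the platform-dependent routine. */
static double sxyz_gettime(void)
{
    return (double)GetTickCount64() / 1000.0;   /* seconds since boot */
}
#else
#include <sys/time.h>
/* Unix branch of the same routine. */
static double sxyz_gettime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);                    /* seconds since the epoch */
    return tv.tv_sec + tv.tv_usec / 1e6;
}
#endif

int main(void)
{
    printf("time = %f\n", sxyz_gettime());
    return 0;
}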

    For some OSD modules, there may be more than one version. For example, the IPC

implementation has a number of protocols to be used. One OSD module with the same interface is written for each protocol. Only one module is linked to the Oracle server, thus

    deciding the IPC protocol to be used.

    Where several implementations are possible, a reference module is constructed. This is

runnable on all platforms and is the lowest common denominator. It proves functionality

    and is used to verify the correct functionality of the other specialized version of the

    module. However, it may not be used.


Platform-Specific RAC

These are kernel routines, so the names start with K.

Service routines start with KS.

OSD routines start with S or SS.

OSD code is written by the porting groups.

[Diagram: the code stack from the higher layers (SQL, transaction, data) down through the cache (KC*), GES and GCS (KJ*), services (KS*), the generic layer (KG*, common functions), and platform-specific OSD code (S*) to the operating system routines.]

    Platform-Specific RAC

    Many RAC problems are platform specific. The Operating System Dependency (OSD)

layer therefore must be examined for the platform concerned. The subdirectory is called sosd or osds.

    This cannot be examined in TAO with cscope; you need the vobs access.

OSD code is partially available at /export/home/ssupport/920/rdbms/src/server/osds.



    OSD Module: Example

[Diagram: the SKGXP module in three alternative versions. skgxp.h defines the generic interface (1); skgxp.c is the reference implementation; sskgxpu.c is the port-specific UDP implementation; sskgxph.c is the port-specific HMP implementation (HP-UX). Each version calls the OS routines for its protocol (UDP, TCP, or HMP) through the OS API.]

    OSD Module: Example

    A module that needs to call the operating system must be port specific. Calling an I/O

    routine may vary in name, arguments, and other particulars between platforms, even

    though they give the same functionality.

The skgxp module has an official upward API (1). Internally, there are some common functions and one way of achieving the necessary communication function of the SKGXP.

    The UDP option, for example, performs the required OS-related calls through the OS API

    (3) that send, receive, check status, and so on, by using UDP packets. It also possibly has

some code to hide or simulate functions so that the common set (2) is maintained. The functions are similar for the other protocol options.

    The reference implementation is made to compile and work on all platforms, but the whole

    module is additionally rewritten by most platform groups. As explained previously, a

    platform group makes several versions by using different protocols. This is selected at link

    time by using the appropriate library. The HMP module, shown in this example, is only

available on the HP platform.
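
A hedged sketch of the one-interface, several-implementations pattern, compressed into a single file so that it compiles; the names (xport_ops_t, udp_send, and so on) are invented. In the real source tree each variant lives in its own OSD file, and the platform group selects exactly one at link time.

#include <stdio.h>

/* The generic, upward-facing interface (compare the role of skgxp.h). */
typedef struct xport_ops {
    const char *name;
    int (*send)(const void *buf, int len);
    int (*wait)(int handle);
} xport_ops_t;

/* One set of routines per protocol; a real build would link exactly one
 * implementation file, selected by the platform group at link time. */
static int udp_send(const void *buf, int len) { (void)buf; return len; }
static int udp_wait(int handle)               { (void)handle; return 0; }
static const xport_ops_t xport_udp_ops = { "udp", udp_send, udp_wait };

static int ref_send(const void *buf, int len) { (void)buf; return len; }
static int ref_wait(int handle)               { (void)handle; return 0; }
static const xport_ops_t xport_ref_ops = { "reference", ref_send, ref_wait };

int main(void)
{
    /* Stand-in for the link-time choice of library. */
    const xport_ops_t *ops = &xport_udp_ops;
    (void)xport_ref_ops;

    printf("using %s transport, sent %d bytes\n",
           ops->name, ops->send("hi", 2));
    return ops->wait(0);
}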


    OSD Module: Example (continued)

    Dependencies on the OSD Module

For the skgxp module, some OSD variants have additional interfaces callable from higher modules. The kcl module, for example, can call for a special memory map pointer for the HMP protocol. Higher levels in the stack have #ifdef ... #endif bracketed calls to the extended sskgxph.


    Summary

In this lesson, you should have learned about the:

RAC architecture outline with internal references

Relationship between the RAC-related modules and the Oracle code stack


    References

    Main sources for general RAC information:

    RAC Web site

    http://rac.us.oracle.com:7778

    RAC Pack repository on OFO

    http://files.oraclecorp.com/content/AllPublic/Workspaces/RAC%20Pack-Public/

    WebIV

    Check folder Server.HA.RAC


    Cluster Layer

    Cluster Monitor


    Objectives

After completing this lesson, you should be able to:

Describe the generic Cluster Manager (CM) functionality

Outline the interaction between CM and RAC cluster layers


    RAC and Cluster Software

[Diagram: a node running one instance. Inside the instance, the caches and ksi/ksq/kcl sit above the GRD, CGS, NM, and IPC layers; the NM communicates with the CM on the node, and the IPC layer connects to the other nodes (not shown).]

    Cluster Layer in RAC

    The cluster layer is not part of the RAC instance. The Cluster Manager (CM) is part of the

    cluster layer.

    It has its own communication path with the peer cluster software on other nodes. It can

    determine the status of other nodes in the cluster but does not maintain any consistent view.

    Most of the synchronization and consistency is handled in the Node Monitor (NM).


Generic CM Functionality: Distributed Architecture

Local cluster manager daemons

All daemons make up the Cluster Manager

    One daemon elected as master node

    Generic CM Functionality: Distributed Architecture

    Every node in the cluster must have a local CM daemon(s) running. The set of all CM

    daemons makes up the Cluster Manager. The CM daemons on all nodes communicate with

    one another. The CM daemons on all nodes may elect a master node, which is responsible

    for managing cluster state transitions.

Upon communication failure, the remaining CM daemons form a new cluster using an

    established protocol and re-elect a new master if necessary.

    The CM and the RAC cluster are distinct entities acting as physically distinct services. The

CM is responsible for cluster consistency. The CM detects and manages cluster state transitions. The CM coordinates RAC cluster recovery brought about by cluster state

    transitions.
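
A toy illustration, not the actual CM protocol, of re-electing a master after a membership change; choosing the lowest-numbered surviving node is only an example policy.

#include <stdio.h>

#define MAX_NODES 4

/* 1 = daemon alive on that node, 0 = node (or its daemon) has failed. */
static int alive[MAX_NODES] = { 1, 1, 1, 1 };

/* Example policy only: the surviving daemon with the lowest node number
 * becomes master.  The real election protocol is CM-implementation specific. */
static int elect_master(void)
{
    for (int node = 0; node < MAX_NODES; node++)
        if (alive[node])
            return node;
    return -1;   /* no surviving members: no cluster */
}

int main(void)
{
    printf("initial master: node %d\n", elect_master());

    alive[0] = 0;   /* node 0 fails: the remaining daemons re-elect */
    printf("master after failure of node 0: node %d\n", elect_master());
    return 0;
}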


Generic CM Functionality: Cluster State

State change

Cluster Incarnation Number

    Cluster Membership List

    IDLM Membership List

    Generic CM Functionality: Cluster State

    A cluster is said to change state when one or more nodes join or leave the cluster. This

    transition is complete when the cluster moves from a previous stable configuration to a

    new one. Each stable configuration is identified by a number called the cluster incarnation

    number. Every state change in the cluster monotonically increases the cluster incarnation

number.

    The set of all nodes in a cluster form a cluster membership list. The set of all nodes in the

cluster where the RAC IDLM is running form an IDLM membership list. Every node in a

cluster is identified by a node-ID provided by the CM, which remains unchanged during the lifetime of a cluster. The IDLM uses this node-ID to identify and distinguish between

members in the IDLM membership list.
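
A small C sketch of the state described above; the structure and routine names are invented, but the example shows the incarnation number increasing monotonically with each membership change while node-IDs stay fixed.

#include <stdio.h>

#define MAX_NODES 8

typedef struct cluster_state {
    unsigned long incarnation;   /* cluster incarnation number        */
    int member[MAX_NODES];       /* 1 if node-ID is in the membership */
} cluster_state_t;

/* Any join or leave produces a new stable configuration, so the
 * incarnation number is bumped; the node-ID itself never changes. */
static void membership_change(cluster_state_t *cs, int node_id, int joining)
{
    cs->member[node_id] = joining;
    cs->incarnation++;
}

int main(void)
{
    cluster_state_t cs = { 0, { 0 } };

    membership_change(&cs, 0, 1);   /* node 0 joins  */
    membership_change(&cs, 1, 1);   /* node 1 joins  */
    membership_change(&cs, 1, 0);   /* node 1 leaves */

    printf("incarnation %lu, node 0 member=%d, node 1 member=%d\n",
           cs.incarnation, cs.member[0], cs.member[1]);
    return 0;
}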


Generic CM Functionality: Node Failure Detection

Node failure detection

Communication failure detection

Generic CM Functionality: Node Failure Detection

To ensure the integrity of the cluster, the CM must detect node failures. The RAC cluster may

suspect node failure (for example, a communication failure with a node), in which case it may:

    Freeze activity and expect a message from the CM to start reconfiguration

    Inform the CM of an error condition and await reconfiguration notification after a

    new stable cluster state is established

    If the CM and RAC cluster are to detect the same communication failures, CM should

    monitor cluster health on the same physical circuit used by the RAC cluster (for example,

on HP, use of HMP). Performance considerations may require the CM and RAC cluster to use separate virtual circuits.

    If the CM and RAC cluster are using separate physical circuits, the CM should be aware of

the RAC cluster's physical circuit and monitor for cluster health via the same circuit. The

    CM may provide for physical circuit redundancy for failover and performance.

    RAC Cluster reconfiguration is begun after a cluster has reached a new stable state.

    CM must be able to handle nested state transitions and communicate these state

    changes to the RAC cluster.

    Nested cluster transitions interrupt any in-process RAC cluster reconfiguration.


    Cluster Layer and Cluster Manager

RAC cluster registers the instance in the CM.

Primarily the LMON process

Secondarily other I/O-capable processes (DBWR, PQ-slaves, ...)

Obtains Node-ID from cluster

[Diagram: the instance's NM layer communicates with the CM on the node.]

    Cluster Layer and Cluster Manager

    The Cluster Manager is a vendor- or Oracle-provided facility to communicate between all

    the nodes in the cluster about node state. The CM uses a different protocol or channel. It

    uses heartbeat and sanity checks to validate node status. The RAC proces