
ibm.com/redbooks

Front cover

Deploying Oracle 10g RAC on AIX V5 with GPFS

Octavian Lascu
Mustafa Mah
Michel Passet
Harald Hammershøi
SeongLul Son
Maciej Przepiórka

Understand clustering layers that help harden your configuration

Learn System p virtualization and advanced GPFS features

Deploy disaster recovery and test scenarios


International Technical Support Organization

Deploying Oracle 10g RAC on AIX V5 with GPFS

April 2008

SG24-7541-00


© Copyright International Business Machines Corporation 2008. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

First Edition (April 2008)

This edition applies to Version 3, Release 1, Modification 6 of IBM General Parallel File System (product number 5765-G66), Version 5, Release 3 of IBM High Availability Cluster Multi-Processing (product number 5765-F62), Version 5, Release 3, Technology Level 6 of AIX (product number 5765-G03), and Oracle CRS Version 10 Release 2 and Oracle RAC Version 10 Release 2.

Note: Before using this information and the product it supports, read the information in “Notices” on page vii.


Contents

Notices . . . . . . . . . . vii
Trademarks . . . . . . . . . . viii

Preface . . . . . . . . . . ix
The team that wrote this book . . . . . . . . . . ix
Become a published author . . . . . . . . . . xi
Comments welcome . . . . . . . . . . xi

Part 1. Concepts and configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Chapter 1. Introduction . . . . . . . . . . 3
1.1 Why clustering . . . . . . . . . . 4
1.2 Architectural considerations . . . . . . . . . . 4
1.2.1 RAC and Oracle Clusterware . . . . . . . . . . 4
1.2.2 IBM GPFS . . . . . . . . . . 10
1.3 Configuration options . . . . . . . . . . 10
1.3.1 RAC with GPFS . . . . . . . . . . 11
1.3.2 RAC with automatic storage management . . . . . . . . . . 13
1.3.3 RAC with HACMP and CLVM . . . . . . . . . . 16

Chapter 2. Basic RAC configuration with GPFS . . . . . . . . . . 19
2.1 Basic scenario . . . . . . . . . . 20
2.1.1 Server hardware configuration . . . . . . . . . . 20
2.1.2 Operating system configuration . . . . . . . . . . 21
2.1.3 Adding the user and group for Oracle . . . . . . . . . . 25
2.1.4 Enabling remote command execution . . . . . . . . . . 28
2.1.5 System configuration parameters and network options . . . . . . . . . . 28
2.1.6 GPFS configuration . . . . . . . . . . 30
2.1.7 Special consideration for GPFS with Oracle . . . . . . . . . . 44
2.2 Oracle 10g Clusterware installation . . . . . . . . . . 46
2.3 Oracle 10g Clusterware patch set update . . . . . . . . . . 70
2.4 Oracle 10g database installation . . . . . . . . . . 76
2.5 Networking considerations . . . . . . . . . . 76
2.6 Considerations for Oracle code on shared space . . . . . . . . . . 88
2.7 Dynamic partitioning and Oracle 10g . . . . . . . . . . 89
2.7.1 Dynamic memory changes . . . . . . . . . . 90
2.7.2 Dynamic CPU allocation . . . . . . . . . . 95

Part 2. Configurations using dedicated resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Chapter 3. Migration and upgrade scenarios . . . . . . . . . . 103
3.1 Migrating a single database instance to GPFS . . . . . . . . . . 104
3.1.1 Moving JFS2-based ORACLE_HOME to GPFS . . . . . . . . . . 105
3.1.2 Moving JFS2-based datafiles to GPFS . . . . . . . . . . 105
3.1.3 Moving raw devices database files to GPFS . . . . . . . . . . 106
3.2 Migrating from Oracle single instance to RAC . . . . . . . . . . 106
3.2.1 Setting up the new node . . . . . . . . . . 107
3.2.2 Add the new node to existing (single node) GPFS cluster . . . . . . . . . . 107
3.2.3 Installing and configuring Oracle Clusterware using OUI . . . . . . . . . . 107



3.2.4 Installing Oracle RAC option using OUI . . . . . . . . . . 108
3.2.5 Configure database for RAC . . . . . . . . . . 108
3.2.6 Configuring Transparent Application Failover . . . . . . . . . . 115
3.2.7 Verification . . . . . . . . . . 116
3.3 Adding a node to an existing RAC . . . . . . . . . . 117
3.3.1 Add the node to Oracle Clusterware . . . . . . . . . . 117
3.3.2 Adding a new instance to existing RAC . . . . . . . . . . 121
3.3.3 Reconfiguring the database . . . . . . . . . . 122
3.3.4 Final verification . . . . . . . . . . 122
3.4 Migrating from HACMP-based RAC cluster to GPFS using RMAN . . . . . . . . . . 123
3.4.1 Current raw devices with HACMP . . . . . . . . . . 123
3.4.2 Migrating data files to GPFS . . . . . . . . . . 124
3.4.3 Migrating the temp tablespace to GPFS . . . . . . . . . . 127
3.4.4 Migrating the redo log files to GPFS . . . . . . . . . . 127
3.4.5 Migrating the spfile to GPFS . . . . . . . . . . 128
3.4.6 Migrating the password file . . . . . . . . . . 129
3.4.7 Removing Oracle Clusterware . . . . . . . . . . 129
3.4.8 Removing HACMP filesets and third-party clusterware information . . . . . . . . . . 130
3.4.9 Reinstalling Oracle Clusterware . . . . . . . . . . 130
3.4.10 Switch link two library files and relink database . . . . . . . . . . 131
3.4.11 Starting listeners . . . . . . . . . . 132
3.4.12 Adding a database and instances . . . . . . . . . . 133
3.5 Migrating from RAC with HACMP cluster to GPFS using dd . . . . . . . . . . 133
3.5.1 Logical volume type and the dd copy command . . . . . . . . . . 134
3.5.2 Migrate control files to GPFS . . . . . . . . . . 135
3.5.3 Migrate data files to GPFS . . . . . . . . . . 136
3.6 Upgrading from HACMP V5.2 to HACMP V5.3 . . . . . . . . . . 137
3.7 GPFS upgrade from 2.3 to 3.1 . . . . . . . . . . 138
3.7.1 Upgrading using the mmchconfig and mmchfs commands . . . . . . . . . . 139
3.7.2 Upgrading using mmexportfs, cluster recreation, and mmimportfs . . . . . . . . . . 140
3.8 Moving OCR and voting disks from GPFS to raw devices . . . . . . . . . . 144
3.8.1 Preparing the raw devices . . . . . . . . . . 144
3.8.2 Moving OCR . . . . . . . . . . 147
3.8.3 Moving CRS voting disks . . . . . . . . . . 148

Part 3. Disaster recovery and maintenance scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

Chapter 4. Disaster recovery scenario using GPFS replication . . . . . . . . . . 153
4.1 Architectural considerations . . . . . . . . . . 154
4.1.1 High availability: One storage . . . . . . . . . . 154
4.1.2 Disaster recovery: Two storage subsystems . . . . . . . . . . 155
4.2 Configuration . . . . . . . . . . 157
4.2.1 SAN configuration for the two production sites . . . . . . . . . . 157
4.2.2 GPFS node configuration using three nodes . . . . . . . . . . 158
4.2.3 Disk configuration using GPFS replication . . . . . . . . . . 162
4.2.4 Oracle 10g RAC clusterware configuration using three voting disks . . . . . . . . . . 166
4.3 Testing and recovery . . . . . . . . . . 173
4.3.1 Failure of a GPFS node . . . . . . . . . . 174
4.3.2 Recovery when the GPFS node is back . . . . . . . . . . 177
4.3.3 Loss of one storage unit . . . . . . . . . . 177
4.3.4 Fallback after the GPFS disks are recovered . . . . . . . . . . 181
4.3.5 Site disaster (node and disk failure) . . . . . . . . . . 182



4.3.6 Recovery after the disaster . . . . . . . . . . 183
4.3.7 Loss of one Oracle Clusterware voting disk . . . . . . . . . . 183
4.3.8 Loss of a second Oracle Clusterware (CRS) voting disk . . . . . . . . . . 184

Chapter 5. Disaster recovery using PPRC over SAN . . . . . . . . . . 185
5.1 Architecture . . . . . . . . . . 186
5.2 Implementation . . . . . . . . . . 187
5.2.1 Storage and PPRC configuration . . . . . . . . . . 187
5.2.2 Recovering from a disaster . . . . . . . . . . 190
5.2.3 Restoring the original configuration (primary storage in site A) . . . . . . . . . . 192

Chapter 6. Maintaining your environment . . . . . . . . . . 195
6.1 Database backups and cloning with GPFS snapshots . . . . . . . . . . 196
6.1.1 Overview of GPFS snapshots . . . . . . . . . . 196
6.1.2 GPFS snapshots and Oracle Database . . . . . . . . . . 199
6.1.3 Examples . . . . . . . . . . 202
6.2 GPFS storage pools and Oracle data partitioning . . . . . . . . . . 206
6.2.1 GPFS 3.1 storage pools . . . . . . . . . . 207
6.2.2 GPFS 3.1 filesets . . . . . . . . . . 209
6.2.3 GPFS policies and rules . . . . . . . . . . 209
6.2.4 Oracle data partitioning . . . . . . . . . . 212
6.2.5 Storage pools and Oracle data partitioning example . . . . . . . . . . 213

Part 4. Virtualization scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

Chapter 7. Highly available virtualized environments . . . . . . . . . . 227
7.1 Virtual networking environment . . . . . . . . . . 229
7.1.1 Configuring EtherChannel with NIB . . . . . . . . . . 231
7.1.2 Testing NIB failover . . . . . . . . . . 232
7.2 Disk configuration . . . . . . . . . . 234
7.2.1 External storage LUNs for Oracle 10g RAC data files . . . . . . . . . . 234
7.2.2 Internal disk for client LPAR rootvg . . . . . . . . . . 237
7.3 System configuration . . . . . . . . . . 238

Chapter 8. Deploying test environments using virtualized SAN . . . . . . . . . . 241
8.1 Totally virtualized simple architecture . . . . . . . . . . 242
8.1.1 Disk configuration . . . . . . . . . . 243
8.1.2 Network configuration . . . . . . . . . . 243
8.1.3 Creating virtual adapters (VIO server and clients) . . . . . . . . . . 244
8.1.4 Configuring virtual resources in VIO server . . . . . . . . . . 249



Part 5. Appendixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253

Appendix A. EtherChannel parameters on AIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

Appendix B. Setting up trusted ssh in a cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

Appendix C. Creating a GPFS 2.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

Appendix D. Oracle 10g database installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

Appendix E. How to cleanly remove CRS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

Abbreviations and acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

Related publications . . . . . . . . . . 287
IBM Redbooks publications . . . . . . . . . . 287
Other publications . . . . . . . . . . 287
Online resources . . . . . . . . . . 287
How to get IBM Redbooks publications . . . . . . . . . . 287
Help from IBM . . . . . . . . . . 288

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289



Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.



Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

Redbooks (logo)®, eServer™, AIX 5L™, AIX®, Blue Gene®, DS4000™, DS6000™, DS8000™, Enterprise Storage Server®, General Parallel File System™, GPFS™, HACMP™, IBM®, POWER™, POWER3™, POWER4™, POWER5™, Redbooks®, System p™, System p5™, System Storage™, Tivoli®, TotalStorage®

The following terms are trademarks of other companies:

Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are registered trademarks of Oracle Corporation and/or its affiliates.

Snapshot, and the Network Appliance logo are trademarks or registered trademarks of Network Appliance, Inc. in the U.S. and other countries.

InfiniBand, and the InfiniBand design marks are trademarks and/or service marks of the InfiniBand Trade Association.

Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.



Preface

This IBM Redbooks publication will help you architect, install, tailor, and configure Oracle® 10g RAC on System p™ clusters running AIX®. We describe the architecture and how to design, plan, and implement a highly available infrastructure for Oracle database using the IBM® General Parallel File System™ V3.1.

This book gives a broad understanding of how Oracle 10g RAC can use and benefit from the virtualization facilities embedded in the System p architecture, and how to efficiently use the tremendous computing power and availability characteristics of the POWER5™ hardware and the AIX 5L™ operating system.

This book also helps you design and create a solution to migrate your existing Oracle 9i RAC configurations to Oracle 10g RAC by simplifying configurations and making them easier to administer and more resilient to failures.

This book also describes how to quickly deploy Oracle 10g RAC test environments, and how to use some of the built-in disaster recovery capabilities of IBM GPFS™ and storage subsystems to make your cluster resilient to various failures.

The team that wrote this book

This book was produced by a team of specialists from around the world working at the International Technical Support Organization, Austin Center.

Octavian Lascu is a project leader at the International Technical Support Organization, Poughkeepsie. He writes extensively and teaches IBM classes worldwide on all areas of AIX and Linux® Clustering. His areas of expertise include AIX, UNIX®, high availability and high performance computing, systems management, and application architecture. He holds a Master’s degree in Electronic Engineering from Polytechnical Institute in Bucharest, Romania. Before joining the ITSO six years ago, Octavian worked in IBM Global Services, Romania, as a Software and Hardware Services Manager. He has worked for IBM since 1992.

Mustafa Mah is an Advisory Software Engineer working for IBM System and Technology Group in Poughkeepsie, New York. He currently provides problem determination and technical assistance in the IBM General Parallel File System (GPFS) to clients on IBM System p, System x, and Blue Gene® clusters. He previously worked as an application developer for the IBM System and Technology Group supporting client fulfillment tools. He holds a Bachelor of Science in Electrical Engineering from the State University of New York in New Paltz, New York, and a Master of Science in Software Development from Marist College in Poughkeepsie, New York.

Michel Passet is a benchmark manager at the PSSC Customer Center in Montpellier, France. He manages benchmarks on an AIX System p environment. He has over twenty years of experience in IT, especially with AIX and Oracle. His areas of expertise include designing highly available and disaster resilient infrastructures for worldwide clients. He has been a speaker at international conferences for five years. He has written other IBM Redbooks publications extensively and published white papers. He presents briefings in the



Montpellier Executive Briefing Center and teaches education courses. He also provides on-site technical support in the field. He holds a degree in Computer Science Engineering.

Harald Hammershøi is an IT specialist, currently working as an IT Architect in the Danish APMM account architect core team. He has over 16 years of experience in the IT industry. He holds a “Civilingeniør i system konstruktion” (the Danish equivalent of an M.Sc. EE) from Aalborg University in Denmark. His areas of expertise include architecture, performance tuning, optimization, high availability, and troubleshooting on various database platforms, especially on Oracle 9i and 10g RAC installations.

SeongLul Son is a Senior IT specialist working at IBM Korea. He has eleven years of experience in the IT industry, and his expertise includes networking, e-learning, System p virtualization, HACMP™, and GPFS with Oracle. He has written extensively about GPFS implementation, database migration, and Oracle in a virtualized environment in this publication. He also co-authored the AIX 5L Version 5.3 Differences Guide and AIX 5L and Windows® 2000: Solutions for Interoperability IBM Redbooks® publications in previous residencies.

Maciej Przepiorka is an IT Architect with the IBM Innovation Center in Poland. His job is to provide IBM Business Partners and clients with IBM technical consulting and equipment. His areas of expertise include technologies related to IBM System p servers running AIX, virtualization, and information management systems, including Oracle databases (architecture, clustering, RAC, performance tuning, optimization, and problem determination). He has over 12 years of experience in the IT industry and holds an M.Sc. Eng. degree in Computer Science from Warsaw University of Technology, Faculty of Electronics and Information Technology.

Authors: Mustafa (insert), Michel, SeongLul (SL), Harald, Octavian, and Maciej (Mike)

Thanks to the following people for their contributions to this project:

Oracle/IBM Joint Solution Center in Montpellier, France, for reviewing the draft



Oracle Romania

Dino QuinteroIBM Poughkeepsie

Andrei SocolicIBM Romania

Cristian StanciuIBM Romania

Christian Allan SchmidtIBM Denmark

Rick PiaseckiIBM Austin

Jonggun ShinGoodus Inc., Korea

Renee JohnsonITSO Austin

The authors of the previous edition of this book, Deploying Oracle9i RAC on eServer Cluster 1600 with GPFS, SG24-6954, published in October 2003:

� Octavian Lascu
� Vigil Carastanef
� Lifang Li
� Michel Passet
� Norbert Pistoor
� James Wang

Become a published author

Join us for a two- to six-week residency program! Help write a book dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You will have the opportunity to team with IBM technical professionals, Business Partners, and Clients.

Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you will develop a network of contacts in IBM development labs, and increase your productivity and marketability.

Find out more about the residency program, browse the residency index, and apply online at:

ibm.com/redbooks/residencies.html

Comments welcome

Your comments are important to us!



We want our books to be as helpful as possible. Send us your comments about this book or other IBM Redbooks publications in one of the following ways:

� Use the online Contact us review IBM Redbooks publication form found at:

ibm.com/redbooks

� Send your comments in an e-mail to:

[email protected]

� Mail your comments to:

IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400



Part 1 Concepts and configurations

Part one introduces clustering concepts and discusses various cluster types, such as:

� High availability
� Load balancing
� Disaster recovery






Chapter 1. Introduction

This chapter provides an overview of the infrastructure and clustering technologies that you can use to deploy a highly available, load balancing database environment using Oracle 10g RAC and IBM System p, running AIX and IBM General Parallel File System (GPFS). We also provide information about various other storage management techniques.




1.1 Why clustering

A cluster1 is a group of computers connected together using a form of network. The computers are managed by specially designed software. This software makes the cluster appear as a single entity and is used for running an application distributed among nodes in the cluster.

In general, clusters are used to provide higher performance and availability than what a single computer or application can deliver. They are typically more cost-effective than solutions based on single computers of similar performance2.

According to their purpose, clusters are generally classified as:

� High availability

Cluster components are redundant, and a node failure has limited impact on provided service.

� Load sharing and balancing

An application is cluster aware and distributes its tasks (load) on multiple nodes.

� High performance computing

Using special libraries and programming techniques, numerically intensive computing tasks are parallelized to use the computing power of multiple nodes managed by a clustering infrastructure.

However, in most cases, a clustering solution provides more than one benefit; for example, a high availability cluster can also provide load balancing for the same application.

Today’s commercial environments require that their applications are available 24x7x365. For commercial environments, high availability and load balancing are key features for IT infrastructure. Applications must be able to work with hardware and operating systems to deliver according to the agreed upon service level.

1.2 Architectural considerations

This section describes the clustering infrastructure that we use for this project. Although this publication contains other clustering and storage methods, we concentrate on Oracle Clusterware.

1.2.1 RAC and Oracle Clusterware

This section is based on Oracle Clusterware and Oracle Real Application Clusters, B14197-03, and the Administration and Deployment Guide, 10g Release 2 (10.2), B14197-04.

The idea of having multiple instances accessing the same physical data files traces back to Oracle 7 (actually, on the Virtual Memory System (VMS) platform, it started back in 1988 with Oracle 6). Oracle 7 was developed to scale horizontally when a single symmetric multiprocessing (SMP) server did not provide adequate performance.

1 According to Oxford University Press’ American Dictionary of Current English, a cluster is: “A number of things of the same sort gathered together or growing together”.

2 Performance calculated using standardized benchmark programs.



However, scalability depended on application partitioning because of an issue with using physical input/output (I/O) to exchange information, known as pinging. When an instance requested to update a block that had been modified by another instance but not yet written to disk, the block was first written to disk by the instance that had modified it (block cleaning), and then read by the instance requesting it.

Due to locking granularity, false pinging could occur for blocks that were already cleaned. Oracle 8 introduced fine-grained locking, which eliminated false pinging.

Oracle 8.1 Parallel Server introduced the Cache Fusion mechanism for consistent reads (that is, exchanging data blocks through an interconnect network to avoid a physical disk I/O read operation).

Starting with Oracle 9i Real Application Clusters (RAC), consistent read and current read operations use the cache fusion mechanism.

In Oracle 10g, the basic cluster functionality and the Real Application Clusters (RAC) database option are split into two products:

� The basic cluster functionality is now Oracle Clusterware (10.2 and forward, Cluster Ready Services (CRS) in 10.1).

� CRS is now a component of Oracle Clusterware. Most Oracle Clusterware commands reflect the former name, CRS.

Oracle 10g RAC uses Oracle Clusterware for the infrastructure to bind multiple servers so that they can operate as a single system.

Oracle Clusterware is a cluster management solution that is integrated with Oracle database. The Oracle Clusterware is also a required component when using RAC. In addition, Oracle Clusterware enables both single-instance Oracle databases and RAC databases to use the Oracle high availability infrastructure.

In the past, Oracle RAC configurations required vendor specific clusterware. With Oracle Clusterware (10.2), vendor specific clusterware is no longer required. However, Oracle Clusterware can coexist with vendor clusterware, such as High-Availability Cluster Multi-Processing (HACMP). The integration between Oracle Clusterware and Oracle database means that Oracle Clusterware has inherent knowledge of the relationships among RAC instances, automatic storage management (ASM) instances, and listeners. It knows the sequence in which to start and stop all components.
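For example, the srvctl utility that ships with Oracle Clusterware and the database uses this knowledge to start and stop components in the correct order. The following is only a hedged sketch; the database name (rac10g), instance name, and node name are illustrative and not taken from this configuration:

# Start all RAC instances of database rac10g, honoring resource dependencies
oracle@node1:/> srvctl start database -d rac10g
# Show the node applications (VIP, GSD, ONS, and listener) on node1
oracle@node1:/> srvctl status nodeapps -n node1
# Stop a single instance of the database
oracle@node1:/> srvctl stop instance -d rac10g -i rac10g1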



Oracle Clusterware components

Figure 1-1 shows a diagram of the major functional components that are provided by Oracle Clusterware.

Figure 1-1 Oracle Clusterware functional components

The major components of Oracle Clusterware are:

� Group membership

Cluster Synchronization Service (CSS) manages the cluster configuration.

� Process Monitor Daemon

The Process Monitor Daemon (OPROCD) is locked in memory to monitor the cluster and provide I/O fencing. OPROCD performs its checks, puts itself to sleep, and if awakened beyond the expected time, OPROCD reboots the node. Thus, an OPROCD failure results in Oracle Clusterware restarting the node.

� High Availability Framework

Cluster Ready Services (CRS and RACG) manage the high availability operations within the cluster, such as start, stop, monitor, and failover operations.

� Event management

Event Management (EVM) is a background process that publishes events that are created by CRS.




� Virtual IP address

The virtual IP address (VIP) is used for application access to avoid Transmission Control Protocol (TCP) timeouts when connecting to a failed node. Strictly speaking, the VIP is not actually a component; it is simply a CRS resource that is maintained like any other.

The Oracle Clusterware requires two components from the platform: shared storage and an IP interconnect. Shared storage is required for voting disks to record node membership information and for the Oracle Cluster Registry (OCR) for cluster configuration information (repository).

Oracle Clusterware requires that each node is connected to a dedicated high speed (preferably low latency) IP network3.

We highly recommend that the interconnect is inaccessible to nodes (systems) that are not part of the cluster (not managed by Oracle Clusterware).

The Oracle Clusterware shared configuration is stored in the OCR, which can reside on a file or on a raw disk. Unlike the Oracle database pfile/spfile, there are no strict rules for OCR placement; Oracle simply has to record the location of the OCR disk or file on each cluster node.

On AIX systems, this location is stored in the /etc/oracle/ocr.loc file.
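For illustration, the file typically contains the OCR location and a flag that indicates whether the registry is shared; the device name below is an example only, not taken from this configuration:

node1:/> cat /etc/oracle/ocr.loc
ocrconfig_loc=/dev/ocr_disk1
local_only=FALSE

The ocrcheck utility (in the Clusterware bin directory) can be used to verify the integrity of the OCR and to report its version, total size, and used space.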

In Figure 1-2, the component processes are grouped, and access to the OCR and voting disks is shown for one node.

The VIP address is handled as a CRS resource, just as other resources, such as a database, an instance, a listener, and so on. It does not have a dedicated process.

Figure 1-2 Oracle Clusterware component relationship4

Oracle recommends that you configure redundant network adapters to prevent interconnect components from being a single point of failure.

(Figure 1-2 depicts the daemons grouped by component: evmd, evmd.bin, and evmlogger for event management; crsd.bin for the high availability framework; oprocd and oclsomon(*) for the process monitor; and init.cssd, ocssd, and ocssd.bin for group membership, with the high availability and group membership components accessing the OCR and the voting disk.)

3 In certain configurations, Oracle may also support InfiniBand® using the RDS (Reliable Datagram Sockets) protocol.
4 (*) The oclsomon daemon is not mentioned in the 10.2 documentation, but it is running in 10.2.0.3. According to the 11g documentation, oclsomon monitors CSS (to detect whether CSS hangs).



Here are a few examples of what happens in typical situations:

� Listener failure

When CRS detects that a registered component, such as the listener, is not responding, CRS tries to restart that component. By default, CRS attempts the restart five times.

� Interconnect failure

If interconnect is lost for one or more nodes (split brain), CSS resolves this failure through the voting disks. The surviving subcluster is the:

– Subcluster with the largest number of nodes

– Subcluster that contains the node with the lowest number

� Node malfunction

If the OPROCD process is unable to become active within the expected time, CRS reboots the node.
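You can observe this behavior with the Oracle Clusterware command line tools. A minimal sketch, assuming the Clusterware bin directory is in the PATH (output omitted):

# Verify that the CSS, CRS, and EVM daemons are healthy on the local node
node1:/> crsctl check crs
# List all registered CRS resources (VIPs, listeners, ASM and database instances) and their states
node1:/> crs_stat -t
# Print the resource profiles, including the restart count (RESTART_ATTEMPTS) used for automatic restarts
node1:/> crs_stat -p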

Oracle RAC components

Oracle RAC provides a mechanism for consistently buffering updates in multiple instances.

Single-instance Oracle databases have a one-to-one relationship between the Oracle database and the instance. RAC environments, however, have a one-to-many relationship between the database and instances; that is, in RAC environments, multiple instances have access to one database. The combined processing power of the multiple servers provides greater throughput and scalability than a single server.

RAC is the Oracle database option that provides a single system image for multiple servers to access one Oracle database. In RAC, each Oracle instance usually runs on a separate server (OS image).

You can use Oracle 10g RAC for both horizontal scaling (scale out in Oracle terms) and for high availability where client connections from a malfunctioning node are taken over by the remaining nodes in RAC.
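The one-to-many relationship is easy to see from SQL*Plus, because the GV$ views return one row per active instance. A minimal sketch (connect to any instance as a privileged user):

oracle@node1:/> sqlplus / as sysdba
SQL> SELECT inst_id, instance_name, host_name FROM gv$instance;

Each row returned corresponds to one running instance of the same database.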

RAC instances use two processes to ensure that each RAC database instance obtains the block that it needs to satisfy a query or transaction: the Global Cache Service (GCS) and the Global Enqueue Service (GES).

The GCS and GES maintain status records for each data file and each cached block using a Global Cache Directory (GCD). The GCD contents are distributed across all active instances and are part of the SGA.



Figure 1-3 shows the instance components for a three-node Oracle RAC.

Figure 1-3 Oracle RAC components

An instance is defined as the shared memory (SGA) and the associated background processes. When running in a RAC, the SGA has an additional member, the Global Cache Directory (GCD), and an additional background process for the GCS and GES services.

The GCD maintains information for each block being cached:

� Which instance has the block cached� Which GC mode each instance grants for the block

GC mode can be NULL, shared, or exclusive. A NULL mode means that another instance has this block in exclusive mode. Exclusive mode means the instance has the privilege to update the block.

The GCS and GES use the private interconnect for exchanging control messages and for actually exchanging data when performing Cache Fusions. Cache Fusion is a data block transfer on the interconnect. This type of a transfer occurs when one instance needs access to a data block that is already cached by another instance, thus avoiding physical I/O. GCS modes are cached on the blocks. If an instance needs to update a block that is already granted exclusive mode, additional interconnect traffic is not required.

(Figure 1-3 depicts n instances of the same database. Each instance consists of an SGA — data buffers, redo log buffer, shared pool, other pools, and the GCD — some of the standard background processes (dbwr, lgwr, pmon, mmon, and so on), and the RAC-specific background processes (lms, lmd, lmon, lck0).)



The basic concept for updates is that when an instance wants to update a data block, it must be granted exclusive mode on that block from the GCD, which means that at any given time, only one instance is able to update a given data block. Therefore, if the interconnect is lost, no instance can be granted exclusive mode on any block until the cluster recovers interconnect connectivity between the nodes.

However, in a multinode RAC, a scenario in which the interconnect network fails on certain nodes results in subclusters (a split brain configuration) where each subcluster considers itself the survivor. Oracle Clusterware avoids this scenario by using the voting disks.
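In Oracle 10g Release 2, you can list the voting disks that Oracle Clusterware uses for this arbitration, for example:

node1:/> crsctl query css votedisk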

1.2.2 IBM GPFS

GPFS, released in 1998 as a commercial product, initially ran in the IBM Parallel System Support Programs (PSSP) environment. The basic principle of GPFS is to provide concurrent file system access from multiple nodes without the single-server limitation observed with the Network File System (NFS).

GPFS has two major components: a GPFS daemon, which runs on all cluster nodes and provides cluster management, membership, and disk access over the network; and a kernel extension (the file system device driver), which provides file system access to applications.

GPFS provides cluster topology and membership management based on built-in heartbeat and quorum decision mechanisms. Also, at the file system level, GPFS provides concurrent and consistent access using locking mechanisms and a file system descriptor quorum.

Because GPFS is Portable Operating System Interface (POSIX) compliant, most applications work without modification; however, in certain cases, applications must be recompiled to fully benefit from the concurrency mechanisms provided by GPFS.

In addition to concurrent access, GPFS also provides availability and reliability through replication and metadata logging, as well as advanced functions, such as information life cycle management, access control lists, quota management, multi-clustering, and disaster recovery support. Caching, as well as direct I/O, is supported.

Oracle RAC uses GPFS for concurrent access to Oracle database files. For database administrators, GPFS is easy to use and manage compared to other concurrent storage mechanisms (concurrent raw devices, ASM). It provides almost the same performance level as raw devices. The basic requirement for Oracle 10g RAC is that all disks used by GPFS (and by Oracle) are directly accessible from all nodes (each node must have a host bus adapter (HBA) connected to the shared storage and access the same logical unit numbers (LUNs)).
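For example, the following GPFS commands (run as root on any node; output omitted) confirm that the cluster is defined, that the GPFS daemon is active on all nodes, and which network shared disks (NSDs) back the file systems:

# Display the GPFS cluster definition (member nodes and quorum settings)
root@node1:/> mmlscluster
# Display the GPFS daemon state on all nodes
root@node1:/> mmgetstate -a
# List the NSDs and the file systems they belong to
root@node1:/> mmlsnsd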

1.3 Configuration options

Oracle RAC is a clustering architecture that is based on shared storage (disk) architecture.

There are several options available to implement Oracle 10g RAC on advanced interactive executive (AIX) in terms of storage and data file placement. Prior to Oracle 10g, for configurations running on AIX, only two possibilities existed: GPFS file systems or raw devices on concurrent logical volume managers (CLVMs). In Oracle 10g release one, Oracle introduced its own disk management layer named Automatic Storage Management (ASM).



This chapter assesses three storage configuration scenarios and describes the advantages of each. In all three cases, the same basic hardware requirement must be fulfilled: all Oracle RAC nodes must have access to the same shared disk subsystem.

1.3.1 RAC with GPFS

GPFS is a high performance shared-disk file system that provides data access from all nodes in a cluster environment. Parallel and serial applications can access the files on the shared disk space using standard UNIX file system interfaces. The same file can be accessed concurrently from multiple nodes (single name space). GPFS is designed to provide high availability through logging and replication. It can be configured for failover from both disk and server malfunctions.

GPFS greatly simplifies the installation and administration of Oracle 10g RAC. Because it is a shared file system, all database files can be placed in one common directory, and database administrators can use the file system as a typical journaled file system (JFS)/JFS2. Allocation of new data files or resizing existing files does not require system administrator intervention. Free space on GPFS is seen as a traditional file system that is easily monitored by administrators.
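As a simple illustration (the /oradata GPFS mount point, tablespace name, and sizes are examples only, not taken from this configuration), a DBA can create an autoextending data file directly on GPFS and monitor free space with standard tools:

# Check GPFS free space like any other file system
oracle@node1:/> df -g /oradata
oracle@node1:/> sqlplus / as sysdba
SQL> CREATE TABLESPACE app_data
     DATAFILE '/oradata/rac10g/app_data01.dbf' SIZE 1G
     AUTOEXTEND ON NEXT 100M MAXSIZE 10G;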

Moreover, with GPFS we can keep a single image of Oracle binary files and share them between all cluster nodes. This single image applies both to Oracle database binaries (ORACLE_HOME) and Oracle Clusterware binary files. This approach simplifies maintenance operations, such as applying patch sets and one-off patches, and keeps all sets of log files and installation media in one common space.

For clients running Oracle Applications (eBusiness Suite) with multiple application tier nodes, it is also possible and convenient to use GPFS as shared APPL_TOP file system.

In GPFS Version 2.3, IBM introduced cluster topology services within GPFS. Thus, for the GPFS configuration, other clustering layers, such as HACMP or RSCT, are no longer required.

Advantages of running Oracle 10g RAC on a GPFS file system are:

� Simplified installation and configuration

� Possibility of using AUTOEXTEND option for Oracle datafiles, similar to JFS/JFS2 installation

� Ease of monitoring free space for data files storage

� Ease of running cold backups and restores, similar to a traditional file system

� Capability to place Oracle binaries on a shared file system, thus making the patching process easier and faster

You can locate every single Oracle 10g RAC file type (for database and clusterware products) on the GPFS, which includes:

� Clusterware binaries� Clusterware registry files� Clusterware voting files� Database binary files� Database initialization files (init.ora or spfile)� Control files� Data files� Redo log files� Archived redo log files� Flashback recovery area files



You can use the GPFS to store database backups. In that case, you can perform the restore process from any available cluster node. You can locate other non-database-related files on the same GPFS as well.

Figure 1-4 shows a diagram of Oracle RAC on GPFS architecture. All files related to Oracle Clusterware and database are located on the GPFS.

Figure 1-4 RAC on GPFS basic architecture

Although it is possible to place Oracle Clusterware configuration disk data (OCR) and voting disk (quorum) data on the GPFS, we generally do not recommend it. In case of GPFS configuration manager node failure, failover time and I/O freeze during its reconfiguration might be too long for Oracle Clusterware, and nodes might be evicted from the cluster. Figure 1-5 shows the recommended architecture with CRS devices outside the GPFS.

Figure 1-5 RAC on GPFS basic architecture

We discuss detailed information about GPFS installation and configuration in the following sections of this book.

Note: CRS config and vote devices located on the GPFS.

Note: CRS config and vote devices located on raw physical volumes.




GPFS requirements

GPFS V3.1 supports both AIX 5L and Linux nodes in a homogeneous or heterogeneous cluster. The minimum hardware requirements for GPFS on AIX 5L are IBM POWER3™ or a newer processor, 1 GB of memory, and one of the following shared disk subsystems:

- IBM TotalStorage® DS6000™ using either Subsystem Device Driver (SDD) or Subsystem Device Driver Path Control Module (SDDPCM)
- IBM TotalStorage DS8000™ using either SDD or SDDPCM
- IBM TotalStorage DS4000™ Series
- IBM TotalStorage ESS (2105-F20 or 2105-800 with SDD or AIX 5L Multi-Path I/O (MPIO) and SDDPCM)
- IBM TotalStorage Storage Area Network (SAN) Volume Controller (SVC) V1.1, V1.2, and V2.1
- IBM 7133 Serial Disk System (all disk sizes)
- Hitachi Lightning 9900 (Hitachi Dynamic Link Manager required)
- EMC Symmetrix DMX Storage Subsystems (Fibre Channel (FC) attachment only)

For a complete list of GPFS 3.1 software and hardware requirements, visit GPFS 3.1 documentation and FAQs on the following Web page:

http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfsbooks.html

Each disk subsystem requires a specific set of device drivers for proper operation while attached to a host running GPFS.

1.3.2 RAC with automatic storage management

Automatic storage management (ASM) was introduced with Oracle Database 10g. It is a storage layer between the physical devices (hdisk devices in AIX) and the database files. ASM is a management layer for raw (character) devices that simplifies their administration. ASM's major goal is to provide the database with raw device performance combined with file system ease of administration. You can use ASM in both single instance and cluster (RAC) environments. See Figure 1-6 on page 14.

Important: The previous list is not exhaustive or up-to-date for all of the versions supported. When using RAC, always check with both Oracle and the storage manufacturer for the latest support and compatibility list.

Note: For the minimum software versions and patches that are required to support Oracle products on IBM AIX, read Oracle Metalink bulletin 282036.1.

Note: CRS and ASM use raw physical volumes.


Figure 1-6 RAC on ASM basic architecture

With AIX, each LUN has a raw device file in the /dev directory, such as /dev/rhdisk0. For an ASM environment, this raw device file for a LUN is assigned to the oracle user. An ASM instance manages these device files. In a RAC cluster, one ASM instance is created per RAC node.
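As a minimal sketch of this step (the rhdisk names below are illustrative assumptions, and the required disk attributes, such as the reservation policy, depend on the disk driver in use), the candidate ASM disks are typically handed over to the oracle user on every RAC node:

# make the character devices of the candidate ASM disks accessible to the oracle user
chown oracle:dba /dev/rhdisk20 /dev/rhdisk21 /dev/rhdisk22 /dev/rhdisk23
chmod 660 /dev/rhdisk20 /dev/rhdisk21 /dev/rhdisk22 /dev/rhdisk23
# depending on the multipath driver, a "no reserve" disk attribute may also be
# required so that all RAC nodes can open the disks concurrently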

Collections of these disk devices are assigned to ASM to form ASM disk groups. For each ASM disk group, a level of redundancy is defined, which might be normal (mirrored), high (three mirrors), or external (no mirroring). When normal or high redundancy is used, disks can be organized in failure groups to ensure that data and its redundant copy do not both reside on disks that are likely to fail together.

Figure 1-7 on page 15 shows the dependencies between disk devices, failure groups, and disk groups within ASM. Within disk group ASMDG1, data is mirrored between failure groups one and two. For performance reasons, ASM implements the Stripe And Mirror Everything (SAME) strategy within a disk group, so that data is distributed across all of its disk devices.

Important: In AIX, for each hdisk, there are two devices created in /dev directory: hdisk and rhdisk. The hdisk device is a block type device, and rhdisk is a character (sequential) device. For Oracle Clusterware and database, you must use character devices:

root@austin1:/> ls -l /dev | grep hdisk10
brw-------   1 root     system       20, 11 Sep 14 19:35 hdisk10
crw-------   1 root     system       20, 11 Sep 14 19:35 rhdisk10


Figure 1-7 ASM disk groups and failure groups

Example 1-1 shows a result of the lspv command on the AIX server; hdisk2, hdisk3, hdisk4, hdisk5, and hdisk6 do not have PVID signatures and are not assigned to any volume group. They look like unused hdisks, but they might also belong to an ASM disk group.

Important: Assigning ASM used hdisks to a volume group or setting the PVID results in data corruption.

ASM does not rely on any AIX mechanism to manage disk devices. No PVID, volume group label, or hardware reservation can be assigned to an hdisk device belonging to an ASM disk group. AIX reports ASM disks as not belonging to a volume group (unused disks). This raises a serious security problem.


Example 1-1 AIX lspv command result

root@austin1:/> lspv
hdisk0          0022be2ab1cd11ac            rootvg          active
hdisk1          00cc5d5caa5832e0            None
hdisk2          none                        None
hdisk3          none                        None
hdisk4          none                        None
hdisk5          none                        None
hdisk6          none                        None
hdisk7          none                        nsd_tb1
hdisk8          none                        nsd_tb2
hdisk9          none                        nsd_tb3
hdisk10         none                        nsd01
hdisk11         none                        nsd02
hdisk12         none                        nsd03
hdisk13         none                        nsd04
hdisk14         none                        nsd05
hdisk15         none                        nsd06

The same problem exists with Oracle Clusterware disks when they reside outside of the file system or any LVM. It is not obvious if they are used by Oracle or available.
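One way to reduce this risk is to inspect the disk header before reusing a seemingly free hdisk. The following sketch uses the AIX lquerypv command; interpreting the dump is an assumption on our part (ASM member disks normally contain the string ORCLDISK near the beginning of the disk):

# dump the beginning of the disk in hex and ASCII before reusing it
lquerypv -h /dev/rhdisk2
# if the dump shows an ASM, GPFS, or Oracle Clusterware signature, do not assign
# a PVID or volume group to this disk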

The following file types can be located on ASM:

- Database spfile
- Control files
- Data files
- Redo log files
- Archived log files
- Flashback recovery area
- RMAN backups

ASM manages storage only for database files. Oracle binaries, OCR, and voting disks cannot be located on ASM disk groups. If shared binaries are desired, you must use a clustered file system, such as GPFS.

Detailed ASM installation and configuration on the AIX operating system is covered in CookBook V2 - Oracle RAC 10g Release 2 with ASM on IBM System p running AIX V5 (5.2/5.3) on SAN Storage by Oracle/IBM Joint Solutions Center at:

http://www.oracleracsig.com/

1.3.3 RAC with HACMP and CLVM

This configuration is suitable for system administrators who are familiar with AIX LVM and who prefer to store the Oracle database files on raw devices. It is one of the most popular options for deploying Oracle 9i RAC clusters. It offers very good performance, because direct access to the disks is provided, but it requires additional system administration (AIX LVM) and HACMP knowledge.

In Oracle 9i RAC, HACMP is used to provide cluster topology services and shared disk access and to maintain high availability for the interconnect network for Oracle instances.

Note: For the minimum software versions and patches that are required to support Oracle products on IBM AIX, check the Oracle Metalink bulletin 282036.1.


Oracle 10g RAC has its own layer to manage cluster interconnect, Oracle Clusterware (CRS). In this case, HACMP is only used to provide concurrent LVM functionality.

This is the major drawback of this approach: administrators have to maintain two clusterware products within the same environment, and most of HACMP's core functionality, which provides service high availability and failover, is not used at all.

HACMP provides Oracle 10g RAC with the infrastructure for concurrent access to disks. Although HACMP provides concurrent access and a disk locking mechanism, this mechanism is only used to open the files (raw devices) and for managing hardware disk reservation. Oracle database, instead, provides its own data block locking mechanism for concurrent data access, integrity, and consistency.

The volume groups are varied on (activated) concurrently on all of the nodes, under the control of RSCT, thus ensuring a short failover time if one node loses its disk or network connection. This type of concurrent access can be provided only for raw logical volumes (devices).
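As a sketch of how such raw logical volumes are typically prepared (the volume group name, logical volume name, partition size, and mklv flags below are illustrative assumptions, and bos.clvm.enh plus HACMP must already be installed):

# create an enhanced concurrent-capable volume group on a shared LUN
mkvg -C -y oradatavg -s 128 hdisk5
# create a raw logical volume (8 x 128 MB partitions = 1 GB) with a zero offset,
# suitable for an Oracle data file
mklv -y rac_system_raw -T O -w n -s n -r n oradatavg 8
# the database then references the character device /dev/rrac_system_raw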

Oracle datafiles use the raw devices located on the shared disk subsystem. In this configuration, you must define an HACMP resource group to handle the concurrent volume groups.

There are two options when using HACMP and CLVM with Oracle RAC. Oracle Clusterware devices are located on concurrent (raw) logical volumes provided by HACMP (Figure 1-8) or on separate physical disk devices or LUNs. You must start HACMP services on all nodes before Oracle Clusterware services are activated.

Figure 1-8 Oracle CRS using HACMP and CLVM

When using physical raw volumes (Figure 1-9), Oracle Clusterware and HACMP are not dependent on each other; however, both products have to be up and running before the database startup.

Note: CRS devices are located on concurrent logical volumes provided by HACMP and CLVM.

Note: CRS devices are located on raw physical volumes. CRS does not make use of any extended HACMP functionality.


Figure 1-9 Oracle CRS with HACMP and CLVM

For both sample scenarios, if HACMP is configured before Oracle, CRS uses HACMP node names and numbers.

The drawback of this configuration option stems from the fairly complex administrative tasks, such as maintaining datafiles, Oracle code, and backup and restore operations.

Detailed Oracle RAC installation and configuration on AIX operating system is covered in CookBook V1 - Oracle RAC 10g Release 2 on IBM System p running AIX V5 with SAN Storage by Oracle/IBM Joint Solutions Center, January 2006, at:

http://www.oracleracsig.com/

Note: For the minimum software versions and the patches that are required to support Oracle products on IBM AIX, read Oracle Metalink bulletin 282036.1.


Chapter 2. Basic RAC configuration with GPFS

This chapter describes the most common configuration used for running Oracle 10g RAC with IBM GPFS on IBM System p hardware. Oracle 10g RAC is a shared storage cluster architecture. All nodes in the cluster access the same physical database files. We describe a two node cluster configuration that is used for implementing Oracle 10g RAC on IBM System p platforms:

- Hardware configuration
- Operating system configuration
- GPFS configuration
- Oracle 10g CRS installation
- Oracle 10g database installation


2.1 Basic scenario

In this scenario, our goal is to create a two node Oracle RAC cluster with Storage Area Network (SAN)-attached shared storage and an EtherChannel that is used for a RAC interconnect network.

The diagram in Figure 2-1 shows the test environment that we use for this scenario.

Figure 2-1 Two node cluster configuration

2.1.1 Server hardware configuration

The hardware platform that we use for our tests (diagram in Figure 2-1) consists of the following components.

Nodes
We implement a configuration consisting of two nodes (logical partitions (LPARs)) in two IBM System p5™ p570s. Each LPAR has four processors and 16 GB of random access memory (RAM).

Networks
Each node is connected to two networks:

- We use one “private” network for the RAC interconnect and GPFS metadata traffic, configured as an Etherchannel with two Ethernet interfaces on each node. The communication protocol is IP.
- One public network (Ethernet or IP)

Storage
The storage (DS4800) connects to a SAN switch (2109-F32) via two 2 Gb Fibre Channel paths. Each node has one 2 Gb 64-bit PCI-X Fibre Channel (FC) adapter. Figure 2-1 shows an overview of the configuration that was used in our environment.

Note: In Figure 2-1, ent3 is an Etherchannel interface based on ent0 and ent1.

We implement a SAN architecture consisting of:

- One IBM 2109-F32 storage area network (SAN) switch
- One IBM TotalStorage DS4800 with 640 GB of disk space formatted as a RAID5 array

2.1.2 Operating system configuration

The operating system that we use in our test environment is AIX 5.3 Technology Level 06. In a cluster environment, we highly recommend that all the software packages are at the same level on all cluster nodes. Using different versions can result in different node behavior and unexpected operations.
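A quick way to compare the levels of the two nodes is sketched below (it assumes that remote command execution, described in 2.1.4, is already enabled, and that your AIX 5.3 level supports oslevel -s):

# compare the AIX technology level of both nodes
oslevel -s
rsh austin2 oslevel -s
# compare the complete fileset inventories
lslpp -Lc > /tmp/flst.austin1
rsh austin2 "lslpp -Lc" > /tmp/flst.austin2
diff /tmp/flst.austin1 /tmp/flst.austin2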

You must prepare the host operating system before installing and configuring Oracle Clusterware. In addition to OS prerequisites (software packages), Oracle Clusterware requires:

- Configuring network IP addresses
- Name resolution
- Enabling remote command execution
- Oracle user and group

This section presents these tasks, as well as other requirements.

Operating system requirements
Oracle Clusterware and the database require a specific minimum operating system level. Before you proceed, make sure that you have the latest requirements and compatibility information. Table 2-1 shows the additional packages that are needed to satisfy the Oracle installation prerequisites.

Table 2-1 AIX minimum requirements for Oracle 10g RAC

The following filesets are required on both AIX V5.2 ML 04 (or later) and AIX V5.3 ML 02 (or later):

- bos.adt.base
- bos.adt.lib
- bos.adt.libm
- bos.perf.libperfstat
- bos.perf.perfstat
- bos.perf.proctools
- rsct.basic.rte
- rsct.compat.clients.rte
- xlC.aix50.rte 7.0.0.4 or 8.xxx
- xlC.rte 7.0.0.1 or 8.xxx
- bos.adt.prof (see the following Tip)
- bos.cifs_fs (see the following Tip)

Tip: If the bos.adt.prof and bos.cifs_fs filesets are missing, the Oracle installation verification utility complains about them during the CRS installation. However, these filesets are not required by Oracle, and the error message can be ignored at this point. See Oracle Metalink doc ID: 340617.1 at:

http://metalink.oracle.com

Note: You need an Oracle Metalink ID to access this document.


Etherchannel for RAC interconnect
Oracle Clusterware and GPFS do not provide built-in protection for network interface failure. For high availability, we set up a link aggregation Ethernet interface (in this case, Etherchannel) for the RAC and GPFS interconnect. Etherchannel requires at least two (up to a maximum of eight) Ethernet interfaces connected to the same physical switch (which must also support the Etherchannel protocol).

For better availability, we recommend that you set up separate Etherchannel interfaces for Oracle interconnect and GPFS. However, it is possible to use the same Etherchannel interface for both Oracle interconnect and GPFS metadata traffic. For more information, refer to 2.5, “Networking considerations” on page 76.

Example 2-1 shows the list of Ethernet interfaces (ent0 and ent1) that we use to set up the Etherchannel interface.

Example 2-1 Verifying the Ethernet adapters to be used for Etherchannel

root@austin1:/> lsdev -Cc adapter
ent0      Available 03-08 2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)
ent1      Available 03-09 2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)
ent2      Available 06-08 10/100 Mbps Ethernet PCI Adapter II (1410ff01)
fcs0      Available 05-08 FC Adapter
sisscsia0 Available 04-08 PCI-X Ultra320 SCSI Adapter
vsa0      Available       LPAR Virtual Serial Adapter
root@austin1:/>

To configure an Etherchannel interface, use the SMIT fastpath: smitty etherchannel. Example 2-2 shows the SMIT panel.

Example 2-2 The smitty Etherchannel menu

EtherChannel / IEEE 802.3ad Link Aggregation

Move cursor to desired item and press Enter.

  List All EtherChannels / Link Aggregations
  Add An EtherChannel / Link Aggregation
  Change / Show Characteristics of an EtherChannel / Link Aggregation
  Remove An EtherChannel / Link Aggregation
  Force A Failover In An EtherChannel / Link Aggregation

F1=Help             F2=Refresh          F3=Cancel           F8=Image
F9=Shell            F10=Exit            Enter=Do

Example 2-3 on page 23 shows the selection of the ent0 and ent1 interfaces for the RAC and GPFS interconnect.

We decided to use the same Etherchannel interface for Oracle Clusterware, Oracle RAC, and GPFS interconnect. This configuration is possible, because GPFS does not use significant communication bandwidth.

Important: Make sure that the network interfaces’ names and numbers (for example, en2 and en3 in our case) are identical on all nodes that are part of the RAC cluster. This consistency is an Oracle RAC requirement.


Example 2-3 Selecting the Ethernet interfaces to be used for Etherchannel

EtherChannel / IEEE 802.3ad Link Aggregation

Move cursor to desired item and press Enter.

  List All EtherChannels / Link Aggregations
  Add An EtherChannel / Link Aggregation
  Change / Show Characteristics of an EtherChannel / Link Aggregation
  Remove An EtherChannel / Link Aggregation
  +--------------------------------------------------------------------------+
  ¦                       Available Network Interfaces                        ¦
  ¦                                                                           ¦
  ¦ Move cursor to desired item and press F7.                                 ¦
  ¦     ONE OR MORE items can be selected.                                    ¦
  ¦ Press Enter AFTER making all selections.                                  ¦
  ¦                                                                           ¦
  ¦ > ent0                                                                    ¦
  ¦ > ent1                                                                    ¦
  ¦   ent2                                                                    ¦
  ¦                                                                           ¦
  ¦ F1=Help                 F2=Refresh              F3=Cancel                 ¦
  ¦ F7=Select               F8=Image                F10=Exit                  ¦
F1¦ Enter=Do                /=Find                  n=Find Next               ¦
F9+--------------------------------------------------------------------------+

Example 2-4 on page 24 shows the system management interface tool (SMIT) window through which we choose the Etherchannel interface parameters. We use the round_robin load balancing mode and default values for all other fields. To see details about configuring Etherchannel, refer to Appendix A, “EtherChannel parameters on AIX” on page 255.


Example 2-4 Configuring Etherchannel parameters

Add An EtherChannel / Link Aggregation

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                                        [Entry Fields]
  EtherChannel / Link Aggregation Adapters              ent0,ent1        +
  Enable Alternate Address                              no               +
  Alternate Address                                     []               +
  Enable Gigabit Ethernet Jumbo Frames                  no               +
  Mode                                                  round_robin      +
  Hash Mode                                             default          +
  Backup Adapter                                                         +
  Automatically Recover to Main Channel                 yes              +
  Perform Lossless Failover After Ping Failure          yes              +
  Internet Address to Ping                              []
  Number of Retries                                     []               +#
  Retry Timeout (sec)                                   []               +#

F1=Help             F2=Refresh          F3=Cancel           F4=List
F5=Reset            F6=Command          F7=Edit             F8=Image
F9=Shell            F10=Exit            Enter=Do

Next, we configure the IP address for the Etherchannel interface using the SMIT fastpath smitty chinet. We select the previously created interface from the list (en3 in our case) and fill in the required fields, as shown in Example 2-5 on page 25.


Example 2-5 Configuring the IP address over an Etherchannel interface

Change / Show a Standard Ethernet Interface

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                                        [Entry Fields]
  Network Interface Name                                en3
  INTERNET ADDRESS (dotted decimal)                     [10.1.100.31]
  Network MASK (hexadecimal or dotted decimal)          [255.255.255.0]
  Current STATE                                         up               +
  Use Address Resolution Protocol (ARP)?                yes              +
  BROADCAST ADDRESS (dotted decimal)                    []
  Interface Specific Network Options
     ('NULL' will unset the option)
     rfc1323                                            []
     tcp_mssdflt                                        []
     tcp_nodelay                                        []
     tcp_recvspace                                      []
     tcp_sendspace                                      []
  Apply change to DATABASE only                         no               +

F1=Help             F2=Refresh          F3=Cancel           F4=List
F5=Reset            F6=Command          F7=Edit             F8=Image
F9=Shell            F10=Exit            Enter=Do
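The same steps can also be performed from the command line. The following is only a sketch of the equivalent commands (verify the exact mkdev attribute names for your AIX level before scripting this):

# create the EtherChannel pseudo-device over ent0 and ent1 in round_robin mode
mkdev -c adapter -s pseudo -t ibm_ech -a adapter_names=ent0,ent1 -a mode=round_robin
# assign the interconnect IP address to the resulting interface (en3 in our case)
chdev -l en3 -a netaddr=10.1.100.31 -a netmask=255.255.255.0 -a state=up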

2.1.3 Adding the user and group for Oracle

To install and use Oracle software, we must create a user and a group. We create user oracle and group dba on both nodes that are part of the cluster.

You can use the SMIT fastpaths smitty mkuser and smitty mkgroup to create the user and the group; we use the command line instead, as shown in Example 2-6.

Example 2-6 Creating user and group for Oracle

root@austin1:/> mkgroup -A id=300 dba
root@austin1:/> mkuser id=300 pgrp=dba groups=staff oracle
root@austin1:/> id oracle
uid=300(oracle) gid=300(dba) groups=1(staff)
root@austin1:/> rsh austin2 id oracle
uid=300(oracle) gid=300(dba) groups=1(staff)
root@austin1:/>

Optionally, you can create the oinstall group. This group is the Oracle inventory group. If this group exists, it owns the Oracle code files. This group is a secondary group for the oracle user (besides the staff group).

Note: The user and group IDs must be the same on both nodes.


Oracle user profile setup
We create a user profile for the oracle user and store it in oracle's home directory. Example 2-7 shows the contents of the profile. We use the /orabin directory for the Oracle code and the /oradata directory for the database files.

Example 2-7 User oracle profile

{austin1:oracle}/home/oracle -> cat .profile
export PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:$HOME/bin:/usr/bin/X11:/sbin:.
export PS1='{'$(hostname)':'$LOGIN'}$PWD -> '
set -o vi
#export ORACLE_SID=RAC1
export ORACLE_SCOPE=/orabin
export ORACLE_HOME=/orabin/crs
#export ORACLE_HOME=/orabin/ora102
export ORACLE_CRS=/orabin/crs
export ORACLE_CRS_HOME=/orabin/crs
export ORA_CRS_HOME=/orabin/crs
export LD_LIBRARY_PATH=/orabin/crs/lib:/orabin/crs/lib32
export PATH=$ORACLE_HOME/bin:$PATH
export AIXTHREAD_SCOPE=S
export NLS_LANG=american_america.we8iso8859p1
export NLS_DATE_FORMAT='YYYY-MM-DD HH24:MI:SS'
export TEMP=/tmp
export TMP=/tmp
export TMPDIR=/tmp
umask 022

Note: The AIXTHREAD_SCOPE environment variable controls whether a process runs with process-wide contention scope (the default) or with system-wide contention scope. When using system-wide contention scope, there is a one-to-one mapping between each user thread and a kernel thread.

On UNIX systems, Oracle applications are primarily multi-process and single-threaded. One of the mechanisms that enables this multi-process system to operate effectively is the AIX post/wait mechanism:

- Thread_post()
- Thread_post_many()
- Thread_wait()

This mechanism operates most efficiently with Oracle applications when using system-wide thread contention scope (AIXTHREAD_SCOPE=S). In addition, as of AIX V5.2, system-wide thread contention scope also significantly reduces the amount of memory that is required for each Oracle process. For these reasons, we recommend that you always export AIXTHREAD_SCOPE=S before starting Oracle processes.


Name resolution
The IP labels that are used for the Oracle public and private nodes must be resolved identically (regardless of the method used: flat files, DNS, or NIS) on all nodes in the cluster. For our environment, we use name resolution with flat files (/etc/hosts). We created a list that includes:

- Public IP labels (names used for public IP networks)
- Private IP labels (names used for interconnect networks)
- VIP labels
- Corresponding numerical IP addresses

This file is used for populating:

- /etc/hosts
- /etc/hosts.equiv
- The oracle and root users' ~/.rhosts files

Example 2-8 shows the sample /etc/hosts file that we use for our environment. The IP labels and addresses in this scenario are in bold characters.

Example 2-8 Sample /etc/hosts file

root@austin1:/> cat /etc/hosts
127.0.0.1         loopback localhost    # loopback (lo0) name/address

# Public network
192.168.100.31    austin1
192.168.100.32    austin2

# Oracle RAC + GPFS interconnect network
10.1.100.31       austin1_interconnect
10.1.100.32       austin2_interconnect

#Virtual IP for oracle
192.168.100.131   austin1_vip
192.168.100.132   austin2_vip

# Others servers, switches and storage
192.168.100.1     gw8810                # Linux
192.168.100.20    nim8810               # Nim server p550
192.168.100.21    p550_lpar2
192.168.100.10    switch1
192.168.100.11    switch2
192.168.100.12    switch3
192.168.100.241   2109_f32

192.168.100.251   ds4800_c1
192.168.100.252   ds4800_c2

192.168.100.231   hmc_p5
192.168.100.232   hmc_p6


Virtual IP (VIP) in Oracle Clusterware provides client connection high availability. When a cluster node fails, the VIP associated with it is automatically failed over to one of the surviving nodes using the following procedure:

1. When a public network failure occurs, the VIP mechanism detects the failure and generates a Fast Application Notification (FAN) event.

2. An ORA-3113 error (or equivalent) is returned to clients that subscribe to FAN events.

3. For subsequent connection requests, the Oracle client software parses the tnsnames.ora address list and skips the failed cluster nodes, so that client connections do not have to wait for TCP/IP timeouts (which often take 10 minutes) before the connection attempt expires. A sketch of such an address list follows.
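As an illustration of step 3, a client-side tnsnames.ora entry that lists both VIPs might look like the following sketch (the service name RAC and the listener port 1521 are assumptions; they are not taken from this configuration):

RAC =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = on)
      (FAILOVER = on)
      (ADDRESS = (PROTOCOL = TCP)(HOST = austin1_vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = austin2_vip)(PORT = 1521))
    )
    (CONNECT_DATA = (SERVICE_NAME = RAC))
  )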

2.1.4 Enabling remote command execution

Oracle Clusterware (as the root user) and RAC (as the oracle user) require remote command execution to coordinate commands between cluster nodes. You must set up remote command execution in such a way that it does not require any user intervention (no password prompts, secure shell (ssh) key acceptance, or ssh key passphrases).

You can use either ssh or the standard remote shell (rsh). If ssh is already configured, Oracle automatically uses ssh for remote execution; otherwise, rsh is used. To keep it simple, we use rsh in our test environment.

GPFS also requires remote command execution without user interaction between cluster nodes (as root). GPFS also supports using ssh or rsh. You can specify the remote command execution when creating the GPFS cluster (the mmcrcluster command).

For rsh, rcp, and rlogin, you must set up user equivalence for the oracle and root accounts. We set up the equivalence by editing the /etc/hosts.equiv file on each cluster node and the $HOME/.rhosts files in the root and oracle home directories, as shown in Example 2-9.

Example 2-9 /etc/hosts.equiv and ~/.rhosts

austin1
austin2
austin1_interconnect
austin2_interconnect
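Before continuing, it is worth verifying that both root and oracle can run remote commands on the other node without any prompt or banner; a minimal check, run as root on austin1:

# each command must return the remote date immediately, with no password prompt
rsh austin2 date
rsh austin2_interconnect date
su - oracle -c "rsh austin2 date"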

2.1.5 System configuration parameters and network options

In addition to IP and name resolution configuration, you must also configure certain system and network operational parameters.

Change the parameter Maximum number of PROCESSES allowed per user to 2048 or greater:

root@austin1:/> chdev -l sys0 -a maxuproc=2048

Important: Oracle remote command execution fails if there are any intermediate messages (including banners) during the authentication phase. For example, if you use rsh with two authentication methods (Kerberos and system) and the Kerberos authentication fails, the intermediate Kerberos failure message received by Oracle causes the remote command execution to fail, even though system authentication works correctly.


Also, Oracle recommends that you configure the user file, CPU, data, and stack limits in /etc/security/limits as shown in Example 2-10.

Example 2-10 Changing limits in /etc/security/limits

root:
        fsize = -1
        cpu = -1
        data = -1
        stack = -1

oracle:
        fsize = -1
        cpu = -1
        data = -1
        stack = -1

Table 2-2 on page 30 shows the TCP/IP stack parameters minimum recommended values for Oracle installation. For production database systems, Oracle recommends that you tune these values to optimize system performance.

Refer to your operating system documentation for more information about tuning TCP/IP parameters.

Tip: For production systems, this value must be at least 128 plus the sum of PROCESSES and PARALLEL_MAX_SERVERS initialization parameters for each database running on the system.


Table 2-2 Configuring network options

Parameter        Recommended value on all nodes
ipqmaxlen        512
rfc1323          1
sb_max           1310720
tcp_recvspace    65536
tcp_sendspace    65536
udp_recvspace    655360 (a)
udp_sendspace    65536 (b)

a. The recommended value of this parameter is 10 times the value of the udp_sendspace parameter. The value must be less than the value of the sb_max parameter.
b. This value is suitable for a default database installation. For production databases, the minimum value for this parameter is 4 KB plus the value of the database DB_BLOCK_SIZE initialization parameter multiplied by the value of the DB_MULTIBLOCK_READ_COUNT initialization parameter: (DB_BLOCK_SIZE * DB_MULTIBLOCK_READ_COUNT) + 4 KB.

Note: Certain parameters are set at the interface (en*) level (check with lsattr -El en*).
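A sketch of setting these options with the AIX no command follows. On AIX 5.3, the -p flag makes the change persistent across reboots for run-time tunables; ipqmaxlen is a load-time tunable, so it is changed with -r and takes effect at the next reboot (verify the tunable types for your AIX level before applying this):

no -r -o ipqmaxlen=512
no -p -o rfc1323=1
no -p -o sb_max=1310720
no -p -o tcp_recvspace=65536
no -p -o tcp_sendspace=65536
no -p -o udp_recvspace=655360
no -p -o udp_sendspace=65536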

2.1.6 GPFS configuration

IBM GPFS provides file system services to parallel and serial applications. GPFS allows parallel applications simultaneous access to the same files, or to different files, from any node that has the GPFS file system mounted, while maintaining a high level of control over all file system operations.

In our configuration, we use two GPFS file systems: one for the Oracle data files and the other for the Oracle binary files. For the Oracle Cluster Registry (OCR) and the CRS voting disks that are required for the Oracle Clusterware installation, we use raw devices (disks).

Installing GPFS
We use GPFS V3.1 for our test environment. We have installed the filesets and verified the packages by using the lslpp command on each node, as shown in Example 2-11.

Example 2-11 Verifying GPFS filesets installation

root@austin1:/> lslpp -l | grep gpfs
  gpfs.base       3.1.0.6   COMMITTED  GPFS File Manager
  gpfs.msg.en_US  3.1.0.5   COMMITTED  GPFS Server Messages - U.S.
  gpfs.base       3.1.0.6   COMMITTED  GPFS File Manager
  gpfs.docs.data  3.1.0.1   COMMITTED  GPFS Server Manpages and

Preparing node and disk descriptor files
We have created the files shown in Example 2-12 on page 31 for our test environment: one node descriptor file and three disk descriptor files.


Note: We chose this configuration to avoid a situation where GPFS and Oracle Clusterware interfere with each other during the node recovery process.


Example 2-12 Node and disk descriptor files for GPFS

root@austin1:/etc/gpfs_config> ls -l gpfs*
-rw-r--r--   1 root     system          128 Sep 14 19:35 gpfs_disks_orabin
-rw-r--r--   1 root     system          256 Sep 14 19:36 gpfs_disks_oradata
-rw-r--r--   1 root     system          156 Sep 14 19:36 gpfs_disks_tb
-rw-r--r--   1 root     system           72 Sep 14 18:52 gpfs_nodes

When creating the GPFS cluster, you must provide a file containing a list of node descriptors, one per line, for each node to be included in the cluster, as shown in Example 2-13. Because this is a two node configuration, both nodes are quorum and manager nodes.

Example 2-13 Sample of GPFS node file

root@austin1:/etc/gpfs_config> cat gpfs_nodes
austin1_interconnect:quorum-manager
austin2_interconnect:quorum-manager

Node roles
The node roles are:

quorum | nonquorum This designation specifies whether or not the node is included in the pool of nodes from which quorum is derived. The default is nonquorum. You must designate at least one node as a quorum node.

manager | client Indicates whether a node is part of the node pool from which configuration managers, file system managers, and the token manager can be selected. The special functions of the file system manager consume extra CPU time.

Prepare each physical disk for GPFS Network Shared Disks (NSDs)1 using the mmcrnsd command, as shown in Example 2-14 on page 32. You can create NSDs on physical disks (hdisk or vpath devices in AIX).

In our test environment, because both nodes are directly attached to the storage, we do not assign any NSD servers in the disk description files. However, you must still create the NSDs, because they are required for creating a file system (unless you are using VSDs), regardless of whether you use NSD servers.

Notes: Additional considerations:

- Configuration manager: There is only one configuration manager per cluster. The configuration manager is elected out of the pool of quorum nodes. The role of the configuration manager is to select a file system manager node out of the pool of manager nodes. It also initiates and controls node failure recovery procedures. The configuration manager also determines if the quorum rule is fulfilled.

- File system manager: There is one file system manager per mounted file system. The file system manager is responsible for file system configuration (adding disks, changing disk availability, and mounting and unmounting file systems) and managing disk space allocation, token management, and security services.

1 Network Shared Disk is a concept that represents the way that the GPFS file system device driver accesses a raw disk device regardless of whether the disk is locally attached (SAN or SCSI) or is attached to another GPFS node (via network).


Example 2-14 Sample of GPFS disk description file

root@austin1:/etc/gpfs_config> cat gpfs_disks_orabin
hdisk14:::dataAndMetadata:1:nsd05
hdisk15:::dataAndMetadata:2:nsd06

root@austin1:/etc/gpfs_config> cat gpfs_disks_oradata
hdisk10:::dataAndMetadata:1:nsd01
hdisk11:::dataAndMetadata:1:nsd02
hdisk12:::dataAndMetadata:2:nsd03
hdisk13:::dataAndMetadata:2:nsd04

The Disk descriptor file has the following format:

DiskName:PrimaryNSDServer:BackupNSDServer:DiskUsage:FailureGroup:DesiredName:StoragePool

DiskName The block device name appearing in /dev for the disk that you are defining as an NSD. If a PrimaryServer node is specified, DiskName must be the /dev name for the disk device on the primary NSD server node.

PrimaryNSDServer The name of the primary NSD server node. If this field is omitted, the disk is assumed to be SAN-attached to all nodes in the cluster. If not all nodes in the cluster have access to the disk, or if the file system that the disk belongs to is accessed by other GPFS clusters, you must specify the PrimaryServer.

BackupNSDServer The name of the backup NSD server node. If the PrimaryServer is specified and this field is omitted, it is assumed that you do not want failover in the event that the PrimaryServer fails. If the BackupServer is specified and the PrimaryServer is not specified, the command fails. The host name or IP address must refer to the communications adapter over which the GPFS daemons communicate. Alias interfaces are not allowed. Use the original address or a name that is resolved by the host command to that original address.

DiskUsage Specify a disk usage or accept the default. This field is ignored by the mmcrnsd command and is passed unchanged to the output descriptor file produced by the mmcrnsd command. Possible values are:

dataAndMetadata dataAndMetadata indicates that the disk contains both data and metadata. This is the default.

dataOnly dataOnly indicates that the disk contains data and does not contain metadata.

metadataOnly metadataOnly indicates that the disk contains metadata and does not contain data.

descOnly descOnly indicates that the disk contains no data and no metadata. This disk is used solely to keep a copy of the file system descriptor and can be used as a third failure group in certain disaster recovery configurations.

Tip: If you use many small files and the file system metadata is dynamic, separating data and metadata improves performance. However, if mostly large files are used and there is little metadata activity, separating data from metadata does not improve performance.


FailureGroup FailureGroup is a number identifying the failure group to which this disk belongs. GPFS uses this information during data and metadata placement to assure that no two replicas of the same block are written in such a way as to become unavailable due to a single failure.

DesiredName Specify a name for the NSD.

StoragePool StoragePool specifies the name of the storage pool to which the NSD is assigned (if desired). If this name is not provided, the default is system. Only the system pool can contain metadataOnly, dataAndMetadata, or descOnly disks.

Example 2-15 shows the disk description file for tiebreaker disks.

Example 2-15 Sample disk description file used for creating tiebreaker NSDs

root@austin1:/etc/gpfs_config> cat gpfs_disks_tb
hdisk7:::::nsd_tb1
hdisk8:::::nsd_tb2
hdisk9:::::nsd_tb3

Cluster quorum
GPFS cluster quorum must be maintained for the GPFS file systems to remain available. If the quorum semantics are broken, GPFS performs recovery in an attempt to achieve quorum again. GPFS can use one of two methods for determining quorum:

- Node quorum
- Node quorum with tiebreaker disks

Table 2-3 explains the difference between node quorum and node quorum with tiebreaker disks.

Table 2-3 Difference between node quorum and node quorum with tiebreaker disks

Tip: We recommend that you define DesiredName, because you can use meaningful names that make system administration easier (for example, in Example 2-21 nsd_tb1 is clearly a tiebreaker disk because of the “tb” in its name). If a desired name is not specified, the NSD is assigned a name according to the convention gpfsNNnsd, where NN is a unique nonnegative integer (for example, gpfs01nsd, gpfs02nsd, and so on).

Note: The disk descriptor file shown in Example 2-15 does not specify the DiskUsage, because we use these NSDs for the cluster quorum (as tiebreaker disks), and they will not be part of any file system.

Node quorum:
- Quorum is defined as one plus half of the explicitly defined quorum nodes in the GPFS cluster.
- There are no default quorum nodes; you must specify which nodes have this role.
- GPFS does not limit the number of quorum nodes.

Node quorum with tiebreaker disks (see the following tip):
- There is a maximum of eight quorum nodes.
- You must include the primary and secondary cluster configuration servers as quorum nodes.
- You can have an unlimited number of non-quorum nodes.


Configuring GPFS
Before you start setting up a GPFS cluster, verify that remote command execution is working properly between all GPFS nodes via the interfaces that are used for GPFS metadata traffic (austin1_interconnect and austin2_interconnect).

Creating the cluster
To create the GPFS cluster, we use the mmcrcluster command shown in Example 2-16. We use rsh/rcp as the remote command/copy programs (the default option), so we do not need to specify these parameters.

Example 2-16 Creating a GPFS cluster

root@austin1:/etc/gpfs_config> mmcrcluster -N gpfs_nodes -p austin1_interconnect \
> -s austin2_interconnect -C austin_cluster -A
Wed Sep 12 11:50:20 CDT 2007: 6027-1664 mmcrcluster: Processing node austin1_interconnect
Wed Sep 12 11:50:22 CDT 2007: 6027-1664 mmcrcluster: Processing node austin2_interconnect
mmcrcluster: Command successfully completed
mmcrcluster: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

The following list gives a short explanation for the mmcrcluster command parameters shown in Example 2-16.

-N NodeFile NodeFile specifies the file containing the list of node descriptors (see Example 2-13 on page 31), one per line, to be included in the GPFS cluster.

-p PrimaryServer PrimaryServer specifies the primary GPFS cluster configuration server node used to store the GPFS configuration data.

-s SecondaryServer SecondaryServer specifies the secondary GPFS cluster configuration server node used to store the GPFS cluster data. We suggest that you specify a secondary GPFS cluster configuration server to prevent the loss of configuration data in the event that your primary GPFS cluster configuration server goes down. When the GPFS daemon starts up, at least one of the two GPFS cluster configuration servers must be accessible.

Tip: You can use tiebreaker disks with GPFS 2.3 or later. With node (only) quorum as a quorum method, you must have at least three nodes to maintain quorum, because quorum is defined as one plus half of the explicitly defined quorum nodes. Otherwise, with two quorum nodes only, if one quorum node goes down, a GPFS cluster will go into “arbitrating” state due to insufficient quorum nodes, rendering all file systems in the cluster unavailable.

Most Oracle RAC with GPFS configurations are two node clusters; therefore, you must set up node quorum with tiebreaker disks. A GPFS cluster can survive and maintain file systems available with one quorum node and one available tiebreaker disk in this configuration. You can have one, two, or three tiebreaker disks. However, we recommend that you use an odd number of tiebreaker disks (three).


-C ClusterName Clustername specifies a name for the cluster. If the user-provided name contains dots, it is assumed to be a fully qualified domain name. Otherwise, to make the cluster name unique, the domain of the primary configuration server will be appended to the user-provided name. If the -C flag is omitted, the cluster name defaults to the name of the primary GPFS cluster configuration server.

-A This parameter specifies that GPFS daemons automatically start when nodes come up. The default is not to start daemons automatically.

To check the current configuration information for the GPFS cluster, use the mmlscluster command (see Example 2-17).

Example 2-17 Checking current GPFS configuration

root@austin1:/etc/gpfs_config> mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         austin_cluster.austin1_interconnect
  GPFS cluster id:           720967500852369612
  GPFS UID domain:           austin_cluster.austin1_interconnect
  Remote shell command:      /usr/bin/rsh
  Remote file copy command:  /usr/bin/rcp

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    austin1_interconnect
  Secondary server:  austin2_interconnect

 Node  Daemon node name      IP address   Admin node name       Designation
-----------------------------------------------------------------------
   1   austin1_interconnect  10.1.100.31  austin1_interconnect  quorum-manager
   2   austin2_interconnect  10.1.100.32  austin2_interconnect  quorum-manager

Creating the NSDs
Example 2-18 shows how to create the network shared disks (NSDs) using the previously created disk descriptor files (see Example 2-14 on page 32 and Example 2-15 on page 33). We use these NSDs for creating file systems or for configuring tiebreaker disks.

Example 2-18 Creating NSDs

root@austin1:/etc/gpfs_config> mmcrnsd -F gpfs_disks_orabin
mmcrnsd: Processing disk hdisk14
mmcrnsd: Processing disk hdisk15
mmcrnsd: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

root@austin1:/etc/gpfs_config> mmcrnsd -F gpfs_disks_oradata
mmcrnsd: Processing disk hdisk10
mmcrnsd: Processing disk hdisk11
mmcrnsd: Processing disk hdisk12


mmcrnsd: Processing disk hdisk13
mmcrnsd: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

root@austin1:/etc/gpfs_config> mmcrnsd -F gpfs_disks_tb
mmcrnsd: Processing disk hdisk7
mmcrnsd: Processing disk hdisk8
mmcrnsd: Processing disk hdisk9
mmcrnsd: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

Use the mmlsnsd command to display the current NSD information, as shown in Example 2-19.

Example 2-19 Checking current NSDs

root@austin1:/etc/gpfs_config> mmlsnsd

File system   Disk name    Primary node             Backup node
---------------------------------------------------------------------------
 (free disk)  nsd01        (directly attached)
 (free disk)  nsd02        (directly attached)
 (free disk)  nsd03        (directly attached)
 (free disk)  nsd04        (directly attached)
 (free disk)  nsd05        (directly attached)
 (free disk)  nsd06        (directly attached)
 (free disk)  nsd_tb1      (directly attached)
 (free disk)  nsd_tb2      (directly attached)
 (free disk)  nsd_tb3      (directly attached)

Upon successful completion of the mmcrnsd command, the disk descriptor files are rewritten to contain the created NSD names in place of the device name, as shown in Example 2-20. This is done to prepare the disk descriptor files for subsequent usage for creating GPFS file systems (mmcrfs or mmadddisk commands).

Example 2-20 Disk descriptor files after creating NSDs

root@austin1:/etc/gpfs_config> cat gpfs_disks_orabin
# hdisk14:::dataAndMetadata:1:nsd05
nsd05:::dataAndMetadata:1::
# hdisk15:::dataAndMetadata:2:nsd06
nsd06:::dataAndMetadata:2::

root@austin1:/etc/gpfs_config> cat gpfs_disks_oradata
# hdisk10:::dataAndMetadata:1:nsd01
nsd01:::dataAndMetadata:1::
# hdisk11:::dataAndMetadata:1:nsd02
nsd02:::dataAndMetadata:1::
# hdisk12:::dataAndMetadata:2:nsd03
nsd03:::dataAndMetadata:2::
# hdisk13:::dataAndMetadata:2:nsd04
nsd04:::dataAndMetadata:2::

Now all of the NSDs are defined, and you can see the mapping of physical disks to GPFS NSDs using the command shown in Example 2-21 on page 37.


Example 2-21 Mapping physical disks to GPFS NSDs

root@austin1:/> mmlsnsd -a -m

Disk name   NSD volume ID      Device         Node name              Remarks
--------------------------------------------------------------------------------
 nsd01      C0A8641F46EB28F7   /dev/hdisk10   austin1_interconnect   directly attached
 nsd02      C0A8641F46EB28F8   /dev/hdisk11   austin1_interconnect   directly attached
 nsd03      C0A8641F46EB28F9   /dev/hdisk12   austin1_interconnect   directly attached
 nsd04      C0A8641F46EB28FA   /dev/hdisk13   austin1_interconnect   directly attached
 nsd05      C0A8641F46EB28E2   /dev/hdisk14   austin1_interconnect   directly attached
 nsd06      C0A8641F46EB28E3   /dev/hdisk15   austin1_interconnect   directly attached
 nsd_tb1    C0A8641F46EB2906   /dev/hdisk7    austin1_interconnect   directly attached
 nsd_tb2    C0A8641F46EB2907   /dev/hdisk8    austin1_interconnect   directly attached
 nsd_tb3    C0A8641F46EB2908   /dev/hdisk9    austin1_interconnect   directly attached

root@austin1:/> lspv
hdisk0          0022be2ab1cd11ac            rootvg          active
hdisk1          00cc5d5caa5832e0            None
hdisk2          none                        None
hdisk3          none                        None
hdisk4          none                        None
hdisk5          none                        None
hdisk6          none                        None
hdisk7          none                        nsd_tb1
hdisk8          none                        nsd_tb2
hdisk9          none                        nsd_tb3
hdisk10         none                        nsd01
hdisk11         none                        nsd02
hdisk12         none                        nsd03
hdisk13         none                        nsd04
hdisk14         none                        nsd05
hdisk15         none                        nsd06

Changing cluster quorum (adding tiebreaker disks)
Before you start the GPFS cluster, you must configure the tiebreaker disks. If you want to change the cluster quorum configuration later, you must stop the GPFS daemon on all nodes; thus, it is better to change the cluster quorum at this point.

Configure the tiebreakerDisks attribute by using the mmchconfig command, as shown in Example 2-22 on page 38. Then, run the mmlsconfig command to verify that the tiebreaker disks appear in the configuration.


Example 2-22 Configuring tiebreakerDisks

root@austin1:/etc/gpfs_config> mmchconfig tiebreakerDisks="nsd_tb1;nsd_tb2;nsd_tb3"
Verifying GPFS is stopped on all nodes ...
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

root@austin1:/etc/gpfs_config> mmlsconfig
Configuration data for cluster austin_cluster.austin1_interconnect:
-------------------------------------------------------------------
clusterName austin_cluster.austin1_interconnect
clusterId 720967500852369612
clusterType lc
autoload yes
useDiskLease yes
maxFeatureLevelAllowed 906
tiebreakerDisks nsd_tb1;nsd_tb2;nsd_tb3
[austin1_interconnect]
takeOverSdrServ yes

File systems in cluster austin_cluster.austin1_interconnect:
------------------------------------------------------------
(none)

Creating GPFS file systems
In order to create a file system, the GPFS cluster must be up and running. Start the GPFS cluster with mmstartup -a, as shown in Example 2-23. This command starts GPFS on all nodes in the cluster. To start GPFS only on the local node, run mmstartup without the -a option.

Example 2-23 Starting up GPFS cluster

root@austin1:/etc/gpfs_config> mmstartup -a
Wed Sep 12 11:57:24 CDT 2007: 6027-1642 mmstartup: Starting GPFS ...
root@austin1:/etc/gpfs_config>
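Before creating the file systems, you can confirm that the GPFS daemon has reached the active state on both nodes; a quick check with the mmgetstate command (available in GPFS 3.1):

# both nodes should report their GPFS state as "active"
mmgetstate -a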

To create the file systems, run the mmcrfs command. We created two file systems: /orabin for the Oracle binaries (Example 2-24 on page 39) and /oradata for the Oracle database files (Example 2-25 on page 39).


Example 2-24 Creating the /orabin file system

root@austin1:/etc/gpfs_config> mmcrfs /orabin orabin -F gpfs_disks_orabin -A yes \
> -B 512k -M2 -m2 -R2 -r2 -n 4 -N 50000

GPFS: 6027-531 The following disks of orabin will be formatted on node austin1:
    nsd05: size 10485760 KB
    nsd06: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 25 GB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/orabin.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

Example 2-25 Creating the /oradata file system

root@austin1:/etc/gpfs_config> mmcrfs /oradata oradata -F gpfs_disks_oradata -A yes \
> -B 512k -M2 -m2 -R2 -r2 -n 4

GPFS: 6027-531 The following disks of oradata will be formatted on node austin2:
    nsd01: size 10485760 KB
    nsd02: size 10485760 KB
    nsd03: size 10485760 KB
    nsd04: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 51 GB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/oradata.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

The following list gives a short explanation for the mmcrfs command parameters shown in Example 2-25:

/oradata This parameter is the mount point directory of the GPFS file system.

oradata This parameter is the name of the file system to be created, as it will appear in the /dev directory. File system names do not need to be fully qualified; oradata is as acceptable as /dev/oradata. However, file system names must be unique within a GPFS cluster. Do not specify an existing entry in /dev.

Tip: We recommend that you have 50000 inodes for a file system that is used for Oracle binaries if you plan to install Oracle Clusterware and the database in this file system. The mmcrfs -N option is for the maximum number of files in the file system. This value defaults to the size of the file system divided by 1M. Therefore, we intentionally used mmcrfs -N 50000 for the /orabin file system.


-F Disk description file This parameter specifies a file containing a list of disk descriptors, one per line. You can use the rewritten DiskDesc file created by the mmcrnsd command.

-A yes This parameter indicates the file system mounts automatically when the GPFS daemon starts (this is the default). Other options are: no - manual mount, and automount - when the file system is first accessed.

-B BlockSize This parameter is the size of data blocks. This parameter must be 16 KB, 64 KB, 256 KB (the default), 512 KB, or 1024 KB (1 MB is also acceptable). Specify this value with the character K or M, for example, 512K.

-M MaxMetadataReplicas This parameter is the default maximum number of copies of inodes, directories, and indirect blocks for a file. Valid values are 1 and 2 but cannot be less than DefaultMetadataReplicas. The default is 1.

-m DefaultMetadataReplicas This parameter is the default number of copies of inodes, directories, and indirect blocks for a file. Valid values are 1 and 2 but cannot be greater than the value of MaxMetadataReplicas. The default is 1.

-R MaxDataReplicas This parameter is the default maximum number of copies of data blocks for a file. Valid values are 1 and 2 but cannot be less than DefaultDataReplicas. The default is 1.

-r DefaultDataReplicas This parameter is the default number of copies of each data block for a file. Valid values are 1 and 2 but cannot be greater than MaxDataReplicas. The default is 1.

-n NumNodes This parameter is the estimated number of nodes that mounts the file system. This value is used as a best estimate for the initial size of several file system data structures. The default is 32. When you create a GPFS file system, you might want to overestimate the number of nodes that mount the file system. GPFS uses this information for creating data structures that are essential for achieving maximum parallelism in file system operations. Although a large estimate consumes additional memory, underestimating the data structure allocation can reduce the efficiency of a node when it processes parallel requests, such as the allotment of disk space to a file. If you cannot predict the number of nodes that mounts the file system, apply the default value. If you are planning to add nodes to your system, specify a number larger than the default. However, do not make estimates that are unrealistic. Specifying an excessive number of nodes can have an adverse effect on buffer operations.

Tip: In an Oracle with GPFS environment, we generally recommend a GPFS block size of 512 KB. Using 256 KB block size is recommended when there is significant file activity other than Oracle, or there are many small files not belonging to the database. A block size of 1 MB is recommended for file systems of 100 TB or larger. See Oracle Metalink doc ID: 302806.1 at:

http://metalink.oracle.com

Note: You need an Oracle Metalink ID to access this note.

-N NumInodes This parameter is the maximum number of files in the file system. This value defaults to the size of the file system at creation, divided by 1 M, and can be specified with a suffix, for example 8 K or 2 M. This value is also constrained by the formula:

maximum number of files = (total file system space/2) / (inode size + subblock size)
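As a rough sanity check of this constraint, using the values that mmlsfs reports for our /orabin file system later in Example 2-26 (a 512-byte inode, a 16 KB subblock, and roughly 20 GB of total space), a quick shell calculation gives the upper bound on the number of files:

# (total file system space in bytes / 2) / (inode size + subblock size)
echo $(( (20971520 * 1024 / 2) / (512 + 16384) ))      # prints 635500

So the configured maximum of 51200 inodes is well within that limit.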

-v {yes | no} Verify that specified disks do not belong to an existing file system. The default is -v yes. Specify -v no only when you want to reuse disks that are no longer needed for an existing file system. If the command is interrupted for any reason, you must use the -v no option on the next invocation of the command.
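Putting these options together, a file system creation command similar to the one we used might look like the following sketch (the disk descriptor file name is illustrative; Example 2-25 shows the exact command from our environment):

mmcrfs /oradata oradata -F /etc/gpfs_config/oradata_desc -A yes -B 512K -m 2 -M 2 -r 2 -R 2 -n 4 -v yes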

Example 2-26 shows the file system information. You can see block size, maximum number of inodes, number of replicas, and so on.

Example 2-26 Checking the file system information

root@austin1:/etc/gpfs_config> mmlsfs orabin
flag value          description
---- -------------- -----------------------------------------------------
 -s  roundRobin     Stripe method
 -f  16384          Minimum fragment size in bytes
 -i  512            Inode size in bytes
 -I  16384          Indirect block size in bytes
 -m  2              Default number of metadata replicas
 -M  2              Maximum number of metadata replicas
 -r  2              Default number of data replicas
 -R  2              Maximum number of data replicas
 -j  cluster        Block allocation type
 -D  posix          File locking semantics in effect
 -k  posix          ACL semantics in effect
 -a  -1             Estimated average file size
 -n  4              Estimated number of nodes that will mount file system
 -B  524288         Block size
 -Q  none           Quotas enforced
     none           Default quotas enabled
 -F  51200          Maximum number of inodes
 -V  9.03           File system version. Highest supported version: 9.03
 -u  yes            Support for large LUNs?
 -z  no             Is DMAPI enabled?
 -E  yes            Exact mtime mount option
 -S  no             Suppress atime mount option
 -K  whenpossible   Strict replica allocation option
 -P  system         Disk storage pools in file system
 -d  nsd05;nsd06    Disks in file system
 -A  yes            Automatic mount option
 -o  none           Additional mount options
 -T  /orabin        Default mount point

Tip: This -n NumNodes value cannot be changed after the file system has been created.

Tip: For file systems that will perform parallel file creates, if the total number of free inodes is not greater than 5% of the total number of inodes, there is the potential for a slowdown in file system access. Take this into consideration when changing your file system.
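If a file system later approaches this 5% free-inode threshold, the maximum number of inodes can be raised with mmchfs; a short sketch (the new value is illustrative):

mmdf orabin                   # the Inode Information section shows used, free, and maximum inodes
mmchfs orabin -F 100000       # raise the maximum number of inodes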

Mounting the file systems
Mount GPFS using the mmmount command as shown in Example 2-27. Starting with GPFS V3.1, two new commands, mmmount and mmumount, are shipped. These commands can be used to mount and to unmount GPFS on multiple nodes without using the OS mount and umount commands.

Example 2-27 Mounting all GPFS file systems on both nodes

root@austin1:/etc/gpfs_config> mmmount all -a
Wed Sep 12 12:17:08 CDT 2007: 6027-1623 mmmount: Mounting file systems ...
root@austin1:/etc/gpfs_config>
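The corresponding mmumount command works the same way; for example, to unmount all GPFS file systems on all nodes, a sketch (not run in our scenario) is:

mmumount all -a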

You can check the mounted file systems using the mmlsmount (GPFS) and mount (system) commands, as shown in Example 2-28 on page 43.

Example 2-28 Displaying mounted GPFS file systems

root@austin1:/> mmlsmount all_local -L

File system orabin is mounted on 2 nodes:
  10.1.100.31     austin1_interconnect
  10.1.100.32     austin2_interconnect

File system oradata is mounted on 2 nodes:
  10.1.100.31     austin1_interconnect
  10.1.100.32     austin2_interconnect

root@austin1:/> mount
  node   mounted           mounted over    vfs    date         options
-------- ----------------- --------------- ------ ------------ ---------------
         /dev/hd4          /               jfs2   Oct 03 11:10 rw,log=/dev/hd8
         /dev/hd2          /usr            jfs2   Oct 03 11:10 rw,log=/dev/hd8
         /dev/hd9var       /var            jfs2   Oct 03 11:10 rw,log=/dev/hd8
         /dev/hd3          /tmp            jfs2   Oct 03 11:10 rw,log=/dev/hd8
         /dev/hd1          /home           jfs2   Oct 03 11:11 rw,log=/dev/hd8
         /proc             /proc           procfs Oct 03 11:11 rw
         /dev/hd10opt      /opt            jfs2   Oct 03 11:11 rw,log=/dev/hd8
         /dev/fslv00       /oracle         jfs2   Oct 03 11:11 rw,log=/dev/hd8
         /dev/orabin       /orabin         mmfs   Oct 03 11:13 rw,mtime,atime,dev=orabin
         /dev/oradata      /oradata        mmfs   Oct 03 11:13 rw,mtime,atime,dev=oradata

To check the available space in a GPFS file system, use the mmdf command, as shown in Example 2-29. The system df command can display inaccurate information about GPFS file systems; thus, we recommend using the mmdf command. This command displays information, such as free blocks, that is presented by failure group and storage pool.

Example 2-29 Checking the /orabin file system (mmdf)

root@austin1:/etc/gpfs_config> mmdf orabin
disk                disk size  failure holds    holds              free KB             free KB
name                    in KB    group metadata data        in full blocks        in fragments
----------------  ----------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system
nsd05                10485760        1 yes      yes        10446848 (100%)          688 ( 0%)
nsd06                10485760        2 yes      yes        10446848 (100%)          688 ( 0%)
                  -----------                        -------------------- -------------------
(pool total)         20971520                             20893696 (100%)         1376 ( 0%)

                  ===========                        ==================== ===================
(total)              20971520                             20893696 (100%)         1376 ( 0%)

Inode Information
-----------------
Number of used inodes:            4010
Number of free inodes:           47190
Number of allocated inodes:      51200
Maximum number of inodes:        51200

2.1.7 Special consideration for GPFS with Oracle

Using GPFS with Oracle requires special considerations:

� RAID consideration: If using RAID devices, configure a single LUN for each RAID device. Do not create LUNs across RAID devices for use by GPFS, because creating LUNs across RAID devices ultimately results in significant performance degradation. GPFS stripes data and metadata across multiple LUNs (RAIDs) using its own optimized method.

� Block size: For file systems holding large Oracle databases, set the GPFS file system block size through the mmcrfs command using the -B option:

– We generally suggest 512 KB.

– We suggest 256 KB if there is activity other than Oracle using the same file system and many small files exist, which are not in the database (files that belong to Oracle Applications, for example).

– We suggest 1 MB for file systems 100 TB or larger.

– The large block size makes the allocation of space for the databases manageable and does not affect performance when Oracle is using the Asynchronous I/O (AIO) and Direct I/O (DIO) features of AIX.

– Set the Oracle database block size equal to the LUN segment size or a multiple of the LUN segment size.

� Set the GPFS worker threads through the mmchconfig worker1Threads parameter to allow the maximum parallelism of the Oracle AIO threads:

– On a 64-bit AIX kernel, the setting can be as large as 548. The GPFS prefetch threads must be adjusted accordingly through the mmchconfig prefetchThreads parameter.

Tip: Difference between blocks and fragments in GPFS:

� Block: The block is the largest amount of data that can be accessed in a single I/O operation.

� GPFS divides each block into 32 subblocks.

� Subblock: The subblock is the smallest unit of disk space that can be allocated.

� Fragments: Fragments consist of one or more subblocks.

� Files smaller than one block size are stored in fragments.

� Large files are stored in a number of full blocks plus zero or more subblocks to hold the data at the end of the file.

� For a block size of 256 KB, GPFS reads as much as 256 KB of data in a single I/O operation, and small files can occupy as little as 8 KB (256 K/32) of disk space.

– When requiring GPFS sequential I/O, set the prefetch threads between 50 and 100 (the default is 64), and set the worker threads to have the remainder. However, remember the following formula:

• prefetchThreads <= 548
• worker1Threads <= 548
• prefetchThreads + worker1Threads <= 550 (in the 64-bit environment)

� The number of AIX AIO kprocs to create is approximately the same as the GPFS worker1Threads setting:

– The AIX AIO maxservers setting is the number of kprocs PER CPU. We suggest setting this value slightly larger than worker1Threads divided by the number of CPUs.

– Set the Oracle read-ahead value to prefetch one or two full GPFS blocks. For example, if your GPFS block size is 512 KB and the Oracle block size is 16 KB, set the read-ahead to either 32 or 64 blocks.

� Do not use the dio option on the mount command, because using the dio option forces DIO when accessing all files. Oracle automatically uses DIO to open database files on GPFS.

� When running Oracle RAC 10g R1, we suggest that you increase the value for OPROCD_DEFAULT_MARGIN to at least 500 to avoid possible random reboots of nodes.

From a GPFS perspective, even 500 milliseconds might be too low in situations where node failover can take one or two minutes to resolve. However, if during a node failure the surviving node is already performing direct I/O to the oprocd control file, the surviving node has the necessary tokens and indirect block cached and therefore does not have to wait during failover.

� Oracle databases requiring high performance usually benefit from running with a pinned Oracle SGA, which is also true when running with GPFS, because GPFS uses DIO, which requires that the user I/O buffers (in the SGA) are pinned. GPFS normally pins the I/O

Tip:

prefetchThreads is for large sequential file I/O, whereas worker1Threads is for random, small file I/O.

prefetchThreads controls the maximum possible number of threads dedicated to prefetching data for files that are read sequentially or to handle sequential write-behind (default: 72).

worker1Threads is primarily used for random read or write requests that cannot be prefetched, random I/O requests, or small file activity. worker1Threads controls the maximum number of concurrent file operations at any one instant. If there are more requests than that, the excess will wait until a previous request has finished (default: 48, maximum: 548).

These changes through the mmchconfig command take effect upon restart of the GPFS daemon.
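As an illustration of these recommendations, the following sketch keeps the 64-bit sum rule and sizes the AIX AIO servers accordingly (the thread counts and the 8-CPU assumption are illustrative, not the values from our test environment; on AIX 5L the legacy AIO attributes can typically be changed with chdev or through smit aio):

mmchconfig worker1Threads=450,prefetchThreads=100     # 450 + 100 <= 550; restart the GPFS daemon to activate
chdev -l aio0 -a maxservers=60 -P                     # per-CPU value, slightly above 450 / 8 CPUs; effective after reboot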

Note: The Oracle Clusterware I/O fencing daemon has its margin defined in two places in the /etc/init.cssd, and the values are 500 and 100 respectively. Because it is defined twice in the same file, the latter value of 100 is used; thus, we recommend that you remove the second (100) value.

buffers on behalf of the application, but if Oracle has already pinned the SGA, GPFS recognizes this has been done and does not duplicate the pinning, which saves additional system resources.

Pinning the SGA on AIX 5L requires the following three steps:

a. /usr/sbin/vmo -r -o v_pinshm=1

b. /usr/sbin/vmo -r -o maxpin%=percent_of_real_memory

Where percent_of_real_memory = ((size of SGA / size of physical memory) *100) + 3

c. Set LOCK_SGA parameter to TRUE in the init.ora file.
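For example, a minimal sketch for an LPAR with 32 GB of memory and an 8 GB SGA (the sizes are illustrative, not from our configuration): maxpin% = ((8 / 32) * 100) + 3 = 28.

/usr/sbin/vmo -r -o v_pinshm=1
/usr/sbin/vmo -r -o maxpin%=28
# then set LOCK_SGA=TRUE in the init.ora file (or spfile) and restart the instances;
# the vmo -r changes take effect at the next reboot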

2.2 Oracle 10g Clusterware installation

Before installing the Oracle code and creating the database, change the ownership and permissions of the GPFS file systems to the oracle user and dba group, as shown in Example 2-30.

Example 2-30 Changing ownership and permission of GPFS file systems

root@austin1:/> ls -ld /ora*
drwxr-xr-x   4 oracle   dba          16384 Sep 16 00:49 /orabin
drwxr-xr-x   4 oracle   dba          16384 Sep 16 01:33 /oradata

We use raw disks for the OCR and voting disks for Oracle Clusterware. During the Oracle Clusterware installation, you are prompted to provide two OCR disks and three CRS voting (vote) disks. Even though it is possible to install Oracle Clusterware with one OCR disk and one voting (vote) disk, we encourage you to have multiple OCR disks and voting (vote) disks for availability. In Example 2-31, we select rhdisk2 and rhdisk3 as OCR disks. We use rhdisk4, rhdisk5, and rhdisk6 as voting (vote) disks.

Example 2-31 Selecting raw physical disks for OCR and CRS voting (vote) disks

root@austin1:/> ls -l /dev/rhdisk*
crw-------   1 root     system      20,  3 Sep 14 17:45 /dev/rhdisk2
crw-------   1 root     system      20,  4 Sep 14 17:45 /dev/rhdisk3
crw-------   1 root     system      20,  5 Sep 14 17:45 /dev/rhdisk4
crw-------   1 root     system      20,  6 Sep 14 17:45 /dev/rhdisk5
crw-------   1 root     system      20,  7 Sep 14 17:45 /dev/rhdisk6

Creating special device files for OCR and CRS voting disks
We create special files (using the mknod command) for OCR and voting (vote) disks, as shown in Example 2-32 on page 47. Then, change ownership and permission for those files. You must run these commands on both nodes:

mknod SpecialFileName { b | c } Major# Minor#
where b indicates the special file is a block-oriented device, and c indicates the special file is a character-oriented device.

Note: See Oracle Metalink doc ID: 302806.1 at:

http://metalink.oracle.com

You need an Oracle Metalink ID to access this note.

Example 2-32 Creating special files for ocr and vote disks

root@austin1:/> mknod /dev/ocrdisk1 c 20 3
root@austin1:/> mknod /dev/ocrdisk2 c 20 4

root@austin1:/> mknod /dev/votedisk1 c 20 5
root@austin1:/> mknod /dev/votedisk2 c 20 6
root@austin1:/> mknod /dev/votedisk3 c 20 7

root@austin1:/> chown oracle.dba /dev/ocr*
root@austin1:/> chown oracle.dba /dev/vote*

root@austin1:/> chmod 660 /dev/ocr*
root@austin1:/> chmod 660 /dev/vote*

root@austin2:/> mknod /dev/ocrdisk1 c 36 3
root@austin2:/> mknod /dev/ocrdisk2 c 36 4

root@austin2:/> mknod /dev/votedisk1 c 36 5
root@austin2:/> mknod /dev/votedisk2 c 36 6
root@austin2:/> mknod /dev/votedisk3 c 36 7

root@austin2:/> chown oracle.dba /dev/ocr*
root@austin2:/> chown oracle.dba /dev/vote*

root@austin2:/> chmod 660 /dev/ocr*
root@austin2:/> chmod 660 /dev/vote*

Verify and change the reservation_policy to no_reserve on the disks (rhdisk2, rhdisk3, rhdisk4, rhdisk5, and rhdisk6 on both nodes) that are used for OCR and CRS voting (vote) disks as shown in Example 2-33. Run these commands on both nodes.

Example 2-33 Verifying and changing reservation policy

root@austin1:/> lsattr -El hdisk2
PR_key_value    none                             Persistant Reserve Key Value           True
cache_method    fast_write                       Write Caching method                   False
ieee_volname    600A0B800011A6620000007C0002944E IEEE Unique volume name                False
lun_id          0x0000000000000000               Logical Unit Number                    False
max_transfer    0x100000                         Maximum TRANSFER Size                  True
prefetch_mult   1                                Multiple of blocks to prefetch on read False
pvid            none                             Physical volume identifier             False
q_type          simple                           Queuing Type                           False
queue_depth     10                               Queue Depth                            True
raid_level      5                                RAID Level                             False
reassign_to     120                              Reassign Timeout value                 True
reserve_policy  single_path                      Reserve Policy                         True
rw_timeout      30                               Read/Write Timeout value               True
scsi_id         0x661600                         SCSI ID                                False
size            128                              Size in Mbytes                         False
write_cache     yes                              Write Caching enabled                  False

root@austin1:/> chdev -l hdisk2 -a reserve_policy=no_reserve
hdisk2 changed
root@austin1:/> lsattr -El hdisk2
PR_key_value    none                             Persistant Reserve Key Value           True
cache_method    fast_write                       Write Caching method                   False
ieee_volname    600A0B800011A6620000007C0002944E IEEE Unique volume name                False
lun_id          0x0000000000000000               Logical Unit Number                    False
max_transfer    0x100000                         Maximum TRANSFER Size                  True
prefetch_mult   1                                Multiple of blocks to prefetch on read False
pvid            none                             Physical volume identifier             False
q_type          simple                           Queuing Type                           False
queue_depth     10                               Queue Depth                            True
raid_level      5                                RAID Level                             False
reassign_to     120                              Reassign Timeout value                 True
reserve_policy  no_reserve                       Reserve Policy                         True
rw_timeout      30                               Read/Write Timeout value               True
scsi_id         0x661600                         SCSI ID                                False
size            128                              Size in Mbytes                         False
write_cache     yes                              Write Caching enabled                  False

Important: The major and minor numbers might not be the same on all nodes; this is not a problem. You can still use those different numbers to create the special files.
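Because the same change is needed on all five disks on both nodes, a small loop can save typing; a sketch (run it as root on each node):

for d in hdisk2 hdisk3 hdisk4 hdisk5 hdisk6
do
   chdev -l $d -a reserve_policy=no_reserve
done
lsattr -El hdisk2 -a reserve_policy        # spot-check the result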

On both nodes, run the AIX command /usr/sbin/slibclean as root to clean all unreferenced libraries from memory. Also, check that the /tmp file system has enough free space, about 500 MB on each node, as shown in Example 2-34.

Example 2-34 Run slibclean and verify free space on /tmp

root@austin1:/> /usr/sbin/slibclean

root@austin1:/> df -g
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd2           2.25      0.10   96%    41949    62% /usr
/dev/hd9var        0.06      0.05   28%      494     5% /var
/dev/hd3           1.06      1.03    4%       57     1% /tmp
/dev/hd1           0.50      0.49    2%       76     1% /home

root@austin2:/> /usr/sbin/slibclean

root@austin2:/> df -g
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd2           2.31      0.16   94%    41921    51% /usr
/dev/hd9var        0.06      0.04   31%      494     5% /var
/dev/hd3           1.06      0.78   27%      833     1% /tmp
/dev/hd1           0.44      0.44    1%       28     1% /home

Installing Oracle CRS code

You need a graphical user interface (GUI) to run the Oracle Universal Installer (OUI). Export DISPLAY to the appropriate value, change to the directory that contains the Oracle installation packages, and run the installer as the oracle user, as shown in Figure 2-2 on page 50.
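A sketch of that launch sequence (the DISPLAY value and staging directory are illustrative; in our case the installation packages reside on the shared GPFS file system):

su - oracle
export DISPLAY=workstation:0.0          # an X server reachable from the node
cd /orabin/stage/clusterware/Disk1      # directory containing the Clusterware installation packages
./runInstaller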

Important: Oracle Clusterware uses interface number (en3, for example) to define the interconnect network to be used. It is mandatory that this interface number is the same on all the nodes in the cluster. You have to enforce this requirement prior to installing Oracle Clusterware.

Figure 2-2 Run OUI from the installation directory

You are asked if rootpre.sh has run, as shown in Figure 2-3 on page 51. Make sure to execute Disk1/rootpre/rootpre.sh as root user on each node. After running rootpre.sh on both nodes, type <y> to proceed to the next step.

Note: If you have the Oracle code on a CDROM mounted on one of the nodes, you need to NFS export the CDROM mounted directory and mount it on the other node. You can also remote copy files to the other node, then run rootpre.sh on both nodes.

However, because in our test environment the CRS Disk1 (code) resides on a GPFS shared file system, there is no need to make an NFS mount or to copy files to the other node.

Figure 2-3 Executing the runInstaller program

To install the Oracle CRS code:

1. At the OUI welcome window, click Next, as shown in Figure 2-4.

Figure 2-4 OUI Welcome window

2. Specify an ORACLE_HOME name and destination directory for the CRS installation as shown in Figure 2-5.

Figure 2-5 Specifying Home name and path

3. Figure 2-6 shows Product-Specific Prerequisite Checks. The installer verifies that your environment meets minimum requirements. If there are no failures, click Next.

Figure 2-6 Product-Specific Prerequisite Checks

4. Specify a cluster name, which is austin_crs in our example. Click Add to add public node name, private node name, and virtual host name specified in the /etc/hosts file as shown in Figure 2-7.

Figure 2-7 Specifying node configuration

5. Repeat for the second node and check the cluster information as shown in Figure 2-8.

Figure 2-8 Specifying cluster configuration

6. Click Edit to change the Interface type for each network interface, as shown in Figure 2-9 and Figure 2-10 on page 57.

Figure 2-9 Specifying interface type

Important: Oracle Clusterware uses interface number (en3 in our example) to define the interconnect network to use. It is mandatory that this interface number is the same on all the nodes in the cluster. You must enforce this requirement prior to installing Oracle Clusterware.

Figure 2-10 Specifying network interface usage

7. Specify the Oracle Cluster Registry (OCR) disk location as shown in Figure 2-11. With normal redundancy, you can have two OCR disks. However, if you select external redundancy, no OCR mirroring is provided by Oracle.

Figure 2-11 Specifying OCR location

8. Specify the voting disks’ location as shown in Figure 2-12. With normal redundancy, you can have three vote disks.

Figure 2-12 Specifying voting disk location

9. Check if you can see both cluster local node and remote node before the installation begins (see Figure 2-13), and then click Install.

Figure 2-13 Summary before CRS installation

10. The OUI continues the installation on the first node and then copies the code files to the second node automatically, as shown in Figure 2-14.

Figure 2-14 Installing CRS codes

11. Figure 2-15 shows the list of configuration scripts that you need. Keep the OUI window open and execute the scripts in the following order. Wait for each script execution to complete successfully before you run the next script. Do not run these commands at the same time. As the root user:

a. On austin1, execute orainstRoot.sh.
b. On austin2, execute orainstRoot.sh.
c. On austin1, execute root.sh.
d. On austin2, execute root.sh.

Figure 2-15 Executing configuration scripts

Note: At the final stage of executing root.sh on the second node, the VIP Configuration Assistant starts automatically. However, if you receive an error message stating that the given interface(s), 'en2', is not public and that public interfaces should be used to configure virtual IPs, run $ORACLE_CRS_HOME/bin/vipca manually as root in a GUI environment.

The reason for this error message: when verifying the IP addresses, the VIP Configuration Assistant uses calls to determine whether an IP address is valid. In this case, it finds that the addresses belong to non-routable ranges (for example, 192.168.* and 10.10.*). Oracle is aware that such addresses can be used for a public network, but because these ranges are mostly used for private networks, it displays this error message.

To use the VIP Configuration Assistant:

1. Figure 2-16 shows the Welcome window for VIP configuration. Click Next.

Figure 2-16 Welcome window for VIP configuration

2. Select the network interface corresponding to the public network as shown in Figure 2-17, then click Next.

Figure 2-17 Selecting network interfaces for VIP

3. Provide the IP alias name and IP address as shown in Figure 2-18, then click Next.

Figure 2-18 Configuring VIPs

4. Validate the previous entries for VIP configuration as shown in Figure 2-19 and click Finish.

Figure 2-19 Summary before configuring VIPs

In Figure 2-20, the VIP Configuration Assistant proceeds with the creation, configuration, and startup of all application resources on all selected nodes.

Figure 2-20 VIP configuration progress window

5. Check the configuration results in Figure 2-21 and click Exit.

Figure 2-21 VIP configuration results

6. Clicking Exit takes you back to the previous window, shown in Figure 2-22 (the same window as Figure 2-15). Click OK, and the Configuration Assistants execute automatically.

Figure 2-22 Executing configuration scripts

If Configuration Assistant is successful, “End of Installation” appears automatically as shown in Figure 2-23.

Figure 2-23 End of CRS installation window

2.3 Oracle 10g Clusterware patch set update

Before updating Oracle CRS with the patch set, stop CRS by running crsctl stop crs as the root user on both nodes. Execute <patchset_directory>/Disk1/runInstaller in a GUI environment. You are asked whether /usr/sbin/slibclean has been run by root; run this command on both nodes and type <y> (yes) to proceed (a short sketch of this preparation follows). To update the patch set:
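A sketch of this preparation (run the root commands on both nodes; the patch set staging directory placeholder is left as-is):

crsctl stop crs                         # as root, on austin1 and austin2
/usr/sbin/slibclean                     # as root, on austin1 and austin2
su - oracle
cd <patchset_directory>/Disk1
./runInstaller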

1. You see the Welcome window as shown in Figure 2-24. Click Next.

Figure 2-24 Welcome window for installing CRS patch

2. Specify ORACLE_CRS_HOME installation directory (Figure 2-25), and click Next.

Figure 2-25 Specifying ORACLE_CRS_HOME installation directory

3. Click Next. All the nodes are selected by default (Figure 2-26).

Figure 2-26 Specifying cluster nodes

4. Click Next to verify the ORACLE_CRS_HOME directory, node list, space requirements, and so on (Figure 2-27).

Figure 2-27 Summary before updating patch set

Wait for the patch installation process to complete (Figure 2-28 on page 75).

Figure 2-28 Installing update patch set

5. Run $ORACLE_CRS_HOME/install/root102.sh as the root user on both nodes to complete the patch set update, and then click Exit (Figure 2-29).

Figure 2-29 End of CRS update

2.4 Oracle 10g database installation

Refer to Appendix D, “Oracle 10g database installation” on page 269.

2.5 Networking considerations

This section presents networking considerations and differences from the previous version of Oracle 9i RAC.

Network architecture simpler than for Oracle 9i RAC
In the previous release, Oracle 9i RAC, the network requirements and architecture for the interconnect network were quite complex, because Oracle 9i RAC relies on a vendor-provided high availability infrastructure, such as HACMP in the case of AIX. At the same time, GPFS required a separate interconnect network, which is not managed by HACMP.

For more information about Oracle 9i RAC on an IBM System p setup, refer to Deploying Oracle9i RAC on eServer Cluster 1600 with GPFS, SG24-6954.

Oracle 10g RAC requires HACMP neither to provide the clustering layer nor to protect the interconnect network. Etherchannel provides network (interconnect) high availability for Oracle 10g RAC. The Etherchannel is configured and managed at the AIX OS level. The configuration is simple, standard, and fault resilient. Oracle Clusterware does not have to manage the availability of its interconnect, because AIX provides this availability without any other clustering product, such as HACMP.

Single interconnect network for RAC and GPFS
All clusters need a network for internal communication. This communication link is essential; if this network is lost, the cluster cannot operate normally. This type of network is called the interconnect. For the architecture that we are considering, two clustering layers are configured, GPFS and Oracle Clusterware (formerly called CRS), and both require an interconnect network.

Oracle Clusterware
Oracle 10g RAC needs a “private” interconnect network for cache fusion traffic, which includes data block exchange between instances, plus service messages. Depending on the amount of load and the type of database operations (select, insert, updates, cross-update, and so forth) running on the instances, the throughput on this interconnect can be high. Most Oracle database traffic between instances is based on the UDP protocol.

The term “private” used for interconnect means that this network must be separated from the client access (public) network (used by the clients to access the database). The private interconnect is limited to the nodes hosting a RAC instance. The public network might connect to WAN. However, the term “private” does not mean that another cluster layer (in this case, GPFS) cannot share it.

GPFS
As a cluster file system, GPFS needs an interconnect network. In a typical Oracle database and GPFS configuration, the actual data I/O flows through the host bus adapters (for example, Fibre Channel) and not through the IP network (interconnect). This method allows superior performance. The GPFS interconnect network is used for service messages and the token management mechanism. However, Oracle 10g RAC comes with its own data synchronization mechanism and does not use GPFS locking.

GPFS uses TCP for its internal messages and relies on IP addresses not on a specific interface number. Because the data I/Os are not using the IP network, GPFS does not require a high network bandwidth; thus, the GPFS interconnect can be overlapped with the Oracle interconnect (same network).

Sizing the Interconnect network

Important: Oracle Clusterware uses interface number (en3 for example) to define the interconnect network to be used. It is mandatory that this interface number is the same on all the nodes in the cluster. You have to enforce this requirement prior to installing Oracle Clusterware.

Note: Even though Oracle 10g RAC interconnect requires special attention, sizing for this network is based on the same principles as for any other IP network. If the interconnect is properly sized, this network can be shared for other clustering traffic, such as GPFS, which adds almost no load onto the network. Therefore, the GPFS interconnect traffic can be mixed together with Oracle interconnect without a potential impact on RAC performance.

Because these networks share the same purpose and need the same level of availability, it is worth grouping them into a single network. Doing so, we avoid the cost and complexity of maintaining two resilient networks, one of which is not heavily loaded (GPFS) but necessary.

The diagram in Figure 2-30 presents a typical network configuration.

Figure 2-30 Typical network configuration

Note: We recommend using a single network for both Oracle 10g RAC and GPFS interconnects.

(Figure 2-30 depicts the two nodes, austin1 and austin2, each running a RAC instance, Oracle Clusterware, and GPFS. The nodes are connected by the shared Oracle 10g RAC and GPFS interconnect network, 10.1.100.31 and 10.1.100.32, and by the public network, 192.168.100.31 and 192.168.100.32, with the VIP aliases 192.168.100.131 and 192.168.100.132.)

Example 2-35 shows the IP name resolution for our sample cluster.

Example 2-35 /etc/hosts file for both nodes

# Public network
192.168.100.31     austin1
192.168.100.32     austin2

# Oracle RAC Virtual IP addresses on the public network
192.168.100.131    austin1_vip
192.168.100.132    austin2_vip

# Oracle RAC + GPFS interconnect network
10.1.100.31        austin1_interconnect
10.1.100.32        austin2_interconnect

Example 2-36 presents the network interface configuration for both public and private networks on node austin1.

Example 2-36 Network configuration on node austin1

root@austin1:/> ifconfig -a
en2: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
        inet 192.168.100.31 netmask 0xffffff00 broadcast 192.168.100.255
        inet 192.168.100.131 netmask 0xffffff00 broadcast 192.168.100.255
en3: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
        inet 10.1.100.31 netmask 0xffffff00 broadcast 10.1.100.255
        tcp_sendspace 131072 tcp_recvspace 65536

Oracle Clusterware installation fails if the network interface names are not the same on all nodes in the cluster. The VIP addresses are configured as IP aliases on the public network, as shown in Example 2-37.

Example 2-37 Network configuration for node austin2

root@austin2:/> ifconfig -a
en2: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
        inet 192.168.100.32 netmask 0xffffff00 broadcast 192.168.100.255
        inet 192.168.100.132 netmask 0xffffff00 broadcast 192.168.100.255
en3: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
        inet 10.1.100.32 netmask 0xffffff00 broadcast 10.1.100.255
        tcp_sendspace 131072 tcp_recvspace 65536

Note: In our configuration, Oracle public network is using the same adapter, en2, on both nodes, and en3 is dedicated to RAC and GPFS interconnect traffic.

Different types of networks for interconnect
The network chosen for the interconnect has to provide the high bandwidth needed for the Oracle cache fusion mechanism, as well as low latency.

Oracle does not support the use of crossover cables between two nodes. Use a switch in all cases; a switch is needed for interconnect network failure detection by Oracle Clusterware. Although there are no AIX issues, crossover cables are neither recommended nor supported.

There are several network choices for interconnect usage:

� Ethernet 100 Mb/s
Although certified with Oracle 10g RAC, this network does not provide adequate performance for current demands and must be avoided in production environments.

� Ethernet 1 Gb/s
Currently, this network is the most commonly used network and provides reasonable throughput and low latency.

� Ethernet 10 Gb/s
Although not yet officially supported by Oracle (it is quite a new technology), this network can be used for testing purposes, because the IP protocol is used (as for Fast and Gigabit Ethernet). When certification becomes available on your platform, this network is a very good candidate for the interconnect, removing the limitations of a single 1 Gb Ethernet adapter. It also avoids having to use Etherchannel with multiple 1 Gb adapters to aggregate their bandwidth.

You can check for 10 Gb Ethernet support on the following link:

http://www.oracle.com/technology/products/database/clustering/certify/tech_generic_unix_new.html

� InfiniBand
On AIX, at the time of this writing, only IP over IB is supported for the Oracle interconnect.

Currently, on AIX, Oracle does not support Reliable Datagram Sockets (RDS), which can be used on Linux with the release 10.2.0.3 and higher. With RDS, latency is lower, because protocol is simpler than IP and takes less CPU to build each frame; thus, the throughput is higher.

Using a virtual network for the interconnect
A virtual Ethernet network is also supported for the Oracle private interconnect network, as well as for the client (public) network with VIP address failover.

RAC cluster nodes on a single physical server
Figure 2-31 on page 81 shows the virtual network used as interconnect.

Tip: For more information, see the Oracle Metalink note 220970.1 at:

http://metalink.oracle.com

You need an Oracle Metalink ID to access this note.

Figure 2-31 Virtual interconnect network for nodes in the same server

A virtual network environment can be used for development, test, or benchmark purposes where no high availability is required for client access, and when the nodes are different logical partitions (LPARs) of the same physical server. In this case, a virtual network can be created without the need of physical network interfaces, or a Virtual I/O Server (VIOS).

This virtual Ethernet network practically never fails, because it does not rely on physical adapters or cables. A virtual network inside a physical server is highly available in itself; there is no need to secure it with Etherchannel, for example, as you do for physical Ethernet networks.

This virtual network is a perfect candidate for RAC and GPFS interconnects when all cluster nodes reside inside one physical server (for example, for a test environment). The bandwidth is 1 Gb/s minimum, but it can be much higher. The latency varies depending on the overall CPU load for the entire server.

RAC cluster nodes on separate physical servers
For high availability purposes, we recommend using LPARs on separate physical systems for production.

In the current IBM System p5 implementation, external network access from a virtual Ethernet network requires a VIOS with a shared Ethernet adapter (SEA). It is possible to design, implement, and use an interconnect network similar to the one shown in Figure 2-32 on page 82. A typical configuration uses two VIOSs per frame with SEA failover, which provides good high availability for the network.

Figure 2-32 Virtual interconnect network for nodes on different servers (see the Note that follows)

The use of a single VIOS (and thus, no SEA failover) for the interconnect network is not resilient enough. The VIOS is a single point of failure. This configuration is not recommended, although it is supported.

Although this setup is not the best one for RAC interconnect purposes, it remains the state of the art for all other usages, including public or administrative networks.

For another virtual network setup using Etherchannel over dual VIOS to protect the network against failures (instead of using SEA failover), refer to 7.1, “Virtual networking environment” on page 229.

When RAC nodes reside on different servers (which should be the standard configuration), we recommend that you set up a physical interconnect network by using dedicated adapters that are not managed by a VIOS, as shown in Figure 2-33 on page 83.

Note: However, in the current implementation, this network architecture is not recommended (although supported) for Oracle RAC interconnect. Even though in case of a failure in the primary VIOS physical network adapter, the SEA failover mechanism provides failover to the second VIOS without TCP/IP disruption, the process is considered too slow for Oracle Clusterware and can lead to problems with the CRS daemons.

Figure 2-33 Physical interconnect network for nodes on different servers

Jumbo frames
Most modern 1 Gb (or faster) Ethernet network switches support a feature called “jumbo frames”, which allows them to handle a maximum packet size of 9000 bytes instead of the traditional Ethernet frame size of 1500 bytes. You can set this parameter at the interface level and on the switch. Jumbo frames are not activated by default.

Example 2-38 shows how to enable the jumbo frames for one adapter.

Example 2-38 smit chgenet window at the adapter level

Change / Show Characteristics of an Ethernet Adapter

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                                        [Entry Fields]
  Ethernet Adapter                                    ent0
  Description                                         2-Port 10/100/1000 Ba>
  Status                                              Available
  Location                                            03-08
  Rcv descriptor queue size                          [1024]                  +#
  TX descriptor queue size                           [512]                   +#
  Software transmit queue size                       [8192]                  +#
  Transmit jumbo frames                               yes                    +
  Enable hardware TX TCP resegmentation               yes                    +
  Enable hardware transmit and receive checksum       yes                    +
  Media speed                                         Auto_Negotiation       +
  Enable ALTERNATE ETHERNET address                   no                     +
  ALTERNATE ETHERNET address                         [0x000000000000]        +
  Apply change to DATABASE only                       no                     +
  Enable failover mode                                disable                +

Note: Except for testing or development, we recommend nodes on different hardware with a physical network as interconnect.
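The same adapter change can also be made from the command line instead of smit; a hedged sketch (the attribute name can vary with the adapter type, the interface name is illustrative, and the interface must be detached first or the change deferred with -P until the next reboot):

chdev -l ent0 -a jumbo_frames=yes -P      # adapter: transmit jumbo frames
chdev -l en0  -a mtu=9000                 # matching interface: raise the MTU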

If you use Etherchannel, enabling jumbo frames when creating the Etherchannel pseudo device automatically sets transmit jumbo frames to yes for all the underlying interfaces (starting in AIX 5.2), as shown in Example 2-39.

Example 2-39 smit etherchannel to create an Etherchannel with jumbo frames

Add An EtherChannel / Link Aggregation

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                                        [Entry Fields]
  EtherChannel / Link Aggregation Adapters            ent0,ent1              +
  Enable Alternate Address                            no                     +
  Alternate Address                                  []                      +
  Enable Gigabit Ethernet Jumbo Frames                yes                    +
  Mode                                                round_robin            +
  Hash Mode                                           default                +
  Backup Adapter                                                             +
  Automatically Recover to Main Channel               yes                    +
  Perform Lossless Failover After Ping Failure        yes                    +
  Internet Address to Ping                           []
  Number of Retries                                  []                      +#
  Retry Timeout (sec)                                []                      +#
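The smit panel above corresponds to a mkdev invocation for the EtherChannel pseudo-device; a hedged sketch (attribute names as we recall them; verify them with lsattr -El on the resulting entX device, and add a backup adapter attribute if one is used):

mkdev -c adapter -s pseudo -t ibm_ech \
      -a adapter_names=ent0,ent1 -a mode=round_robin -a use_jumbo_frame=yes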

The switches and all other networking components involved must support jumbo frames. Whenever possible, using jumbo frames for the interconnect network is a good choice, because it reduces the number of packets (and thus data fragmentation) under heavy loads. If all of your networks are 1 Gb or faster, and none of them are 10 or 100 Mb, jumbo frames can be enabled everywhere.

Etherchannel
Etherchannel is a network port aggregation technology that allows several Ethernet adapters to be put together to form a single pseudo-Ethernet device. In our test environment, on nodes austin1 and austin2, the ent0 and ent1 adapters have been aggregated to form the logical device called ent2; the corresponding interface, en2, is configured with an IP address. The system and the remote hosts consider these aggregated adapters as one logical device (interface).

All adapters in an Etherchannel must be configured for the same speed (1Gb, for example) and must be full duplex. Mixing adapters of different speeds in the same Etherchannel is not supported.

In order to achieve bandwidth aggregation, all physical adapters have to be connected to the same switch, which must also support Etherchannel.

You can have up to eight primary Ethernet adapters and only one backup adapter per Etherchannel.

Note: As long as the switches support jumbo frames, we recommend using jumbo frames for interconnect network.

Etherchannel or IEEE 802.3ad
There are two types of port aggregation:

� IEEE 802.3ad Link Aggregation
� Etherchannel, which is the standard Cisco implementation

Both Etherchannel and IEEE 802.3ad Link Aggregation require switches capable of handling these protocols. Certain switches can auto-discover the IEEE 802.3ad ports to aggregate. Etherchannel needs configuration at the switch level to define the grouped ports.

Etherchannel mode: Standard or round-robin
There are three supported configuration modes for Etherchannel:

� Standard (the default)
Outgoing packets to the same IP address are all sent over the same adapter (chosen depending on a hash mode algorithm). In a two-node cluster, which is common for Oracle 10g RAC, this mode leads to using only one interface. Thus, there is no real bandwidth aggregation (or load balancing). This mode is useful when communicating with a large number of IP addresses.

� Round-robin
Outgoing packets are sent evenly across all adapters. In a two-node cluster, this mode provides the best load balancing possible. However, packets might be received out of sequence at the destination node. The command netstat -s | grep out-of-order gives the number of out-of-order packets received, as shown in Example 2-40. Observe the number of out-of-order packets over a period of time under heavily loaded database conditions (a small monitoring sketch follows this list). If this number steadily increases, do not use this mode for your cluster.

Example 2-40 Checking for out of order packets

root@dallas2:/> netstat -s | grep out-of-order
        1261 out-of-order packets (0 bytes)
root@dallas2:/> netstat -s | grep out-of-order
        1274 out-of-order packets (0 bytes)

� IEEE 802.3ad
According to the IEEE 802.3ad specification, the packets are always distributed in the standard fashion, never in round-robin mode.

Example 2-39 on page 84 shows you how to set the round-robin mode when creating the Etherchannel.
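A trivial way to observe the out-of-order trend mentioned for round-robin mode is to sample the counter periodically during a heavy database load; a sketch:

while true
do
   date
   netstat -s | grep out-of-order
   sleep 60
done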

Note: For interconnect network, we recommend Etherchannel, because it provides performance enhancement through the round-robin load balancing algorithm.

Note: For Oracle 10g RAC private interconnect network, we recommend an Etherchannel using the round-robin algorithm.

Making the interconnect highly available
Etherchannel is also a good answer for high availability. Because multiple network adapters are used together, the failure of one adapter reduces the actual bandwidth, but IP connectivity is still guaranteed as long as at least one physical adapter is available.

Remember, all adapters are connected to the same switch, which is a constraint to aggregate the bandwidths. Although we have several adapters, the switch itself can be considered as a single point of failure. The entire Etherchannel is lost if the switch is unplugged or fails, even if the network adapters are still available.

To address this issue and remove the last single point of failure found in the interconnect networks, Etherchannel provides a backup interface. In the event that all of the adapters in the Etherchannel fail, or if the primary switch fails, the backup adapter will be used to send and receive all traffic. In this case, the bandwidth is the one provided by the single backup adapter, with no aggregation any longer. When any primary link in the Etherchannel is restored, the service is moved back to the Etherchannel. Only one backup adapter per Etherchannel can be configured. The adapters configured in the primary Etherchannel are used preferentially over the backup adapter. As long as at least one of the primary adapters is functional, it is used.

Of course, the backup adapter has to be connected to a separate switch and linked to a different network infrastructure. It is not necessary for the backup switch to be Etherchannel capable or enabled.

Figure 2-34 on page 87 shows how to design a resilient Etherchannel and how to connect the physical network adapters to the switches. Interfaces en2 and en3 are used together in an aggregated mode and are connected on the Etherchannel capable switch. Interface en1 is connected on the backup switch and is used only if both en2 and en3 fail, or if the primary switch has problems.

Figure 2-34 Resilient Etherchannel architecture

Oracle parameter CLUSTER_INTERCONNECTS
The CLUSTER_INTERCONNECTS parameter can be used to load balance traffic over multiple interfaces; however, it does not provide any failover mechanism, which means that interconnect failover must be handled by AIX.

When set with an IP address, this mechanism overrides clusterware settings, and the specified IP address is used for the interconnect traffic, including Oracle Global Cache Service (GCS), Global Enqueue Service (GES), and Interprocessor Parallel Query (IPQ). If set with two addresses, both addresses are used in a load balancing mode, but as soon as one link is down, all interconnect traffic is stopped, because the failover mode is turned off.
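To see which networks Oracle Clusterware itself has registered (rather than overriding them with this parameter), the oifcfg utility can be queried; a sketch, with output shaped after our configuration (the exact formatting can differ):

$ORA_CRS_HOME/bin/oifcfg getif
en2  192.168.100.0  global  public
en3  10.1.100.0     global  cluster_interconnect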

Note: Etherchannel provides a resilient networking infrastructure, implemented at the AIX level, without any other software (HACMP for instance). Oracle 10g RAC can use this network regardless of other failover considerations.

Node 1

Etherchannel or 802.3ad capable primary switch

en4

en1

en2

en3

en4

en1

en2

en3

Backup switch

Interconnect

Node 1 Node 1

Etherchannel or 802.3ad capable primary switch

en4

en1

en2

en3

en4

en1

en2

en3

Backup switch

Interconnect

Node 1

Chapter 2. Basic RAC configuration with GPFS 87


To query (using the Oracle SQL client) which network Oracle 10g RAC actually uses for its private interconnect, see Example 2-41.

Example 2-41 How to display the interconnect network actually used

SQL> select INST_ID, NAME_KSXPIA, IP_KSXPIA from X$KSXPIA where PUB_KSXPIA = 'N';

   INST_ID NAME_KSXPIA     IP_KSXPIA
---------- --------------- ----------------
         1 en3             10.1.100.33

SQL>

2.6 Considerations for Oracle code on shared space

In this scenario, we describe how to use a shared repository for Oracle code files, which simplifies code maintenance and backup operations.

With GPFS, it is possible for all instances to share the same set of binary files and thus minimize the effort of upgrading or patching the software. The following components can be stored on GPFS file systems:

- Oracle database files
- Oracle Clusterware files (OCR and voting disks)
- Oracle Flash Recovery Area
- Oracle archive log destination
- Oracle Inventory
- Oracle database binaries (ORACLE_HOME)
- Oracle Clusterware binaries (ORA_CRS_HOME)
- Oracle database log/trace files
- Oracle Clusterware log/trace files

Oracle datafiles, the Oracle Clusterware OCR and voting disks, the Oracle Flash Recovery Area, and the Oracle archive log destination require shared storage. This is not mandatory for the remaining components; for these, you can choose either shared space (file system) or individual storage space on each cluster node. The advantage of using GPFS for these components is ease of administration, as well as the ability to access files belonging to a crashed node before the node has been recovered. The disadvantages are the extra layer that GPFS introduces and the constraint of not being able to perform Oracle rolling upgrades.

When using a shared file system for Oracle binaries, you need to make sure that all instances are shut down before code upgrades, because code files for Oracle RAC, as well as Oracle Clusterware, cannot be changed while the instances are running.

To shut down the Oracle cluster, refer to the readme file shipped with the patch code. You must make sure that all database instances, Enterprise Manager Database Control, iSQL*Plus, and Oracle Clusterware processes are shut down.
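As a sketch only (the patch readme remains the authoritative source, and the database and node names below are examples), the shutdown sequence on each node might look similar to the following:

# as the oracle user
emctl stop dbconsole                  # Enterprise Manager Database Control
isqlplusctl stop                      # iSQL*Plus
srvctl stop database -d GPFSMIG       # all database instances
srvctl stop nodeapps -n bigbend1      # listener, VIP, GSD, and ONS on this node

# as root, on each node
crsctl stop crs                       # Oracle Clusterware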

Note: We recommend that you use Etherchannel at the AIX level, which provides load balancing and failover at the same time. Do not use the cluster_interconnects parameter, so Oracle 10g RAC can use the network defined when installing CRS Clusterware for its interconnect.


Even though it is possible to place all Oracle files in GPFS, because of the recovery time in GPFS when losing the Configuration Manager node, we recommend putting the Oracle Clusterware voting disk and OCR files outside GPFS, on raw partitions. For a description of how to move OCR and voting disks from GPFS to raw devices, refer to 3.8, “Moving OCR and voting disks from GPFS to raw devices” on page 144. This configuration avoids the situation in which Oracle Clusterware might reboot nodes before GPFS has recovered. Another option is to tune the MissCount parameter in Oracle Clusterware, but this also extends the time the cluster takes to react to a real error situation.

Similar to OCR and voting disks, you can argue that having Oracle Clusterware binaries and log files on GPFS might cause clusterware malfunction and thus cause node eviction in case of a GPFS freeze or an erroneous configuration. Furthermore, Oracle Clusterware does support rolling upgrades, which do not work with a shared binaries installation. In fact, OUI does not seem to fully recognize that Oracle Clusterware is installed on shared space.

In conclusion, even though GPFS can be used to provide shared storage for Oracle Clusterware throughout the environment, we recommend using local file systems for the Oracle Clusterware code.

2.7 Dynamic partitioning and Oracle 10g

With LPAR, IBM System p servers can run multiple operating systems in a single physical machine. Logical partitioning helps to optimize system resources and reduce system complexity. On servers featuring the POWER5 processor, up to 10 logical partitions can be configured per processor. Each partition runs an individual copy of the operating system, with its own memory and devices (physical or logical).

In addition to partitioning resources, dynamic resource management capabilities further enhance system flexibility and usability. Administrators can change a partition's configuration dynamically without rebooting the operating system or stopping running processes. System resources, such as processors, memory, and I/O devices (both physical and virtual), can be added, removed, or moved between partitions that are running within the same physical server. This flexibility allows administrators to dynamically assign resources to partitions based on workload requirements.

Figure 2-35 on page 90 shows an example of a partitioned eight CPU IBM System p5 server. Processors, memory, and disks are shared among partitions using virtualization capabilities. Through the Hardware Management Console (HMC), an administrator can dynamically adjust running partitions by changing the number of assigned processors, the size of memory, and physical or virtual adapters. This capability allows for better utilization of all server resources by moving them to partitions that have higher requirements.

Oracle 10g Database is DLPAR aware, which means that it is capable of adapting to changes in the LPAR configuration and of making use of additional (dynamically added) resources. This section describes how the Oracle database exploits dynamic changes in processors and memory when running in an LPAR.

Note: When running Oracle 10g RAC on LPAR nodes, we recommend that you have LPARs located on separate System p servers in order to avoid single points of failure, such as the power supply, Central Electronic Complex (CEC), system backplane, and so forth.


Figure 2-35 Dynamic partitioning with POWER5 systems

2.7.1 Dynamic memory changes

Oracle introduced the dynamic SGA in Oracle 9i, and these capabilities are enhanced in Oracle 10g. It is now possible for the database administrator to change (decrease or increase) memory pools dynamically by setting the SGA_TARGET parameter. SGA_TARGET can be increased up to the value of the SGA_MAX_SIZE parameter (set in the init.ora file or the spfile). The administrator can specify the size of individual SGA pools manually or just set SGA_TARGET and let the database automatically size the pools within the SGA.

The size of the virtual memory allocated by Oracle at startup time is equal to the value of the SGA_MAX_SIZE parameter, but only part of it, specified by SGA_TARGET, is actually used. This means that the Oracle database can start with an SGA_MAX_SIZE larger than the amount of memory assigned to the partition at Oracle startup time. SGA_TARGET can be increased up to the limit of the physical memory available to the LPAR; by adding more memory to the partition, SGA_TARGET can be increased further. The administrator must anticipate the amount of memory that can be given to the instance and set the SGA_MAX_SIZE parameter accordingly.

The following Oracle views are useful to monitor the behavior of the dynamic SGA:

- v$sga view displays summary information about SGA

- v$sgastat displays detailed information about SGA

- v$sgainfo displays size information about SGA, including sizes of different SGA components, granule size, and free memory

- v$sga_dynamic_components displays current, minimum, and maximum size for the dynamic SGA components

- v$sga_dynamic_free_memory displays information about the amount of SGA memory that is available for future dynamic SGA operations

- v$sga_resize_ops displays information about the last 400 completed SGA resize operations
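For example, a quick way to see how the automatically managed pools are currently sized (a generic query, not specific to this environment) is:

SQL> select component, current_size, min_size, max_size
  2  from v$sga_dynamic_components;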



The SGA should not use pinned memory, because the amount of virtual memory that Oracle allocates at startup time is equal to the value of the SGA_MAX_SIZE parameter; with a pinned SGA, that entire SGA_MAX_SIZE amount must be available in the LPAR before instance startup. If the LOCK_SGA parameter is set to TRUE, the SGA is pinned and is not paged.

The AIX 5L operating system does not allow pinned memory to be removed. With a pinned SGA, the database administrator can neither reduce the effective size of SGA_TARGET nor remove real memory from the LPAR; in other words, it is not possible to lower SGA_TARGET in order to move memory out of the LPAR. When the SGA is not pinned, this is possible.
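If you nevertheless decide to pin the SGA, the settings involved are sketched below; the values are illustrative, so check the tunables for your AIX level before changing them:

# AIX: allow pinned shared memory segments (as root)
vmo -p -o v_pinshm=1

# Oracle: pin the SGA starting with the next instance restart
SQL> alter system set lock_sga=true scope=spfile;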

Example of dynamic memory change
Figure 2-36 shows the partition configured with 2.5 GB of memory. Limits are set within the partition profile: the amount of memory can be decreased to 1 GB or increased up to 6 GB.

Figure 2-36 LPAR configuration

Note: When specifying pinned memory for SGA, an instance does not start unless there is enough memory for the LPAR to host the SGA_MAX_SIZE. Also, DLPAR memory operations are not permitted on memory reserved for SGA_MAX_SIZE.


The 2.5 GB amount is consistent with the memory size reported by AIX (Example 2-42).

Example 2-42 Physical memory available to AIX operating system before addition

{texas:oracle}/orabin/ora102/dbs -> prtconf | grep "Memory Size"
Memory Size: 2560 MB
Good Memory Size: 2560 MB

Example 2-43 shows Oracle initialization parameters related to SGA memory.

Example 2-43 Instance parameters related to SGA memory

SQL> show parameter sga

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 4G
sga_target                           big integer 1G

Figure 2-37 on page 93 shows an additional 1 GB of memory assigned to this partition with the Hardware Management Console.


Figure 2-37 Adding 1 GB of memory to a partition

After this operation, AIX sees more physical memory (3.5 GB), as shown in Example 2-44.

Example 2-44 Physical memory available to AIX operating system

{texas:oracle}/orabin/ora102/dbs -> prtconf | grep "Memory Size"
Memory Size: 3584 MB
Good Memory Size: 3584 MB

At this point, Oracle allocates only 1 GB of memory for SGA (SGA_TARGET parameter value). Output from the Oracle sqlplus command is in Example 2-45.

Example 2-45 Size information about SGA memory, including free SGA

SQL> select * from v$sgainfo;

NAME                             BYTES      RES
-------------------------------- ---------- ---
Fixed SGA Size                   2078368    No
Redo Buffers                     14696448   No
Buffer Cache Size                771751936  Yes
Shared Pool Size                 251658240  Yes
Large Pool Size                  16777216   Yes
Java Pool Size                   16777216   Yes
Streams Pool Size                0          Yes
Granule Size                     16777216   No
Maximum SGA Size                 4294967296 No
Startup overhead in Shared Pool  67108864   No
Free SGA Memory Available        3221225472

11 rows selected.

The next step is to change the SGA_TARGET value, so Oracle can use additional memory segments. In Example 2-46, SGA_TARGET is set to 3.5 GB (3584 MB).

Example 2-46 Changing SGA_TARGET value

SQL> alter system set sga_target=3584M;

System altered.

SQL> show parameter sga

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 4G
sga_target                           big integer 3584M

At this point, the v$sgainfo view reflects the new values; only about 512 MB of SGA memory remains free (see Example 2-47).

Example 2-47 Size information about SGA memory after resizing SGA_TARGET

SQL> select * from v$sgainfo;

NAME                             BYTES      RES
-------------------------------- ---------- ---
Fixed SGA Size                   2078368    No
Redo Buffers                     14696448   No
Buffer Cache Size                3456106496 Yes
Shared Pool Size                 251658240  Yes
Large Pool Size                  16777216   Yes
Java Pool Size                   16777216   Yes
Streams Pool Size                0          Yes
Granule Size                     16777216   No
Maximum SGA Size                 4294967296 No
Startup overhead in Shared Pool  67108864   No
Free SGA Memory Available        536870912

11 rows selected.


From now on, the Oracle database can use new memory segments that are added to the partition dynamically.

2.7.2 Dynamic CPU allocation

Dynamic CPU reconfiguration is available starting with AIX V5.2 and POWER4 processors. AIX V5.3 and POWER5 introduced micropartitioning and the ability to share a single physical CPU between partitions. With POWER5 and AIX V5.3, partitions can run on as little as 0.1 (10%) of one CPU, and up to 10 partitions, each running its own operating system, can run concurrently on a single CPU.

Without micropartitioning, each physical processor assigned to a partition is visible as a single processor in AIX, and each change in the partition's (dedicated) processors is directly visible to the operating system.

Things are more complicated when micropartitioning is used. The AIX operating system sees virtual processors instead of physical ones, because the kernel and its scheduler have to see a natural number of processors.

Up to 10 virtual processors can be defined for a partition with an assigned processing capacity of 1.0 CPU; conversely, a single virtual processor can be backed by as little as 0.1 of a real processor. With dynamic partitioning, both the entitled capacity (the amount of processing units) and the number of virtual processors can be changed dynamically, and when necessary, both can change at the same time.
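As a sketch (the managed system and partition names are placeholders, and option syntax can vary between HMC releases), the entitled capacity and virtual processors can be inspected from AIX and changed from the HMC command line:

# On the LPAR: show entitled capacity, virtual CPUs, and SMT mode
lparstat -i

# On the HMC: dynamically add 0.5 processing units and one virtual processor
chhwres -r proc -m p5-570 -o a -p texas --procunits 0.5 --procs 1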

When the capacity is increased, applications run faster, because the power hypervisor assigns more physical processor time to each virtual processor. By increasing the number of virtual CPUs only, it is unlikely that the application runs faster, because the overall amount of processing units does not change.

All running applications gain performance when the capacity is increased in an LPAR. Oracle also recognizes new virtual processors (because they appear the same way as dedicated CPUs on AIX) and adjusts its SQL optimizer plans.

Both Oracle 9i and 10g use CPU_COUNT and PARALLEL_THREADS_PER_CPU parameters to compute the number of parallel query processes. Oracle 10g automatically adjusts the CPU_COUNT parameter when the number of CPUs changes in the partition.

With POWER5 processor and AIX V5.3, Simultaneous Multi-Threading (SMT) is introduced. With SMT, the POWER5 processor gets instructions from more than one thread. What differentiates this implementation is its ability to schedule instructions for execution from all threads concurrently.

With SMT, the system dynamically adjusts to the environment, allowing instructions to execute from each thread (if possible) and allowing instructions from one thread to use all the execution units if the other thread encounters a long latency event. The POWER5 design implements two-way SMT on each CPU.

If simultaneous multi-threading is activated:

- More instructions are executed at the same time.

- The operating system views the processing threads as twice the number of physical processors installed in the system.

- Each physical processor (in dedicated partitions) or virtual processor (in shared partitions) is visible to Oracle as two processing threads.


- OLTP applications can gain up to 30% in performance.

Oracle works just fine on any OS that recognizes a hyper-threading-enabled system. In addition, it takes advantage of the logical CPUs to their fullest extent (assuming the OS reports that it recognizes that hyper-threading is enabled).

The simultaneous multi-threading policy is controlled by the operating system and is partition specific.

Example of dynamic CPU change
An Oracle instance is running in an LPAR with the following CPU resources:

- Current processing units: 1.0
- Minimum processing units: 0.5
- Maximum processing units: 4.0
- Current virtual processors: 1
- Minimum virtual processors: 1
- Maximum virtual processors: 8
- SMT enabled

Figure 2-38 shows the partition resources visible through HMC.

Figure 2-38 Partition resources visible through HMC


Because SMT is enabled by default and processes cannot distinguish SMT threads from real CPUs, the processor counts reported by the operating system and by the Oracle instance differ: AIX reports one processor, while Oracle reports two CPUs, as shown in Example 2-48.

Example 2-48 Number of processors in AIX and Oracle

root@texas:/> prtconf | grep Processors
Number Of Processors: 1

root@texas:/> lsattr -El proc0
frequency   1900098000     Processor Speed       False
smt_enabled true           Processor SMT enabled False
smt_threads 2              Processor SMT threads False
state       enable         Processor state       False
type        PowerPC_POWER5 Processor type        False

SQL> show parameter cpu_count

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     2

By disabling SMT functionality, the number of processors visible in Oracle as CPU_COUNT is reduced by 50%, as shown in Example 2-49.

Example 2-49 Disabling SMT in AIX

root@texas:/> smtctl -m off -w now
smtctl: SMT is now disabled.

root@texas:/> lsattr -El proc0
frequency   1900098000     Processor Speed       False
smt_enabled false          Processor SMT enabled False
smt_threads 2              Processor SMT threads False
state       enable         Processor state       False
type        PowerPC_POWER5 Processor type        False

SQL> show parameter cpu_count

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     1

The action shown in Example 2-49 does not change the actual partition configuration. In the next step, the number of virtual processors in the partition is changed to two, as presented in Figure 2-39 on page 98.


Figure 2-39 Increasing the number of virtual processors with HMC

Example 2-50 shows the number of processors that change in AIX and Oracle.

Example 2-50 Changed number of processors in AIX and Oracle

root@texas:/> prtconf | grep Processors
Number Of Processors: 2

SQL> show parameter cpu_count;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     2

Additional information appears in the Oracle alert.log file (Example 2-51 on page 99).


Example 2-51 Oracle alert.log containing changes in the CPU count

Detected change in CPU count to 2

Oracle does not increase the CPU_COUNT value beyond three times the CPU count at instance startup. For example, after starting the Oracle instance with one CPU and increasing the number of processors to four, CPU_COUNT is set to three and the entry shown in Example 2-52 is generated in the alert.log file.

Example 2-52 Change in CPU from one to four

Detected change in CPU count to four
Detected CPU count four higher than the allowed maximum (3), capped
(3 times CPU count at instance startup)

When operating in the AIX 5L and System p environment, you can dynamically add or remove CPUs from an LPAR with an active Oracle instance. The AIX 5L kernel scheduler automatically distributes work across all CPUs. In addition, Oracle Database 10g dynamically detects the change in CPU count and exploits it with parallel query processes.

Important: Dynamic LPAR reconfiguration can cause high utilization of processors in partitions and cause some processes to slow down or stop. Oracle Clusterware is highly sensitive and might evict the node that is being reconfigured from the cluster. The fix for this issue (APAR: IY84564) is available from http://www.ibm.com/support.


Part 2 Configurations using dedicated resources

Part two discusses the following Oracle 10g RAC scenarios:

- Basic RAC configuration with GPFS

We cover hardware, networking, operating system, and GPFS configuration parameters. We also present the Oracle CRS installation and patch update steps.

- Migration and upgrade scenarios

We describe various migration and upgrade scenarios, covering the steps that we recommend for expanding or shrinking your system and for replacing software components or bringing them to the latest version.



Chapter 3. Migration and upgrade scenarios

This chapter provides various migration scenarios that help you migrate and scale your environment as needed. Figure 3-1 on page 104 shows the infrastructure that we use in the scenarios. The following scenarios are tested in our environment:

- Migrating a standalone Oracle database instance from JFS2 or raw partitions to GPFS

- Migrating a single database instance to a clustered RAC environment with GPFS and shared ORACLE_HOME

- Adding a node to an existing Oracle RAC (with GPFS)

- Migrating an HACMP-based RAC to Oracle Clusterware and GPFS

- Upgrading a RAC installation from GPFS V2.3 to GPFS V3.1

- Moving Oracle Clusterware OCR and voting disks from GPFS to raw devices


3.1 Migrating a single database instance to GPFS

Figure 3-1 shows a simple diagram of the test environment that we use for this publication.

Figure 3-1 Residency test environment

The purpose of this scenario is to change the storage space for Oracle files (single instance) from JFS/JFS2 or raw devices to GPFS. There is no Oracle clustering involved at this time. At the end of this migration, a single instance database is still running on a single node GPFS cluster.

In this section, we consider two source scenarios. The target is GPFS; the source for the datafiles can be either JFS2 or raw partitions.

The starting point for this scenario is a single database instance, using JFS2 (local) file system. Oracle code files are located in /orabin directory (a separate file system) and Oracle data files are in /oradata file system. Oracle Inventory is stored in /home/oracle directory. Oracle Inventory, data files, and code are moved to GPFS.

Note: Although a single node GPFS cluster is not officially supported, this scenario is in fact a step toward a multi-node RAC environment based on GPFS. For this matter, GPFS file system parameters are configured as for a multi-node cluster (the correct number of estimated nodes that mounts the file system (the -n option in mmcrfs)).

For details about GPFS considerations, refer to the GPFS V3.1 Concepts, Planning, and Installation Guide, GA76-0413, and section 2.1.7, “Special consideration for GPFS with Oracle” on page 44.

(Figure 3-1 depicts the test environment: two IBM System p5 570 servers, an 8-way with 64 GB of memory and a 12-way with 32 GB, hosting the LPARs austin1/2, dallas1/2, houston1/2, alamo1/2, bigbend1/2/3, texas, and two Virtual I/O Servers; a DS4800 storage subsystem (640 GB, RAID5) attached through an FC switch; an HMC and a NIM server; a public/administration network (192.168.100.x); and the RAC and GPFS interconnect network (10.1.100.x). Software levels: AIX 5.3 TL5, GPFS 3.1.0.6, HACMP 5.2, VIOS 1.4, Oracle RAC 10.2.0.3.)


3.1.1 Moving JFS2-based ORACLE_HOME to GPFS

For this scenario, you must have enough SAN-attached storage to hold all Oracle-related files that you plan to move.

To move ORACLE_HOME and Oracle Inventory from JFS to GPFS, follow these steps:

1. Shut down all Oracle processes.

2. Unmount the /orabin file system and remount it on /jfsorabin.

3. Create a single node GPFS cluster and the NSDs that you are using for GPFS (see the sketch after these steps).

4. Create GPFS for /orabin, and mount on /orabin. Make sure that the right permissions exist for the /orabin GPFS file system, for example, oracle:dba.

5. Copy the entire ORACLE_HOME from JFS2 to GPFS (as oracle user):

cd /jfsorabin; tar cvf - ora102 | (cd /orabin; tar xvf -)

6. Unmount the /jfsorabin file system.

7. Move Oracle Inventory:

a. cd /home/oracle; tar cvf - OraInventory | (cd /orabin; tar xvf -)

b. Update the OraInventory location stored in /etc/oraInst.loc.
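For steps 3 and 4, a minimal sketch of the GPFS commands might look like the following; the node, disk descriptor file, and file system names are examples only, so refer to the GPFS documentation and 2.1.6, “GPFS configuration” on page 30 for the options appropriate to your environment:

# Create a single node GPFS cluster (dallas1 is quorum node and manager)
mmcrcluster -N dallas1:quorum-manager -p dallas1 -r /usr/bin/ssh -R /usr/bin/scp

# Define the NSDs from a disk descriptor file, then create and mount the file system
# (plan -n for the final number of nodes that will mount it, not just one)
mmcrnsd -F /tmp/orabin.desc
mmcrfs /orabin orabinfs -F /tmp/orabin.desc -B 512K -n 4 -A yes
mmmount orabinfs -a

# Give the oracle user ownership of the new file system
chown oracle:dba /orabin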

3.1.2 Moving JFS2-based datafiles to GPFS

For copying the database files from JFS to GPFS, you have several options. Database size and availability requirements might require a step-by-step approach, or at least a certain degree of parallelism in the copy operation. The options for moving database files are:

1. Assuming the database files are located in one directory, /oradata/db in our case, and you can afford to take the database offline, the simplest way is to copy data files from JFS2 to GPFS:

After taking the database offline, change the mount point of the /oradata file system to /jfsoradata and remount the file system on the new mount point; then, as the oracle user, run the following command:

$ cd /jfsoradata; mv db /oradata/ && umount /jfsoradata

This approach requires that the space is allocated on both JFS and GPFS.

2. If the database is too large (that is, copying takes longer than is acceptable), but you have the necessary space on both JFS2 and GPFS, you can use the Oracle RMAN utility to back up the database as an image copy and switch to it (see the Oracle documentation and the sketch after this list).

Note: On the test system, the file oraInst.loc exists in several places:

root@dallas1:/> find / -name oraInst.loc -print 2> /dev/null
/etc/oraInst.loc
/orabin/ora102/oraInst.loc
/orabin/ora102/bigbend1_GPFSMIG1/oraInst.loc

On our test system, all these files are updated.


3. Finally, if the database files cannot be held in two copies, we have two options:

a. Perform a full database backup to tape, destroy the existing JFS/JFS2 file system, create the GPFS file system reusing the disk space previously used for JFS/JFS2, and restore the database files. This method requires the database to be offline during the backup and restore operations.

b. Copy the datafiles in smaller portions. This method requires that the associated tablespaces are offline during the operation. Copying is done on the OS level or by using the Oracle RMAN utility.
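For option 2, a hedged sketch of the RMAN approach (the destination path is an example; validate the commands against the RMAN documentation for your release) is:

# RMAN: create image copies of all datafiles on the GPFS file system,
# then switch the database to the copies (run the switch with the database mounted, not open)
BACKUP AS COPY DATABASE FORMAT '/oradata/db/%U';
SWITCH DATABASE TO COPY;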

3.1.3 Moving raw devices database files to GPFS

The operation of copying datafiles from raw partitions to GPFS is similar to copying datafiles from JFS to GPFS. However, using the mv command to move the database files is not possible; the other options remain available.

We describe more complex examples covering this topic in 3.4, “Migrating from HACMP-based RAC cluster to GPFS using RMAN” on page 123 and in 3.5, “Migrating from RAC with HACMP cluster to GPFS using dd” on page 133.

3.2 Migrating from Oracle single instance to RAC

In this section, you start with a single database instance running on GPFS. GPFS is also used to store Oracle code and data files. Oracle Inventory is stored on a local file system, in /home/oracle. For additional information about how to store Oracle code and data files, refer to 3.1, “Migrating a single database instance to GPFS” on page 104.

For this scenario, we conduct a full installation of both Oracle Clusterware and Oracle RAC software. Software installation is needed, because Oracle must link its binary files to the system libraries. After software installation is complete, the steps to convert from single instance to RAC are:

1. Perform basic node preparation: prerequisites, network, and GPFS code.
2. Add the node to the GPFS cluster.
3. Install and configure Oracle Clusterware using OUI.
4. Install Oracle RAC code using OUI.
5. Configure the database for RAC.
6. Set up Transparent Application Failover (TAF).


3.2.1 Setting up the new node

The steps to add the node to the cluster are similar to the steps presented in 2.1, “Basic scenario” on page 20:

1. Install the node with the correct version of base software (AIX, GPFS, Secure Shell, and so forth). Make sure that the new node has exactly the same software versions as the existing node running Oracle.

2. Make sure that you have sufficient free space in /tmp. Oracle Installer requires 600 MB, but the requirements for the node might be higher.

3. Check that the kernel configuration parameters are identical to those of the existing node.

4. Create the oracle user with the same user ID and group ID as on the existing node.

5. Set up the oracle user environment and shell limits as on the existing node.

6. Attach the new node to the storage subsystem, and make sure that you can access the GPFS logical unit numbers (LUNs) from both nodes.

7. Check the remote command execution (rsh/ssh) between nodes.

3.2.2 Add the new node to existing (single node) GPFS cluster

After the preparations described in 3.2.1, “Setting up the new node” on page 107 have been completed, add the new node to the GPFS cluster by running the mmaddnode command from the node that is already part of the cluster. Verify that the new node appears in the cluster configuration, then start the GPFS daemon on the new node using the mmstartup command. If necessary, use the mmmount command to mount the existing file systems on the new node.
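A minimal command sketch (bigbend1 is the existing node and bigbend2 the new one; the names are illustrative):

# Run on bigbend1, the node that is already in the cluster
mmaddnode -N bigbend2
mmlscluster                      # verify that bigbend2 is now listed

# Start GPFS on the new node and mount the existing file systems there
mmstartup -N bigbend2
mmmount all -N bigbend2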

3.2.3 Installing and configuring Oracle Clusterware using OUI

The process is exactly the same as for any normal Oracle Clusterware installation. We covered the process in detail in 2.2, “Oracle 10g Clusterware installation” on page 46, 2.3, “Oracle 10g Clusterware patch set update” on page 70, and 2.4, “Oracle 10g database installation” on page 76.

Tip: A good technique is to use the AIX Network Installation Manager (NIM) to "clone" the existing node from an mksysb image.

Note: After the file system has been successfully mounted and is accessible from both nodes in the cluster, stop GPFS on both nodes and make sure your GPFS cluster and file systems follow the quorum and availability recommendations:

– Check and adjust the cluster quorum method. Add NSD tiebreaker disks to the cluster and change cluster quorum to node quorum with tiebreaker disks.

– Check and configure the secondary cluster data server.

For more details, refer to 2.1.6, “GPFS configuration” on page 30.
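A sketch of the quorum-related commands mentioned in this note (the NSD and node names are placeholders, and GPFS must be down on all nodes while the quorum configuration is changed):

mmshutdown -a                                   # stop GPFS on all nodes
mmchconfig tiebreakerDisks="nsd1;nsd2;nsd3"     # node quorum with tiebreaker disks
mmchcluster -s bigbend2                         # define a secondary cluster data server
mmstartup -a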


3.2.4 Installing Oracle RAC option using OUI

During this installation, the Oracle 10g database software is installed as though Oracle had not previously been installed. The installation process is similar to the one shown in Appendix D, “Oracle 10g database installation” on page 269. In this way, the previous installation and configuration can be kept unchanged (as a fallback option). You need to run the Oracle network configuration assistant (netca) to configure listener.ora and tnsnames.ora. You also need to set ORACLE_SID with the instance number to reflect the new configuration.

3.2.5 Configure database for RAC

The changes to the database are:

- Recreate the database control file, depending on current settings
- Change the spfile
- Add a redo log thread for each instance
- Add an undo tablespace per thread
- Create the RAC data dictionary views (catclust)
- Enable the new thread
- Register the new instance with Oracle Clusterware

Note: Running crsctl stop crs might not always be sufficient to shut down Oracle Clusterware entirely. We have observed that at least oprocd was not stopped. This process needs to be removed before any Oracle code patching operation. Repeated crsctl start crs and crsctl stop crs commands, or kill <PID>, might work, but be aware that using the kill command can result in a node reboot. The only process left running is:

root@bigbend1:/> ps -ef | grep crs

root 491592 1 0 09:37:53 - 0:00 /bin/sh /etc/init.crsd run

Note: If an Oracle Clusterware uninstall is needed, note that just running OUI will not cleanly deinstall Oracle Clusterware. Oracle Metalink Doc ID Note:239998.1 documents this process. For a complete deinstallation, $ORA_CRS_HOME/install contains the scripts rootdelete.sh and rootdeinstall.sh, which you need to run before using the OUI.

If your nodes continuously reboot, the only chance you have to stop this behavior is to try to log on to the system as root as soon as you get a login prompt, before Oracle Clusterware starts, and use the crsctl disable crs command. This command will stop repeated system reboot. Oracle Metalink can be found at:

http://metalink.oracle.com

Note: You need an Oracle Metalink ID to access this note.


Control file re-creation
Certain parameters might be inappropriate for RAC and might require the re-creation of the control file to reflect the changes. These parameters are:

MAXLOGFILES
MAXLOGMEMBERS
MAXDATAFILES
MAXINSTANCES
MAXLOGHISTORY

Change these parameters according to your requirements.

We use the default parameters from the 10g installation. However, in the field you might run into installations that were upgraded from 9i, or even with MAXINSTANCES deliberately set to 2. The 10g defaults from austin1 are:

MAXLOGFILES   192
MAXLOGMEMBERS 3
MAXDATAFILES  1024
MAXINSTANCES  32
MAXLOGHISTORY 292

One way to verify the parameters is to back up controlfile to trace, which produces a file in the udump destination, as shown in Example 3-1.

Example 3-1 Creating a backup of the database controlfile

{austin1:oracle}/home/oracle -> sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on Thu Nov 15 01:24:41 2007

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> alter session set tracefile_identifier='CTLTRACE';

Session altered.

SQL> alter database backup controlfile to trace;

Database altered.

SQL> host ls -l /oracle/ora102/admin/austindb/udump/*CTLTRACE*
-rw-r-----   1 oracle   dba   6060 Nov 15 01:25 /oracle/ora102/admin/austindb/udump/austindb1_ora_712888_CTLTRACE.trc

SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
{austin1:oracle}/home/oracle ->


The file contains two sections, each of which has the complete statement for creating a controlfile. The two sections cover two scenarios:

- Current redologs in place (NORESETLOGS)
- Current redologs lost/damaged (RESETLOGS)

To recreate the controlfile to change MAXINSTANCES, use the NORESETLOGS section after changing the value of MAXINSTANCES appropriately. As also mentioned in the tracefile, Oracle Recovery Manager (rman) information is lost, which you must also consider.

Database spfile reconfiguration
In this task, an spfile is already in use, but for this type of editing, working with a pfile is easier. Thus, we create a pfile, as shown in Example 3-2.

Example 3-2 sqlplus command to create pfile

SQL> create pfile='mypfile.ora' from spfile;

Edit the pfile and add the RAC-specific information. In Oracle RAC, each instance needs its own undo tablespace and its own redo logs. Thus, the new configuration needs to be similar to the one shown in Example 3-3.

Example 3-3 Oracle RAC specific configuration

GPFSMIG1.undo_tablespace='UNDOTBS1'
GPFSMIG2.undo_tablespace='UNDOTBS2'
*.cluster_database=true
*.cluster_database_instances=2
GPFSMIG1.thread=1
GPFSMIG2.thread=2
GPFSMIG1.instance_number=1
GPFSMIG2.instance_number=2

The new environment has specific (per-instance) settings; you must remove any database-wide settings that conflict with them. In our environment, only the undo configuration is in conflict. The parameter that we removed is shown in Example 3-4.

Example 3-4 Configuration parameter to remove

*.undo_tablespace='UNDOTBS1'

Note: Due to the dynamic features in most of the configuration parameters, we recommend that you use the spfile for the installation. The pfile is only used to quickly make changes.


Finally, both instances need to use the same spfile. This is handled by creating a single spfile, spfileGPFSMIG.ora, which is then linked to the instance-specific names. The result in $ORACLE_HOME/dbs is shown in Example 3-5. The spfileGPFSMIG.ora file is created from mypfile.ora (see Example 3-3 on page 110).

Example 3-5 Oracle spfile

{bigbend1:oracle}/orabin/ora102RAC/dbs -> ls -l spfileGPFSMIG*.ora
-rw-r-----   1 oracle   dba   4608 Sep 30 16:07 spfileGPFSMIG.ora
lrwxrwxrwx   1 oracle   dba     17 Sep 28 15:46 spfileGPFSMIG1.ora -> spfileGPFSMIG.ora
lrwxrwxrwx   1 oracle   dba     17 Sep 28 15:46 spfileGPFSMIG2.ora -> spfileGPFSMIG.ora
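A sketch of how such a layout can be produced (the paths follow the example above; adapt them to your installation):

SQL> create spfile='/orabin/ora102RAC/dbs/spfileGPFSMIG.ora' from pfile='mypfile.ora';

{bigbend1:oracle}/orabin/ora102RAC/dbs -> ln -s spfileGPFSMIG.ora spfileGPFSMIG1.ora
{bigbend1:oracle}/orabin/ora102RAC/dbs -> ln -s spfileGPFSMIG.ora spfileGPFSMIG2.ora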

Other configuration changes might be needed to increase the SGA, because Oracle RAC uses part of the SGA for the Global Cache Directory (GCD). The size of the GCD depends on the database size. Also, due to the multi-versioning of data blocks across the instances, an increased SGA might be needed. Whether to increase the SGA depends on the application's workload characteristics.

Creating new redo and undo logs

In this scenario, the undo tablespace and redo logs for the new instance are created in a similar manner to those of the existing instance. The undo tablespace creation is shown in Example 3-6 on page 112.

Note: Even though the spfiles are created in $ORACLE_HOME/dbs, we recommend that you place the spfiles outside of $ORACLE_HOME.

Note: It is impossible to recommend a proper value for the SGA size. Use the buffer cache advisor to assess the effectiveness of the caching.

Note: You must evaluate various options, such as naming, sizing, and mirroring for redo logs based on the installation that you are upgrading.


Example 3-6 Creating the undo log for second instance

SQL> select file_name, tablespace_name, bytes, bytes/1024/1024 MB
  2  from dba_data_files where tablespace_name like 'UNDO%';

FILE_NAME            TABLESPACE      BYTES         MB
-------------------- ---------- ---------- ----------
/oradata/GPFSMIG/und UNDOTBS1    267386880        255
otbs01.dbf

SQL> create undo tablespace UNDOTBS2
  2  datafile '/oradata/GPFSMIG/undotbs02.dbf' size 255M;

Tablespace created.

SQL> select file_name, tablespace_name, bytes, bytes/1024/1024 MB
  2  from dba_data_files where tablespace_name like 'UNDO%';

FILE_NAME            TABLESPACE      BYTES         MB
-------------------- ---------- ---------- ----------
/oradata/GPFSMIG/und UNDOTBS1    267386880        255
otbs01.dbf

/oradata/GPFSMIG/und UNDOTBS2    267386880        255
otbs02.dbf

In our scenario, we chose to rename the redo logs. Example 3-7 lists the current redo logs.

Example 3-7 Identify existing redo logs

SQL> select group#, member from v$logfile;

    GROUP# MEMBER
---------- ------------------------------
         3 /oradata/GPFSMIG/redo03.log
         2 /oradata/GPFSMIG/redo02.log
         1 /oradata/GPFSMIG/redo01.log

SQL> select group#, thread#, bytes, bytes/1024/1024 MB from v$log;

    GROUP#    THREAD#      BYTES         MB
---------- ---------- ---------- ----------
         1          1   52428800         50
         2          1   52428800         50
         3          1   52428800         50

Example 3-8 on page 113 shows the creation of new redo log files. The naming is chosen so the thread is part of the redo log file name. Therefore, the new redo logs are named differently. In a later step, the old files are renamed.

112 Deploying Oracle 10g RAC on AIX V5 with GPFS

Page 127: efrit.tistory.comefrit.tistory.com/attachment/cfile26.uf@135206104BA8089457FB44.pdf · iv Deploying Oracle 10g RAC on AIX V5 with GPFS 3.2.4 Installing Oracle RAC option using OUI

Example 3-8 Creating the redo log groups

SQL> alter database add logfile thread 2 group 4 '/oradata/GPFSMIG/redo01-02.log' size 50M;

Database altered.

SQL> alter database add logfile thread 2 group 5 '/oradata/GPFSMIG/redo02-02.log' size 50M;

Database altered.

SQL> alter database add logfile thread 2 group 6 '/oradata/GPFSMIG/redo03-02.log' size 50M;

Database altered.

To rename the redo logs, the database must be in mount mode. Renaming is shown in Example 3-9.

Example 3-9 Renaming the existing redo logs

SQL> select group#, member from v$logfile;

    GROUP# MEMBER
---------- ------------------------------
         3 /oradata/GPFSMIG/redo03.log
         2 /oradata/GPFSMIG/redo02.log
         1 /oradata/GPFSMIG/redo01.log
         4 /oradata/GPFSMIG/redo01-02.log
         5 /oradata/GPFSMIG/redo02-02.log
         6 /oradata/GPFSMIG/redo03-02.log

6 rows selected.

SQL> host mv /oradata/GPFSMIG/redo01.log /oradata/GPFSMIG/redo01-01.log

SQL> host mv /oradata/GPFSMIG/redo02.log /oradata/GPFSMIG/redo02-01.log

SQL> host mv /oradata/GPFSMIG/redo03.log /oradata/GPFSMIG/redo03-01.log

SQL> alter database rename file '/oradata/GPFSMIG/redo01.log',
  2  '/oradata/GPFSMIG/redo02.log', '/oradata/GPFSMIG/redo03.log' to
  3  '/oradata/GPFSMIG/redo01-01.log', '/oradata/GPFSMIG/redo02-01.log',
  4  '/oradata/GPFSMIG/redo03-01.log';

Database altered.

SQL> select group#, member from v$logfile;

    GROUP# MEMBER
---------- ------------------------------
         3 /oradata/GPFSMIG/redo03-01.log
         2 /oradata/GPFSMIG/redo02-01.log
         1 /oradata/GPFSMIG/redo01-01.log
         4 /oradata/GPFSMIG/redo01-02.log
         5 /oradata/GPFSMIG/redo02-02.log
         6 /oradata/GPFSMIG/redo03-02.log

6 rows selected.

Finally, the new thread is enabled, as shown in Example 3-10. After the new thread is enabled, the new instance can be started.

Example 3-10 Enabling new thread

SQL> alter database enable thread 2;

Database altered.

Both instances can now be started, and the database can be opened from both nodes.

Registering a new instance with Oracle Clusterware
To enable Oracle Clusterware control of the new instance, you must register the database and the second instance with the clusterware, as shown in Example 3-11.

Example 3-11 Adding database and instance with srvctl

{bigbend1:oracle}/home/oracle -> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....D1.lsnr application    ONLINE    ONLINE    bigbend1
ora....nd1.gsd application    ONLINE    ONLINE    bigbend1
ora....nd1.ons application    ONLINE    ONLINE    bigbend1
ora....nd1.vip application    ONLINE    ONLINE    bigbend1
ora....D2.lsnr application    ONLINE    ONLINE    bigbend2
ora....nd2.gsd application    ONLINE    ONLINE    bigbend2
ora....nd2.ons application    ONLINE    ONLINE    bigbend2
ora....nd2.vip application    ONLINE    ONLINE    bigbend2
{bigbend1:oracle}/home/oracle -> srvctl add database -d GPFSMIG -o $ORACLE_HOME
{bigbend1:oracle}/home/oracle -> srvctl add instance -d GPFSMIG -i GPFSMIG1 -n bigbend1
{bigbend1:oracle}/home/oracle -> srvctl add instance -d GPFSMIG -i GPFSMIG2 -n bigbend2

Example 3-12 on page 115 shows how the crs_stat -t output changes after srvctl starts the database and its instances.

Note: At this time, you must create certain Oracle RAC specific data dictionary views by running the catclust.sql file. This action produces a lot of output, so we do not show running the catclust.sql file here. The catclust file is in $ORACLE_HOME/rdbms/admin.
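For reference, a minimal way to run it while connected as SYSDBA is (the ? is expanded by SQL*Plus to ORACLE_HOME):

SQL> @?/rdbms/admin/catclust.sql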


Example 3-12 Startup database using srvctl

{bigbend1:oracle}/home/oracle -> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....G1.inst application    OFFLINE   OFFLINE
ora....G2.inst application    OFFLINE   OFFLINE
ora.GPFSMIG.db application    OFFLINE   OFFLINE
ora....D1.lsnr application    ONLINE    ONLINE    bigbend1
ora....nd1.gsd application    ONLINE    ONLINE    bigbend1
ora....nd1.ons application    ONLINE    ONLINE    bigbend1
ora....nd1.vip application    ONLINE    ONLINE    bigbend1
ora....D2.lsnr application    ONLINE    ONLINE    bigbend2
ora....nd2.gsd application    ONLINE    ONLINE    bigbend2
ora....nd2.ons application    ONLINE    ONLINE    bigbend2
ora....nd2.vip application    ONLINE    ONLINE    bigbend2
{bigbend1:oracle}/home/oracle -> srvctl start database -d GPFSMIG
{bigbend1:oracle}/home/oracle -> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....G1.inst application    ONLINE    ONLINE    bigbend1
ora....G2.inst application    ONLINE    ONLINE    bigbend2
ora.GPFSMIG.db application    ONLINE    ONLINE    bigbend1
ora....D1.lsnr application    ONLINE    ONLINE    bigbend1
ora....nd1.gsd application    ONLINE    ONLINE    bigbend1
ora....nd1.ons application    ONLINE    ONLINE    bigbend1
ora....nd1.vip application    ONLINE    ONLINE    bigbend1
ora....D2.lsnr application    ONLINE    ONLINE    bigbend2
ora....nd2.gsd application    ONLINE    ONLINE    bigbend2
ora....nd2.ons application    ONLINE    ONLINE    bigbend2
ora....nd2.vip application    ONLINE    ONLINE    bigbend2
{bigbend1:oracle}/home/oracle ->

3.2.6 Configuring Transparent Application Failover

If transparent application failover (TAF) is used, you must set it up in the tnsnames.ora configuration file. The tnsnames.ora that we use for the final verification is shown in Example 3-13.

Example 3-13 tnsnames.ora with TAF

GPFSMIG =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = bigbend1_vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = bigbend2_vip)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = GPFSMIG)
      (FAILOVER_MODE =
        (TYPE = select)
        (METHOD = basic)
      )
    )
  )


3.2.7 Verification

In this task, we connect to one of the nodes (the first node in this case), select information to show the connection, stop this instance, and reselect information to show that failover has occurred. Example 3-14 shows these actions. The test is run on node bigbend1.

Example 3-14 TAF failover

SQL> select instance_number instance#, instance_name, host_name, status
  2  from v$instance;

 INSTANCE# INSTANCE_NAME    HOST_NAME            STATUS
---------- ---------------- -------------------- ------------
         1 GPFSMIG1         bigbend1             OPEN

SQL> select failover_type, failover_method, failed_over
  2  from v$session where username='SYSTEM';

FAILOVER_TYPE FAILOVER_M FAILED_OVER
------------- ---------- -----------
SELECT        BASIC      NO

REM initiate shutdown instance 1
SQL> host sqlplus -s / as sysdba
select instance_number from v$instance;

INSTANCE_NUMBER
---------------
              1

shutdown abort
ORACLE instance shut down.
exit

SQL>
REM end shutdown instance 1

SQL> select instance_number instance#, instance_name, host_name, status
  2  from v$instance;

 INSTANCE# INSTANCE_NAME    HOST_NAME            STATUS
---------- ---------------- -------------------- ------------
         2 GPFSMIG2         bigbend2             OPEN

SQL> select failover_type, failover_method, failed_over
  2  from v$session where username='SYSTEM';

FAILOVER_TYPE FAILOVER_M FAILED_OVER
------------- ---------- -----------
SELECT        BASIC      YES

SQL>


3.3 Adding a node to an existing RAC

The scenario described in this section assumes the existing Oracle RAC is running on GPFS. The steps are:

1. Set up the new node's basics, just as in 3.2.1, “Setting up the new node” on page 107. Also, copy the file /etc/oraInst.loc to the new node from one of the existing nodes.

2. Add the new node to GPFS (this is similar to 3.2.2, “Add the new node to existing (single node) GPFS cluster” on page 107).

3. Add the node to Oracle Clusterware.

4. Add the new instance to RAC.

5. Reconfigure the database.

3.3.1 Add the node to Oracle Clusterware

Because the ORA_CRS_HOME is shared, and both Oracle Clusterware and Oracle RAC are already configured, adding a new node is a straightforward process.

To add a node to the Oracle Clusterware configuration, Oracle provides a script, $ORA_CRS_HOME/oui/bin/addNode.sh, which starts OUI. If Oracle Clusterware is installed in a separate directory from Oracle RAC, you must use the addNode.sh script located in the Oracle Clusterware home directory.

You add the new node information on the second window of the OUI, which is shown in Figure 3-2 on page 118. In all other windows, click Next.


Figure 3-2 New node information in OUI

When the installation is finished, OUI asks to run root scripts on both the old nodes and the new nodes (see Figure 3-3 on page 119). Running these root scripts performs all of the configuration changes required to install Oracle Clusterware and start Oracle Clusterware on the new node.


Figure 3-3 OUI root scripts

To run the scripts:

1. Start by running the /orabin/crs102/install/rootaddnode.sh script on the existing node (where the addNode.sh script was also run), as shown in Example 3-15.

Example 3-15 OUI root script on existing node (bigbend1)

root@bigbend1:/> cd /orabin/crs102/install
root@bigbend1:/orabin/crs102/install> ./rootaddnode.sh
clscfg: EXISTING configuration version 3 detected.
clscfg: version 3 is 10G Release 2.
Attempting to add 1 new nodes to the configuration
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node <nodenumber>: <nodename> <private interconnect name> <hostname>
node 3: bigbend3 bigbend3_interconnect bigbend3
Creating OCR keys for user 'root', privgrp 'system'..
Operation successful.
/orabin/crs102/bin/srvctl add nodeapps -n bigbend3 -A bigbend3_vip/255.255.255.0/en0|en1 -o /orabin/crs102
root@bigbend1:/orabin/crs102/install>

2. Next, run the /orabin/crs102/root.sh script on the new node, which is shown in Example 3-16 on page 120.


Example 3-16 OUI root script on new node (bigbend3)

root@bigbend3:/orabin/crs102> ./root.sh
WARNING: directory '/orabin' is not owned by root
Checking to see if Oracle CRS stack is already configured
OCR LOCATIONS = /dev/OCR1,/dev/OCR2
Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
WARNING: directory '/orabin' is not owned by root
clscfg: EXISTING configuration version 3 detected.
clscfg: version 3 is 10G Release 2.
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node <nodenumber>: <nodename> <private interconnect name> <hostname>
node 1: bigbend1 bigbend1_interconnect bigbend1
node 2: bigbend2 bigbend2_interconnect bigbend2
clscfg: Arguments check out successfully.

NO KEYS WERE WRITTEN. Supply -force parameter to override.
-force is destructive and will destroy any previous cluster configuration.
Oracle Cluster Registry for cluster has already been initialized
Startup will be queued to init within 30 seconds.
Adding daemons to inittab
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
 bigbend1
 bigbend2
 bigbend3
CSS is active on all nodes.
Waiting for the Oracle CRSD and EVMD to start
Oracle CRS stack installed and running under init(1M)
Running vipca(silent) for configuring nodeapps

Creating VIP application resource on (0) nodes.
Creating GSD application resource on (0) nodes.
Creating ONS application resource on (0) nodes.
Starting VIP application resource on (2) nodes...
Starting GSD application resource on (2) nodes...
Starting ONS application resource on (2) nodes...

Done.
root@bigbend3:/orabin/crs102> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....G1.inst application    OFFLINE   OFFLINE
ora....G2.inst application    OFFLINE   OFFLINE
ora.GPFSMIG.db application    OFFLINE   OFFLINE
ora....D1.lsnr application    ONLINE    ONLINE    bigbend1
ora....nd1.gsd application    ONLINE    ONLINE    bigbend1
ora....nd1.ons application    ONLINE    ONLINE    bigbend1
ora....nd1.vip application    ONLINE    ONLINE    bigbend1
ora....D2.lsnr application    ONLINE    ONLINE    bigbend2

120 Deploying Oracle 10g RAC on AIX V5 with GPFS

Page 135: efrit.tistory.comefrit.tistory.com/attachment/cfile26.uf@135206104BA8089457FB44.pdf · iv Deploying Oracle 10g RAC on AIX V5 with GPFS 3.2.4 Installing Oracle RAC option using OUI

ora....nd2.gsd application ONLINE ONLINE bigbend2ora....nd2.ons application ONLINE ONLINE bigbend2ora....nd2.vip application ONLINE ONLINE bigbend2ora....nd3.gsd application ONLINE ONLINE bigbend3ora....nd3.ons application ONLINE ONLINE bigbend3ora....nd3.vip application ONLINE ONLINE bigbend3

As you can see in the last part of Example 3-16 on page 120, bigbend3 now appears with the basic Oracle Clusterware services, but without an instance or a listener, because they are not yet configured.

3.3.2 Adding a new instance to existing RAC

To add a new instance to an existing RAC that is running with a shared ORACLE_HOME, perform the following actions:

1. Make sure that the Oracle Inventory location is defined by copying /etc/oraInst.loc from one of the existing nodes.

2. Run $ORACLE_HOME/root.sh to create the local files.

3. Update Oracle Inventory with the new cluster node:

./runInstaller -noClusterEnabled -updateNodeList ORACLE_HOME="/orabin/ora102RAC" ORACLE_HOME_NAME="{oraRACHome}" CLUSTER_NODES="{bigbend1,bigbend2,bigbend3}" LOCAL_NODE="bigbend1"

Oracle provides a shell script, $ORACLE_HOME/oui/bin/addNode.sh, for this task, but as with Oracle Clusterware, a shared Oracle Inventory is not updated, and the script prompts for the execution of the $ORACLE_HOME/root.sh script.

4. Run the Oracle network configuration assistant, netca, or reconfigure listener.ora and tnsnames.ora to reflect the new node.
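
If you reconfigure the Oracle Net files manually instead of running netca, the following is a minimal sketch of the kind of tnsnames.ora entry to append for the new node. The service name GPFSMIG, the instance name GPFSMIG3, the VIP host name bigbend3_vip, and port 1521 are assumptions for illustration; adjust them to your environment.

{bigbend3:oracle}/home/oracle -> cat >> $ORACLE_HOME/network/admin/tnsnames.ora <<'EOF'
# Hypothetical entry for the new node bigbend3 (all names are assumptions)
GPFSMIG3 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = bigbend3_vip)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = GPFSMIG)
      (INSTANCE_NAME = GPFSMIG3)
    )
  )
EOF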

Note: The CRS-0215 error can occur when configuring the VIP. According to Oracle Metalink, this error is caused by the default routing configuration. We observed the same issue in our test environment: the VIP was not configured (the vipca command did not complete successfully). We solved this problem by running the following command:

ifconfig en0 alias 192.168.100.157 netmask 255.255.255.0

The interface is en0, and 192.168.100.157 is the IP address used as the VIP.

Note: When addNode.sh was run with Oracle Inventory on shared storage, the node list was not updated. When Oracle Inventory resides on local storage, node list was updated on the local Inventory, but not for all nodes. To update the Oracle Inventory node list for a specific HOME on a specific node, you can use the following command:

$ORA_CRS_HOME/oui/bin/runInstaller -noClusterEnabled -updateNodeList ORACLE_HOME="$ORA_CRS_HOME" ORACLE_HOME_NAME="{CRSHome}" CLUSTER_NODES="{bigbend1,bigbend2,bigbend3}" LOCAL_NODE="bigbend3"

In the previous command, $ORA_CRS_HOME=/orabin/crs102 is the path where Oracle Clusterware is installed, CRSHome is the name for this Oracle HOME in OUI, {bigbend1,bigbend2,bigbend3} is the complete list of nodes in the cluster, and bigbend3 is the node from where the command is run.


3.3.3 Reconfiguring the database

The database reconfiguration process is similar to 3.2.5, “Configure database for RAC” on page 108, except that the bullet “Create RAC data dictionary views, catclust” is not needed here:

1. Add the node three configuration to the spfile.
2. Create a link for node three to the spfile in $ORACLE_HOME/dbs.
3. Add redo log groups for the new RAC node.
4. Add an undo tablespace for the new RAC node.
5. Enable thread three.
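
The actual procedure is documented in 3.2.5 on page 108; the following is only a minimal sketch of these steps as run from an existing instance. The undo tablespace name UNDOTBS3, the data file and redo log file names and sizes, the redo group numbers 5 and 6, the shared spfile path /oradata/spfile_gpfsmig, and the instance name GPFSMIG3 are all assumptions for illustration.

{bigbend1:oracle}/home/oracle -> sqlplus / as sysdba <<'EOF'
-- instance-specific parameters for the third instance (names are assumptions)
alter system set instance_number=3 scope=spfile sid='GPFSMIG3';
alter system set thread=3 scope=spfile sid='GPFSMIG3';
alter system set undo_tablespace='UNDOTBS3' scope=spfile sid='GPFSMIG3';
-- undo tablespace and redo log groups for thread 3
create undo tablespace UNDOTBS3 datafile '/oradata/undotbs3.dbf' size 200M;
alter database add logfile thread 3 group 5 '/oradata/redo3_1.log' size 120M;
alter database add logfile thread 3 group 6 '/oradata/redo3_2.log' size 120M;
alter database enable public thread 3;
EOF
{bigbend3:oracle}/home/oracle -> ln -s /oradata/spfile_gpfsmig $ORACLE_HOME/dbs/spfileGPFSMIG3.ora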

3.3.4 Final verification

To check whether the third instance is actually part of the RAC, Example 3-17 shows the output of crs_stat -t, followed by a select from gv$instance.

Example 3-17 crs_stat -t output with three nodes

{bigbend1:oracle}/home/oracle -> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....G1.inst application    ONLINE    ONLINE    bigbend1
ora....G2.inst application    ONLINE    ONLINE    bigbend2
ora....G3.inst application    ONLINE    ONLINE    bigbend3
ora.GPFSMIG.db application    ONLINE    ONLINE    bigbend1
ora....D1.lsnr application    ONLINE    ONLINE    bigbend1
ora....nd1.gsd application    ONLINE    ONLINE    bigbend1
ora....nd1.ons application    ONLINE    ONLINE    bigbend1
ora....nd1.vip application    ONLINE    ONLINE    bigbend1
ora....D2.lsnr application    ONLINE    ONLINE    bigbend2
ora....nd2.gsd application    ONLINE    ONLINE    bigbend2
ora....nd2.ons application    ONLINE    ONLINE    bigbend2
ora....nd2.vip application    ONLINE    ONLINE    bigbend2
ora....D3.lsnr application    ONLINE    ONLINE    bigbend3
ora....nd3.gsd application    ONLINE    ONLINE    bigbend3
ora....nd3.ons application    ONLINE    ONLINE    bigbend3
ora....nd3.vip application    ONLINE    ONLINE    bigbend3

Example 3-18 shows the output from querying gv$instance.

Example 3-18 gv$instance output

SQL> select instance_number, host_name, database_status from gv$instance;

INSTANCE_NUMBER HOST_NAME       DATABASE_STATUS
--------------- --------------- -----------------
              1 bigbend1        ACTIVE
              3 bigbend3        ACTIVE
              2 bigbend2        ACTIVE


3.4 Migrating from HACMP-based RAC cluster to GPFS using RMAN

This section describes how to migrate from an HACMP-based RAC cluster to a GPFS-based configuration. It walks through the process of completely removing HACMP so that only Oracle Clusterware and RAC run on GPFS, with raw partitions for the OCR and voting disks.

3.4.1 Current raw devices with HACMP

This scenario starts with a RAC configuration that is based on concurrent raw devices managed by HACMP and AIX CLVM. The basic elements of this configuration are:

- Hardware: Two p5-570 LPARs
- OS: AIX V5.3 TL6 and HACMP V5.3 PTF 4
- Oracle 10g RAC Release 2

The initial cluster configuration (Figure 3-4) is based on HACMP; oraclevg is an enhanced concurrent mode (ECM) volume group and is varied on in concurrent mode when HACMP is up and running (RSCT is responsible for arbitrating the concurrent access).

Figure 3-4 Oracle RAC with HACMP (before migration)

All raw devices in this test environment are created with the mklv -B -TO options, thus the logical volume control block (LVCB) does not occupy the first block of the logical volume. Special consideration must be taken if the raw devices for Oracle are created without the mklv -TO option for later use of the dd command to copy data files from raw logical volumes to a file system, as described in 3.5.1, “Logical volume type and the dd copy command” on page 134. Example 3-19 on page 124 shows the list of raw devices that we have used for this scenario.

(Figure 3-4 shows the two nodes, austin1 and austin2: public network addresses 192.168.100.31 and 192.168.100.32 with the VIP labels austin1_vip and austin2_vip, RAC interconnect addresses 10.1.100.31 and 10.1.100.32, local rootvg and alt_rootvg disks, and the shared oraclevg on DS4800 storage accessed through fcs0. oraclevg is an ECM VG and is managed by HACMP.)


Example 3-19 Current raw devices for oracle RAC

root@austin1:/> lsvg -l oraclevg
oraclevg:
LV NAME          TYPE  LPs  PPs  PVs  LV STATE      MOUNT POINT
raw_system       jfs   85   85   1    open/syncd    N/A
raw_sysaux       jfs   65   65   1    open/syncd    N/A
raw_undotbs1     jfs   7    7    1    open/syncd    N/A
raw_temp         jfs   33   33   1    closed/syncd  N/A
raw_example      jfs   20   20   1    open/syncd    N/A
raw_users        jfs   15   15   1    open/syncd    N/A
raw_redo1_1      jfs   15   15   1    closed/syncd  N/A
raw_redo1_2      jfs   15   15   1    closed/syncd  N/A
raw_control1     jfs   15   15   1    open/syncd    N/A
raw_control2     jfs   15   15   1    open/syncd    N/A
raw_spfile       jfs   1    1    1    closed/syncd  N/A
raw_pwdfile      jfs   1    1    1    closed/syncd  N/A
raw_ocr1         jfs   40   40   1    closed/syncd  N/A
raw_ocr2         jfs   40   40   1    closed/syncd  N/A
raw_vote1        jfs   40   40   1    open/syncd    N/A
raw_vote2        jfs   40   40   1    open/syncd    N/A
raw_vote3        jfs   40   40   1    open/syncd    N/A
raw_undotbs2     jfs   7    7    1    open/syncd    N/A
raw_redo2_1      jfs   15   15   1    closed/syncd  N/A
raw_redo2_2      jfs   15   15   1    closed/syncd  N/A

3.4.2 Migrating data files to GPFS

The target configuration is presented in Figure 3-5. Oracle data files reside on GPFS (/oradata, based on hdisk2 ... hdisk22). GPFS provides concurrent access to the data files.

Figure 3-5 Target configuration (data files on GPFS)

(Figure 3-5 shows the same two nodes, austin1 and austin2, now sharing the /oradata GPFS file system built on DS4800 LUNs hdisk2 through hdisk22, which are configured as NSDs. The public, interconnect, and VIP addresses are unchanged from Figure 3-4.)


Example 3-20 shows the commands that we use to migrate the control files and data files from raw devices to a GPFS file system using Oracle Recovery Manager (RMAN). RMAN is a utility for backing up, restoring, and recovering Oracle databases. RMAN is a standard Oracle utility and does not require a separate installation.

Example 3-20 Migrate control files and data files to GPFS using RMAN

SQL> startup nomount
ORACLE instance started.

Total System Global Area 4966055936 bytes
Fixed Size                  2027648 bytes
Variable Size             889196416 bytes
Database Buffers         4060086272 bytes
Redo Buffers               14745600 bytes

SQL> alter system set db_create_file_dest='/oradata/';

SQL> alter system set control_files='/oradata/control1.dbf','/oradata/control2.dbf' scope=spfile;

System altered.

SQL> shutdown immediate;
ORA-01507: database not mounted

ORACLE instance shut down.

{austin1:oracle}/home/oracle -> rman target /

Recovery Manager: Release 10.2.0.1.0 - Production on Thu Sep 27 14:51:29 2007

Copyright (c) 1982, 2005, Oracle. All rights reserved.

connected to target database (not started)

RMAN> startup nomount;

Oracle instance started

Total System Global Area 4966055936 bytes

Fixed Size                  2027648 bytes
Variable Size             889196416 bytes
Database Buffers         4060086272 bytes
Redo Buffers               14745600 bytes

RMAN> restore controlfile from '/dev/rraw_control1';

Starting restore at 2007-09-27 15:15:16
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=144 instance=austindb1 devtype=DISK
channel ORA_DISK_1: copied control file copy
output filename=/oradata/control1.dbf
output filename=/oradata/control2.dbf
Finished restore at 2007-09-27 15:15:18

Tip: Follow the instructions while the second node is down.

RMAN> alter database mount ;

using target database control file instead of recovery catalogdatabase mounted

### Although there is an RMAN command for copying all data files at once, we decided to copy one data file at a time, because we want to give the files names of our own choice.

RMAN> copy datafile '/dev/rraw_system' to '/oradata/system.dbf';

Starting backup at 2007-09-27 16:31:54
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile copy
input datafile fno=00001 name=/dev/rraw_system
output filename=/oradata/system.dbf tag=TAG20070927T163154 recid=10 stamp=634408334
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:25
Finished backup at 2007-09-27 16:32:19

###Repeat copying other datafiles

RMAN> copy datafile '/dev/rraw_sysaux' to '/oradata/sysaux.dbf';
RMAN> copy datafile '/dev/rraw_example' to '/oradata/example.dbf';
RMAN> copy datafile '/dev/rraw_users' to '/oradata/users.dbf';
RMAN> copy datafile '/dev/rraw_undotbs1' to '/oradata/undotbs1.dbf';
RMAN> copy datafile '/dev/rraw_undotbs2' to '/oradata/undotbs.dbf';

RMAN> switch database to copy;

datafile 1 switched to datafile copy "/oradata/system.dbf"
datafile 2 switched to datafile copy "/oradata/undotbs1.dbf"
datafile 3 switched to datafile copy "/oradata/sysaux.dbf"
datafile 4 switched to datafile copy "/oradata/users.dbf"
datafile 5 switched to datafile copy "/oradata/example.dbf"
datafile 6 switched to datafile copy "/oradata/undotbs.dbf"

RMAN> alter database open;

database opened
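
The comment in Example 3-20 notes that RMAN can also copy all data files in one step. A minimal sketch of that alternative is shown next; with this form RMAN chooses the file names itself, based on the %U format mask, so the database must already be mounted as in the example above.

{austin1:oracle}/home/oracle -> rman target / <<'EOF'
# copy every data file to /oradata in a single command, then point the database at the copies
BACKUP AS COPY DATABASE FORMAT '/oradata/%U';
SWITCH DATABASE TO COPY;
EOF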


3.4.3 Migrating the temp tablespace to GPFS

Because temporary tablespace (temp) files do not contain any persistent data, it is easier to create new ones and drop the old versions. Create a new temporary tablespace before removing the previous default temp tablespace. Example 3-21 shows how to replace the temp tablespace. The size and number of temp files are the same as on the system being migrated.

Example 3-21 Migrate temp tablespace to GPFS

SQL> create temporary tablespace TEMPORARY tempfile '/oradata/temp.dbf' size 512M;

Tablespace created.

SQL> alter database default temporary tablespace temporary;

Database altered.

SQL> drop tablespace temp;

Tablespace dropped.

SQL> alter tablespace TEMPORARY rename to TEMP;

Tablespace altered.

SQL> select file_name from dba_temp_files;

FILE_NAME
---------------------------------------------------------------------------------------------
/oradata/temp01.dbf

3.4.4 Migrating the redo log files to GPFS

Drop the existing redo logs and recreate them in GPFS. Each instance (thread) in RAC must have at least two redo log groups. To replace the redo log files in the RAC environment, add two new log file groups for each instance (thread), and then run the SQL> alter system switch logfile; command until the old log files become inactive (see Example 3-22 on page 128). You can drop only logfiles that are in “inactive” or “unused” status.


Example 3-22 Migrate redo logs to GPFS

SQL> select group#, members,status,thread# from v$log;

    GROUP#    MEMBERS STATUS           THREAD#
---------- ---------- ---------------- ----------
         1          1 INACTIVE                  1
         2          1 CURRENT                   1
         3          1 CURRENT                   2
         4          1 UNUSED                    2

SQL> alter database add logfile thread 1 group 5 '/oradata/redo5.log' size 120M;
SQL> alter database add logfile thread 1 group 6 '/oradata/redo6.log' size 120M;
SQL> alter database add logfile thread 2 group 7 '/oradata/redo7.log' size 120M;
SQL> alter database add logfile thread 2 group 8 '/oradata/redo8.log' size 120M;

### While repeatedly running SQL> alter system switch logfile; on each node, drop the logfiles that are in “inactive” or “unused” status, one by one.

SQL> alter database drop logfile group 1;
SQL> alter database drop logfile group 2;
SQL> alter database drop logfile group 3;
SQL> alter database drop logfile group 4;

3.4.5 Migrating the spfile to GPFS

Example 3-23 shows how to migrate the spfile from raw device to GPFS.

Example 3-23 Migrate spfile to GPFS

SQL> create pfile='/tmp/tmppfile.ora' from spfile;

File created.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

SQL> startup pfile='/tmp/tmppfile.ora'
ORACLE instance started.

Total System Global Area 4966055936 bytes
Fixed Size                  2027648 bytes
Variable Size             889196416 bytes
Database Buffers         4060086272 bytes
Redo Buffers               14745600 bytes
Database mounted.
Database opened.
SQL> create spfile='/oradata/spfile_austindb' from pfile;

File created.


###Create a link from the new spfile to $ORACLE_HOME/dbs/spfile_name on both nodes.

{austin1:oracle}/oracle/ora102/dbs -> ln -s /oradata/spfile_austindb spfileaustindb1.ora

{austin2:oracle}/oracle/ora102/dbs -> ln -s /oradata/spfile_austindb spfileaustindb2.ora

###Create a pfile from spfile if necessary.

3.4.6 Migrating the password file

Using the orapwd utility, create a password file in the /oradata GPFS file system, as shown in Example 3-24.

Example 3-24 Migrate a password file

{austin1:oracle}/home/oracle -> orapwd file=/oradata/orapw_austindb password=itsoadmin

###Remove a previous password file and link to the new password file.

{austin1:oracle}/oracle/ora102/dbs -> ls -l
total 80
-rw-rw----   1 oracle   dba      1552 Sep 29 21:53 hc_austindb1.dat
-rw-rw----   1 oracle   dba      1552 Sep 27 09:32 hc_raw1.dat
-rw-r-----   1 oracle   dba      8385 Sep 11 1998  init.ora
-rw-r-----   1 oracle   dba        34 Sep 28 15:12 initaustindb1.ora
-rw-r-----   1 oracle   dba     12920 May 03 2001  initdw.ora
lrwxrwxrwx   1 oracle   dba        17 Sep 28 12:02 orapwaustindb1 -> /dev/rraw_pwdfile

{austin1:oracle}/oracle/ora102/dbs -> rm orapwaustindb1
{austin1:oracle}/oracle/ora102/dbs -> ln -s /oradata/orapw_austindb orapwaustindb1

###Remove a previous password file and link to the new password file on the second node.

root@austin2:/oracle/ora102/dbs> rm orapwaustindb2
root@austin2:/oracle/ora102/dbs> ln -s /oradata/orapw_austindb orapwaustindb2

3.4.7 Removing Oracle Clusterware

Because the previous CRS was installed on top of the HACMP cluster, you must reinstall CRS to convert to a CRS-only cluster. It is possible to remove CRS after HACMP is uninstalled; however, because the OCR and voting (vote) disks rely on the logical volumes that are managed by HACMP, we decided to remove CRS first and uninstall the HACMP filesets afterward. Use the following steps to remove CRS:

1. Stop the database, instances, and node applications as described in Example 3-33 on page 137.

2. Stop the CRS daemons on both nodes as shown in Example 3-33 on page 137.


3. Remove CRS as described in Appendix E, “How to cleanly remove CRS” on page 283.

3.4.8 Removing HACMP filesets and third-party clusterware information

Before you reinstall CRS, you must uninstall the HACMP filesets and remove the third-party clusterware-related information. The third-party clusterware-related information includes the /opt/ORCLcluster directory. Files in this directory were created during the previous HACMP-based CRS installation and contain information about HACMP. If the CRS installer detects this directory, it creates a soft link from $ORA_CRS_HOME/lib/libskgxn* to /opt/ORCLcluster/lib/*. Running the $ORA_CRS_HOME/root.sh script then fails during CRS installation with the error shown in Example 3-25.

Example 3-25 Error when running root.sh without removing the /opt/ORCLcluster directory

root@austin1:/oracle/crs> root.sh
WARNING: directory '/oracle' is not owned by root
Checking to see if Oracle CRS stack is already configured

Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
.
.
.
Startup will be queued to init within 30 seconds.
Adding daemons to inittab
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
Failure at final check of Oracle CRS stack.
10

To avoid this issue, perform the following tasks:

1. Uninstall the following HACMP filesets:

– HACMP filesets:

cluster.adt.*
cluster.doc.*
cluster.es.*

– Uninstall RSCT filesets for HACMP:

rsct.basic.hacmp
rsct.compat.basic.hacmp
rsct.compat.clients.hacmp

2. Remove the hagsuser group for oracle RAC.

3. Remove the directory /opt/ORCLcluster.
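
A minimal sketch of these removal tasks on AIX follows (run as root on both nodes). The exact fileset names installed on your systems can differ, so verify them with lslpp before removing anything:

root@austin1:/> lslpp -l "cluster.*" "rsct.*"
root@austin1:/> installp -ug cluster.adt cluster.doc cluster.es
root@austin1:/> installp -ug rsct.basic.hacmp rsct.compat.basic.hacmp rsct.compat.clients.hacmp
root@austin1:/> rmgroup hagsuser
root@austin1:/> rm -rf /opt/ORCLcluster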

3.4.9 Reinstalling Oracle Clusterware

Remove the previous CRS and reinstall it without HACMP. Refer to Appendix E, “How to cleanly remove CRS” on page 283 for removing CRS, and refer to 2.2, “Oracle 10g Clusterware installation” on page 46 to reinstall CRS on raw disks. Instead of storing OCR and voting (vote) devices on the GPFS file system, we use raw disks to prevent CRS and GPFS from getting in each other’s way during node recovery.


3.4.10 Switching links for two library files and relinking the database

We experienced an issue with the Oracle database, because the previous CRS installation left library links that point to the /opt/ORCLcluster directory. This issue is explained in 3.4.8, “Removing HACMP filesets and third-party clusterware information” on page 130. Several library files that are linked to the /opt/ORCLcluster directory must be re-pointed to the lib directory of the newly installed Clusterware ($ORA_CRS_HOME/lib). Change these links on both nodes, as shown in Example 3-26.

Example 3-26 Switch link for two library files

### Run the following commands on both nodes.
### Verify the current links for the two old database library files

root@austin1:/oracle/ora102/lib> ls -l libskgxn2*
lrwxrwxrwx   1 oracle   dba    32 Oct 03 14:04 libskgxn2.a -> /opt/ORCLcluster/lib/libskgxn2.a
lrwxrwxrwx   1 oracle   dba    33 Oct 03 14:04 libskgxn2.so -> /opt/ORCLcluster/lib/libskgxn2.so

### Verify the new CRS links for the newly created library files

root@austin1:/oracle/ora102/lib> cd /oracle/crs/lib
root@austin1:/oracle/crs/lib> ls -l libskgxn2*
lrwxrwxrwx   1 oracle   system 27 Oct 03 22:49 libskgxn2.a -> /oracle/crs/lib/libskgxns.a
lrwxrwxrwx   1 oracle   system 28 Oct 03 22:49 libskgxn2.so -> /oracle/crs/lib/libskgxns.so

### Remove the links for the two old database library files

root@austin1:/oracle/ora102/lib> rm libskgxn2.a
root@austin1:/oracle/ora102/lib> rm libskgxn2.so

### Make new links for the library files

root@austin1:/oracle/ora102/lib> ln -s /oracle/crs/lib/libskgxns.a libskgxn2.a
root@austin1:/oracle/ora102/lib> ln -s /oracle/crs/lib/libskgxns.so libskgxn2.so

### Verify the new links

root@austin1:/oracle/ora102/lib> ls -l libskgxn2*
lrwxrwxrwx   1 root     system 27 Oct 03 23:53 libskgxn2.a -> /oracle/crs/lib/libskgxns.a
lrwxrwxrwx   1 root     system 28 Oct 03 23:54 libskgxn2.so -> /oracle/crs/lib/libskgxns.so


Next, relink the database binaries on both nodes as oracle user, as shown in Example 3-27. If the Oracle home directory ($ORACLE_HOME) is moved to GPFS, relinking is done on one node only. However, the changes to /opt/ORCLcluster must be made on all nodes.

Example 3-27 Relink database binaries

###Run this command on both nodes.

{austin1:oracle}/oracle/ora102/bin -> relink all

3.4.11 Starting listeners

If not already started, start listeners on both nodes as shown in Example 3-28.

Example 3-28 Start a listener

{austin2:oracle}/home/oracle -> lsnrctl start

LSNRCTL for IBM/AIX RISC System/6000: Version 10.2.0.1.0 - Production on 03-OCT-2007 23:25:57

Copyright (c) 1991, 2005, Oracle. All rights reserved.

Starting /oracle/ora102/bin/tnslsnr: please wait...

TNSLSNR for IBM/AIX RISC System/6000: Version 10.2.0.1.0 - Production
System parameter file is /oracle/ora102/network/admin/listener.ora
Log messages written to /oracle/ora102/network/log/listener.log
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=austin2)(PORT=1521)))

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for IBM/AIX RISC System/6000: Version 10.2.0.1.0 - Production
Start Date                03-OCT-2007 23:25:58
Uptime                    0 days 0 hr. 0 min. 0 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      ON
Listener Parameter File   /oracle/ora102/network/admin/listener.ora
Listener Log File         /oracle/ora102/network/log/listener.log
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=austin2)(PORT=1521)))
The listener supports no services
The command completed successfully


3.4.12 Adding a database and instances

Register a database and instance using the srvctl command as shown in Example 3-29.

Example 3-29 Register a database and instance using the srvctl command

###Add a database

{austin1:oracle}/home/oracle -> srvctl add database -d austindb -o /oracle/ora102

###Add an instance for each node

{austin1:oracle}/home/oracle -> srvctl add instance -d austindb -i austindb1 -n austin1

{austin1:oracle}/home/oracle -> srvctl add instance -d austindb -i austindb2 -n austin2

{austin1:oracle}/home/oracle -> crs_stat -t

Name           Type           Target    State     Host
------------------------------------------------------------
ora....N1.lsnr application    ONLINE    ONLINE    austin1
ora....in1.gsd application    ONLINE    ONLINE    austin1
ora....in1.ons application    ONLINE    ONLINE    austin1
ora....in1.vip application    ONLINE    ONLINE    austin1
ora....N2.lsnr application    ONLINE    ONLINE    austin2
ora....in2.gsd application    ONLINE    ONLINE    austin2
ora....in2.ons application    ONLINE    ONLINE    austin2
ora....in2.vip application    ONLINE    ONLINE    austin2
ora....b1.inst application    ONLINE    ONLINE    austin1
ora....b2.inst application    ONLINE    ONLINE    austin2
ora....indb.db application    ONLINE    ONLINE    austin2

3.5 Migrating from RAC with HACMP cluster to GPFS using dd

This section describes how to migrate an HACMP-based RAC cluster to a GPFS-based RAC cluster using the dd command. The migration process is almost the same as the RMAN-based migration explained in the previous section; the only difference is in how the control files and data files are migrated. Here, we focus on this difference (steps 1 through 3 in the following list). For the remaining steps, refer to 3.4, “Migrating from HACMP-based RAC cluster to GPFS using RMAN” on page 123. The steps are:

1. Use the logical volume type (mklv -TO) and the dd copy command.

2. Migrate the control files to GPFS.

3. Migrate the data files to GPFS.

4. Migrate temp tablespace to GPFS.

5. Migrate redo logs to GPFS.

6. Migrate spfile to GPFS.

7. Migrate password file to GPFS.

8. Remove Oracle Clusterware.


9. Remove HACMP filesets and third-party clusterware-related information.

10. Reinstall Oracle Clusterware.

11. Switch the links for two library files and relink the database binaries.

12. Start listeners.

13. Add a database and instances using the srvctl command.

3.5.1 Logical volume type and the dd copy command

It is important to identify whether the raw devices used for Oracle are created with the mklv -TO (capital O, not number 0) command, which determines the parameters that you have to use for the dd command when copying the raw devices to flat files (in GPFS).

Depending on the logical volume device subtype, DS_LVZ or DS_LV (see Table 3-1), you need different options for the dd command. The mklv -TO flag indicates that the logical volume control block does not occupy the first block of the logical volume; therefore, that space is available for application data. Such a logical volume has a device subtype of DS_LVZ. A logical volume created without this option has a device subtype of DS_LV. For “classic” volume groups, the devsubtype of a logical volume is always DS_LV. For scalable format volume groups, the devsubtype of a logical volume is always DS_LVZ, regardless of whether the mklv -TO flag is used to create the logical volume.

Table 3-1 Types of volume group used for raw devices

- Normal volume group: the LV subtype is always DS_LV. The logical volume control block occupies the first block of the logical volume; the mklv -TO flag is always ignored in a normal volume group.
- Big volume group, mklv without -TO: the LV subtype is DS_LV. The logical volume control block occupies the first block of the logical volume.
- Big volume group, mklv with -TO: the LV subtype is DS_LVZ. The logical volume control block does not occupy the first block of the logical volume.
- Scalable volume group: the LV subtype is always DS_LVZ. The logical volume control block does not occupy the first block of the logical volume; DS_LVZ (mklv -TO) is always set by default in a scalable volume group.


If raw devices are not DS_LVZ type, when using the dd command to copy raw devices to a file system, you must skip the first block to avoid data corruption:

$ dd if=/dev/rraw_control1 of=/oradata/control1.dbf bs=4096 skip=1 count=30720
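
To size the copy, the logical volume capacity can be read from lslv (LPs multiplied by PP SIZE). The following sketch uses the values for raw_control1 shown in Example 3-19 and in the Tip later in this section (15 LPs of 8 MB each, that is 120 MB, which at bs=4096 is 30720 blocks):

# check the subtype and the size of the logical volume
root@austin1:/> lslv raw_control1 | grep -E "DEVICESUBTYPE|^LPs:|PP SIZE"
# 15 LPs x 8 MB = 120 MB; with bs=4096 that is 120 x 256 = 30720 blocks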

3.5.2 Migrate control files to GPFS

Because the raw devices in our test environment are created with mklv -TO, we use the dd command without the skip=1 option to copy all raw devices to the GPFS file system, as shown in Example 3-30 and Example 3-31 on page 136.

Example 3-30 Migrating control files using the dd command

SQL> alter system set control_files='/oradata/control1.dbf','/oradata/control2.dbf' scope=spfile;

SQL> shutdown immediateDatabase closed.Database dismounted.ORACLE instance shut down.

{austin1:oracle}/oradata -> dd if=/dev/rraw_control1 of=/oradata/control1.dbf bs=1M
120+0 records in.

{austin1:oracle}/oradata -> dd if=/dev/rraw_control2 of=/oradata/control2.dbf bs=1M
120+0 records in.

Tip: The following output shows how to determine if the “-TO” flag is used at LV creation:

root@austin1:/oracle/crs/bin> lslv raw_system
LOGICAL VOLUME:     raw_system             VOLUME GROUP:   oraclevg
LV IDENTIFIER:      00cc5d5c00004c000000011538f31351.1 PERMISSION: read/write
VG STATE:           active/complete        LV STATE:       closed/syncd
TYPE:               jfs                    WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        8 megabyte(s)
COPIES:             1                      SCHED POLICY:   parallel
LPs:                85                     PPs:            85
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    no
INTRA-POLICY:       middle                 UPPER BOUND:    128
MOUNT POINT:        N/A                    LABEL:          None
MIRROR WRITE CONSISTENCY: off
EACH LP COPY ON A SEPARATE PV ?: no
Serialize IO ?:     NO
DEVICESUBTYPE :     DS_LVZ

If the “-TO” flag is used in a big volume group, it will show the following additional attribute:

"DEVICESUBTYPE : DS_LVZ".


3.5.3 Migrate data files to GPFS

Each raw device data file, except those belonging to the system and undo tablespaces, can be migrated to GPFS with the dd command while the database is online, as shown in Example 3-31. The system and undo tablespaces cannot be taken offline independently; for those, refer to Example 3-32 on page 137.

Example 3-31 Online data file migration using the dd command

SQL> alter tablespace example offline;
Tablespace altered.

{austin1:oracle}/home/oracle -> dd if=/dev/rraw_example of=/oradata/example.dbf bs=1M
160+0 records in.
160+0 records out.

SQL> alter database rename file '/dev/rraw_example' to '/oradata/example.dbf';
Database altered.

SQL> alter tablespace example online;
Tablespace altered.

### Repeat the same process for the other data files (the sysaux and users tablespaces), except the system and undo tablespaces.

SQL> alter tablespace sysaux offline;
Tablespace altered.

{austin1:oracle}/home/oracle -> dd if=/dev/rraw_sysaux of=/oradata/sysaux.dbf bs=1M
160+0 records in.
160+0 records out.

SQL> alter database rename file '/dev/rraw_sysaux' to '/oradata/sysaux.dbf';
Database altered.

SQL> alter tablespace sysaux online;
Tablespace altered.

SQL> alter tablespace users offline;
Tablespace altered.

{austin1:oracle}/home/oracle -> dd if=/dev/rraw_users of=/oradata/users.dbf bs=1M
160+0 records in.
160+0 records out.

SQL> alter database rename file '/dev/rraw_users' to '/oradata/users.dbf';
Database altered.

SQL> alter tablespace users online;
Tablespace altered.

Because the system and undo tablespaces cannot be taken offline, run the dd command for them as shown in Example 3-32 on page 137, with the database in mount (not open) status.


Example 3-32 Migrate system and undo tablespaces using the dd command

SQL> startup mount
ORACLE instance started.

Total System Global Area 4966055936 bytes
Fixed Size                  2027648 bytes
Variable Size             905973632 bytes
Database Buffers         4043309056 bytes
Redo Buffers               14745600 bytes
Database mounted.

{austin1:oracle}/home/oracle -> dd if=/dev/rraw_system of=/oradata/system.dbf bs=1M
SQL> alter database rename file '/dev/rraw_system' to '/oradata/system.dbf';
Database altered.

{austin1:oracle}/home/oracle -> dd if=/dev/rraw_undotbs1 of=/oradata/undotbs1.dbf bs=1M
SQL> alter database rename file '/dev/rraw_undotbs1' to '/oradata/undotbs1.dbf';
Database altered.

{austin1:oracle}/home/oracle -> dd if=/dev/rraw_undotbs2 of=/oradata/undotbs2.dbf bs=1M
SQL> alter database rename file '/dev/rraw_undotbs2' to '/oradata/undotbs2.dbf';
Database altered.

SQL> alter database open;

Database altered.

3.6 Upgrading from HACMP V5.2 to HACMP V5.3

In our environment, we tested the HACMP upgrade in an Oracle 10g RAC environment. We found that rolling migration is not supported, and we also experienced periodic reboots while either node was down. Because CRS relies on raw devices that are managed by HACMP, system reboots can occur when the OCR and voting (vote) devices are lost. To upgrade HACMP in the RAC environment, we perform the steps shown in Example 3-33.

Example 3-33 Stop CRS and database and upgrade HACMP in the RAC environment

###Check the current status of CRS and database

{austin1:oracle}/home/oracle -> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....N1.lsnr application    ONLINE    ONLINE    austin1
ora....in1.gsd application    ONLINE    ONLINE    austin1
ora....in1.ons application    ONLINE    ONLINE    austin1
ora....in1.vip application    ONLINE    ONLINE    austin1
ora....N2.lsnr application    ONLINE    ONLINE    austin2
ora....in2.gsd application    ONLINE    ONLINE    austin2
ora....in2.ons application    ONLINE    ONLINE    austin2
ora....in2.vip application    ONLINE    ONLINE    austin2
ora....b1.inst application    ONLINE    ONLINE    austin1
ora....b2.inst application    ONLINE    ONLINE    austin2
ora....indb.db application    ONLINE    ONLINE    austin1

### Stop the database and nodeapps, in that order.

{austin1:oracle}/home/oracle -> srvctl stop database -d austindb
{austin1:oracle}/home/oracle -> srvctl stop nodeapps -n austin1
{austin1:oracle}/home/oracle -> srvctl stop nodeapps -n austin2
{austin1:oracle}/home/oracle -> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....N1.lsnr application    OFFLINE   OFFLINE
ora....in1.gsd application    OFFLINE   OFFLINE
ora....in1.ons application    OFFLINE   OFFLINE
ora....in1.vip application    OFFLINE   OFFLINE
ora....N2.lsnr application    OFFLINE   OFFLINE
ora....in2.gsd application    OFFLINE   OFFLINE
ora....in2.ons application    OFFLINE   OFFLINE
ora....in2.vip application    OFFLINE   OFFLINE
ora....b1.inst application    OFFLINE   OFFLINE
ora....b2.inst application    OFFLINE   OFFLINE
ora....indb.db application    OFFLINE   OFFLINE

### As root, stop CRS on both nodes

root@austin1:/oracle/crs/bin> crsctl stop crs
Stopping resources. This can take several minutes.
Successfully stopped CRS resources.
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.

### Stop the HACMP cluster on both nodes using smitty clstop.
### Install the HACMP 5.3 filesets on both nodes and reboot the systems.

3.7 GPFS upgrade from 2.3 to 3.1

This section describes the procedure that we use to migrate a GPFS V2.3 file system to GPFS V3.1 in an Oracle 10g RAC environment. You can migrate from GPFS V2.3 to V3.1 in at least two ways:

- Upgrading the code and the existing file system, reusing disk and configuration

- Exporting the file system, deleting the GPFS cluster, removing the old code, installing the new code, then creating the new GPFS cluster, and finally, importing the previously exported file systems


The major advantage of the second option is that it also allows for easy fallback to the previous configuration if you experience problems with the new GPFS version.

In this section, we document both methods. The setup used for this exercise is a two-node cluster with nodes dallas1 and dallas2. GPFS Version 2.3 is installed, and there are two file systems: /oradata and /orabin. For details, refer to Appendix C, “Creating a GPFS 2.3” on page 263.

3.7.1 Upgrading using the mmchconfig and mmchfs commands

Figure 3-6 presents the cluster diagram (nodes dallas1 and dallas2) that we use for this migration scenario.

Figure 3-6 Test configuration diagram

(The diagram shows the two nodes, dallas1 and dallas2: public addresses 192.168.100.33 and 192.168.100.34, VIPs dallas1_vip 192.168.100.133 and dallas2_vip 192.168.100.134, and RAC interconnect addresses 10.1.100.33 and 10.1.100.34. Each node has local rootvg and alt_rootvg disks, and both share DS4800 LUNs hdisk2 through hdisk22 over fcs0.)

Migrating to GPFS 3.1 from GPFS 2.3 consists of the following steps:

1. Stop all file system user activity. For Oracle 10g RAC, stop this activity by running the following command as the root user on all nodes:

crsctl stop crs

2. As root, cleanly unmount all GPFS file systems. Do not use force unmount. Use the fuser -cux command to identify any leftover processes attached to the file system.
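
A minimal sketch of this step follows (the file system names /oradata and /orabin are the ones used in this scenario):

root@dallas1:/> mmumount all -a            # cleanly unmount the GPFS file systems on all nodes
# if a file system cannot be unmounted, identify the processes still using it:
root@dallas1:/> fuser -cux /oradata
root@dallas1:/> fuser -cux /orabin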

3. Stop GPFS on all nodes in the cluster (as root user):

mmshutdown -a

Note: In preparation for any migration or upgrade operation, we strongly recommend that you save your data and also have a fallback or recovery plan in case something goes wrong during this process.

Note: You might also need to run the emctl stop dbconsole and isqlplusctl stop commands. Any scripts run from cron or other places must be stopped as well.



4. Copy the GPFS installation packages and install the new code on all nodes in the cluster (we use AIX Network Installation Management, NIM).

5. Start GPFS on all nodes in the cluster and mount the file systems if this is not done automatically on GPFS daemon start:

mmstartup -a; mmmount all -a

6. Operate GPFS with the new level of code until you are sure that you want to permanently migrate.

7. Migrate the cluster configuration data and enable the new cluster-wide functionality as shown in Example 3-34.

Example 3-34 Migrating cluster configuration data

root@dallas1:/> mmchconfig release=LATEST
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.
root@dallas1:/>

8. Migrate all file systems to reflect the latest metadata format changes. For each file system in your cluster, use: mmchfs <file_system> -V, as shown in Example 3-35.

Example 3-35 Upgrading file systems using mmchfs

root@dallas1:/> mmchfs oradata -V
GPFS: 6027-471 You have requested that the file system be upgraded to version 9.03.
This will enable new functionality but will prevent you from using
the file system with earlier releases of GPFS. Do you want to continue?
y
root@dallas1:/> mmchfs orabin -V
GPFS: 6027-471 You have requested that the file system be upgraded to version 9.03.
This will enable new functionality but will prevent you from using
the file system with earlier releases of GPFS. Do you want to continue?
y
root@dallas1:/>

For more details about the GPFS upgrade procedure, see the manual GPFS V3.1 Concepts, Planning, and Installation Guide, GA76-0413.

3.7.2 Upgrading using mmexportfs, cluster recreation, and mmimportfs

Start by shutting down all activity (see 3.7.1, “Upgrading using the mmchconfig and mmchfs commands” on page 139), and then perform the following actions:

1. Export the file systems one by one: mmexportfs <file system> -o <Export-file>, as shown in Example 3-36 on page 141.

Note: After the upgrade, the output of the mmlsconfig command shows the same maxFeatureLevelAllowed (822) as before, which is normal behavior.

Important: In this scenario, we delete the existing GPFS cluster and recreate it after installing new GPFS code. You must prepare the environment, node, and disk definition files for the new cluster.


Example 3-36 Exporting file systems

root@dallas1:/etc/gpfs_config> cd 3.1-Upgrade
root@dallas1:/etc/gpfs_config/3.1-Upgrade> mmexportfs oradata -o oradata.exp

mmexportfs: Processing file system oradata ...
mmexportfs: 6027-1371 Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.
root@dallas1:/etc/gpfs_config/3.1-Upgrade> mmexportfs orabin -o orabin.exp

mmexportfs: Processing file system orabin ...
mmexportfs: 6027-1371 Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.
root@dallas1:/etc/gpfs_config/3.1-Upgrade>

2. Check the current cluster configuration as shown Example 3-37.

Example 3-37 Checking mmlscluster

root@dallas1:/etc/gpfs_config/3.1-Upgrade> mmlscluster > Cluster-Def
root@dallas1:/etc/gpfs_config/3.1-Upgrade> cat Cluster-Def

GPFS cluster information
========================
  GPFS cluster name:         dallas_cluster.dallas1_interconnect
  GPFS cluster id:           720967509442396828
  GPFS UID domain:           dallas_cluster.dallas1_interconnect
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    dallas1_interconnect
  Secondary server:  dallas2_interconnect

 Node  Daemon node name      IP address   Admin node name       Designation
-----------------------------------------------------------------------
   1   dallas1_interconnect  10.1.100.33  dallas1_interconnect  quorum-manager
   2   dallas2_interconnect  10.1.100.34  dallas2_interconnect  quorum-manager

root@dallas1:/etc/gpfs_config/3.1-Upgrade>

Note: The mmexportfs command actually removes the file system definition from the cluster.

Note: We recommend that you use mmexportfs for individual file systems, and do not use mmexportfs all, because this will also export NSD disks that are not used for any file systems, such as tiebreaker NSDs. Using all can create issues when importing all file systems into the new cluster.


3. Document the current quorum tiebreaker disk configuration as shown in Example 3-38. The output of mmlsconfig correctly states (none) in the list of file systems, because the file systems are exported.

Example 3-38 Documenting current configuration

root@dallas1:/etc/gpfs_config/3.1-Upgrade> mmlsconfig > Cluster-Config
root@dallas1:/etc/gpfs_config/3.1-Upgrade> mmlspv > Cluster-PV
root@dallas1:/etc/gpfs_config/3.1-Upgrade> cat Cluster-Config
Configuration data for cluster dallas_cluster.dallas1_interconnect:
-------------------------------------------------------------------
clusterName dallas_cluster.dallas1_interconnect
clusterId 720967509442396828
clusterType lc
multinode yes
autoload yes
useDiskLease yes
maxFeatureLevelAllowed 822
tiebreakerDisks nsd_tb1;nsd_tb2;nsd_tb3
[dallas2_interconnect]

File systems in cluster dallas_cluster.dallas1_interconnect:
------------------------------------------------------------
(none)
root@dallas1:/etc/gpfs_config/3.1-Upgrade> cat Cluster-PV
hdisk7 nsd_tb1
hdisk8 nsd_tb2
hdisk9 nsd_tb3
hdisk10 nsd01
hdisk11 nsd02
hdisk12 nsd03
hdisk13 nsd04
hdisk14 nsd05
hdisk15 nsd06
root@dallas1:/etc/gpfs_config/3.1-Upgrade>

4. Shut down the GPFS cluster:

mmshutdown -a

5. Remove the GPFS cluster:

mmdelnode -a

6. Remove the current GPFS code:

installp -u gpfs.*

7. Install the new code using the method of choice (we use Network Installation Management (NIM)).

8. Create a new cluster using the information in the <Cluster-Def> file created in the previous step and node descriptor files. Refer to 2.1.6, “GPFS configuration” on page 30 for details about how to create the GPFS cluster.
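
A minimal sketch of recreating the cluster from the saved definition follows. The node descriptor file name gpfs_nodes and its contents are shown only as an illustration; build them from your own <Cluster-Def> output and adjust the cluster name, primary and secondary servers, and remote command settings accordingly:

root@dallas1:/etc/gpfs_config/3.1-Upgrade> cat gpfs_nodes
dallas1_interconnect:quorum-manager
dallas2_interconnect:quorum-manager
root@dallas1:/etc/gpfs_config/3.1-Upgrade> mmcrcluster -N gpfs_nodes \
  -p dallas1_interconnect -s dallas2_interconnect \
  -r /usr/bin/ssh -R /usr/bin/scp -C dallas_cluster -A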

9. Create tiebreaker disks.

Note: Because we are reusing disk, use the -v no option to let mmcrnsd overwrite disks:

mmcrnsd -F /etc/gpfs_config/gpfs_disks_tb -v no


10.Add tiebreaker disks to the cluster:

mmchconfig tiebreakerDisks="nsd_tb1;nsd_tb2;nsd_tb3"

11.Start GPFS:

mmstartup -a

12.Import file systems, one by one, as shown in Example 3-39.

Example 3-39 Importing file systems

root@dallas1:/etc/gpfs_config/3.1-Upgrade> mmimportfs oradata -i oradata.exp

mmimportfs: Processing file system oradata ...
mmimportfs: Processing disk nsd01
mmimportfs: Processing disk nsd02
mmimportfs: Processing disk nsd03
mmimportfs: Processing disk nsd04

mmimportfs: Committing the changes ...

mmimportfs: The following file systems were successfully imported:
        oradata
mmimportfs: The NSD servers for the following disks from file system oradata were reset or not defined:
        nsd01
        nsd02
        nsd03
        nsd04
mmimportfs: Use the mmchnsd command to assign NSD servers as needed.
mmimportfs: 6027-1371 Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.
root@dallas1:/etc/gpfs_config/3.1-Upgrade> mmimportfs orabin -i orabin.exp

mmimportfs: Processing file system orabin ...
mmimportfs: Processing disk nsd05
mmimportfs: Processing disk nsd06

mmimportfs: Committing the changes ...

mmimportfs: The following file systems were successfully imported:
        orabin
mmimportfs: The NSD servers for the following disks from file system orabin were reset or not defined:
        nsd05
        nsd06
mmimportfs: Use the mmchnsd command to assign NSD servers as needed.
mmimportfs: 6027-1371 Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.
root@dallas1:/etc/gpfs_config/3.1-Upgrade>

13.Mount the file systems as shown in Example 3-40 on page 144.


Example 3-40 Mounting file systems

root@dallas1:/etc/gpfs_config/3.1-Upgrade> mmmount oradata
Thu Sep 20 11:49:20 CDT 2007: 6027-1623 mmmount: Mounting file systems ...
root@dallas1:/etc/gpfs_config/3.1-Upgrade> mmmount orabin
Thu Sep 20 11:49:26 CDT 2007: 6027-1623 mmmount: Mounting file systems ...

3.8 Moving OCR and voting disks from GPFS to raw devices

This section presents the actions that we took to move OCR and Oracle Clusterware voting disks from GPFS to raw devices. In order to move OCR and voting disks out of GPFS, you must first prepare the raw partitions and then run the commands for actually moving OCR and voting disks.

3.8.1 Preparing the raw devices

To prepare the raw devices, we performed the following actions:

1. Create a raw device (LUN/hdisk) for each component that is being moved. We have two LUNs for OCR and three LUNs for CRS voting disks.

2. Use at least 20 MB per voting disk partition. The OCR devices must be owned by root, with the group set to the group of the Oracle installation owner, in this case dba. The voting disks must be owned by the Oracle installation owner and group, in this case oracle and dba. Permissions must be 640 for OCR and 644 for the voting disks.

The current voting disks can be listed with the crsctl command, as shown in Example 3-41.

Example 3-41 Checking CRS voting disks

oracle@dallas1:/oracle> crsctl query css votedisk
 0.     0    /oradata/crs/votedisk1
 1.     0    /oradata/crs/votedisk2
 2.     0    /oradata/crs/votedisk3

located 3 votedisk(s).oracle@dallas1:/oracle>

OCR device/path names can be obtained using the ocrcheck command, as shown in Example 3-42 on page 145.

Note: Even though the Oracle installation documentation states that a minimum of 100 MB is required for OCR, for the replacement you need LUNs of at least 256 MB. In fact, even 256 MB did not work in our test case, so we had to increase the LUN size to 260 MB. We have seen the following errors:

- PROT-21: Invalid parameter

- PROT-16: Internal error

- PROT-22: Storage too small

These all seem to be related to insufficient LUN size or insufficient privilege.


Example 3-42 Checking OCR

oracle@dallas1:/oracle> ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     130984
         Used space (kbytes)      :       1984
         Available space (kbytes) :     129000
         ID                       : 2020226445
         Device/File Name         : /oradata/crs/OCR1
                                    Device/File integrity check succeeded
         Device/File Name         : /oradata/crs/OCR2
                                    Device/File integrity check succeeded

Cluster registry integrity check succeededoracle@dallas1:/oracle>

3. Use the mknod command to create a device with a meaningful name, using the same major/minor number as the AIX hdisk.

In Example 3-43, we show how to identify the LUNs for DS4000 Series storage.

Example 3-43 Getting LUN information

root@dallas1:/> fget_config -vA

---dar0---

User array name = 'Austin_DS4800'
dac0 ACTIVE dac1 ACTIVE

Disk     DAC   LUN Logical Drive
hdisk2   dac1    0 DALLAS_ocr1
hdisk3   dac0    1 DALLAS_ocr2
hdisk4   dac1    2 DALLAS_vote1
hdisk5   dac1    3 DALLAS_vote2
hdisk6   dac0    4 DALLAS_vote3
hdisk7   dac0    5 DALLAS_gptb1
hdisk8   dac0    6 DALLAS_gptb2
hdisk9   dac1    7 DALLAS_gptb3
hdisk10  dac1    8 DALLAS_gpDataA1
hdisk11  dac0    9 DALLAS_gpDataA2
hdisk12  dac0   10 DALLAS_gpDataB1
hdisk13  dac1   11 DALLAS_gpDataB2
hdisk14  dac1   12 DALLAS_gpOraHomeA
hdisk15  dac0   13 DALLAS_gpOraHomeB
root@dallas1:/>

Knowing the mapping between LUN names and AIX default naming, we can now get the major/minor numbers that we need to create devices, as shown in Example 3-44 on page 146.


Example 3-44 Listing major/minor numbers

root@dallas1:/> ls -l /dev/hdisk[2-6]
brw-------   1 root     system    36,  3 Sep 18 10:21 /dev/hdisk2
brw-------   1 root     system    36,  4 Sep 18 10:21 /dev/hdisk3
brw-------   1 root     system    36,  5 Sep 18 10:21 /dev/hdisk4
brw-------   1 root     system    36,  6 Sep 18 10:21 /dev/hdisk5
brw-------   1 root     system    36,  7 Sep 18 10:21 /dev/hdisk6
root@dallas1:/>

The LUN DALLAS_ocr1 is used for OCR. It translates to hdisk2, so its major and minor numbers are 36 and 3. Example 3-45 shows how we use the mknod command to create the device OCR1.

Example 3-45 Creating the new devices

root@dallas1:/> mknod /dev/OCR1 c 36 3
root@dallas1:/> mknod /dev/OCR2 c 36 4
root@dallas1:/> mknod /dev/crs_votedisk1 c 36 5
root@dallas1:/> mknod /dev/crs_votedisk2 c 36 6
root@dallas1:/> mknod /dev/crs_votedisk3 c 36 7

The link between these new devices and the hdisks is the major/minor number pair assigned by AIX default naming. To identify which hdisk is actually used for /dev/crs_votedisk2, look for the same major and minor numbers, as shown in Example 3-46.

Example 3-46 Listing all devices with the specific major/minor number

root@dallas1:/> ls -l /dev/crs_votedisk2
crw-r--r--   1 root     system    36, 6 Sep 14 15:38 /dev/crs_votedisk2
root@dallas1:/> ls -l /dev | grep "36, 6"
crw-r--r--   1 root     system    36, 6 Sep 14 15:38 crs_votedisk2
brw-------   1 root     system    36, 6 Sep 10 10:07 hdisk5
crw-------   1 root     system    36, 6 Sep 10 10:07 rhdisk5
root@dallas1:/>

4. Set ownership to root.dba and permissions to 640 for all OCR devices, and ownership to oracle.dba and permissions to 644 for the CRS voting disk devices. Make sure that the AIX LUN reservation policy is set to no_reserve. To change the reservation policy, use the chdev command, as shown in Example 3-47.

Example 3-47 Setting and verifying reservation policy

root@dallas1:/> chdev -l hdisk5 -a reserve_policy=no_reserve
root@dallas1:/> lsattr -El hdisk5
PR_key_value    none                             Persistant Reserve Key Value           True
cache_method    fast_write                       Write Caching method                   False
ieee_volname    600A0B800011A6620000019A00125A8A IEEE Unique volume name                False
lun_id          0x0003000000000000               Logical Unit Number                    False
max_transfer    0x100000                         Maximum TRANSFER Size                  True
prefetch_mult   1                                Multiple of blocks to prefetch on read False
pvid            none                             Physical volume identifier             False
q_type          simple                           Queuing Type                           False
queue_depth     10                               Queue Depth                            True
raid_level      5                                RAID Level                             False
reassign_to     120                              Reassign Timeout value                 True
reserve_policy  no_reserve                       Reserve Policy                         True
rw_timeout      30                               Read/Write Timeout value               True
scsi_id         0x661600                         SCSI ID                                False
size            20                               Size in Mbytes                         False
write_cache     yes                              Write Caching enabled                  False
root@dallas1:/>
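
A minimal sketch of the ownership and permission settings described in steps 2 and 4 (run as root on all nodes; the device names are the ones created in Example 3-45):

root@dallas1:/> chown root.dba /dev/OCR1 /dev/OCR2
root@dallas1:/> chmod 640 /dev/OCR1 /dev/OCR2
root@dallas1:/> chown oracle.dba /dev/crs_votedisk1 /dev/crs_votedisk2 /dev/crs_votedisk3
root@dallas1:/> chmod 644 /dev/crs_votedisk1 /dev/crs_votedisk2 /dev/crs_votedisk3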

5. Make sure that the new raw devices do not contain any information that might confuse CRS. We used the dd command to erase any existing data on /dev/OCR1, as shown in Example 3-48.

Example 3-48 Using dd to erase disks

root@dallas1:/> dd if=/dev/zero of=/dev/OCR1 bs=1024k
dd: 0511-053 The write failed.: There is a request to a device or address that does not exist.
262+0 records in.
260+0 records out.

6. Erase all raw devices before proceeding to the next step. Refer to the UNIX man pages for more information about the dd command. The write error in Example 3-48 simply indicates that dd reached the end of the raw device (the amount of data read from /dev/zero is larger than the device). To check the device size, use the bootinfo command, as shown in Example 3-49.

Example 3-49 Checking device size

root@dallas1:/> bootinfo -s hdisk2
260

3.8.2 Moving OCR

Make sure that Oracle Clusterware is running on all nodes in the cluster. If any node is shut down during this operation, you must run ocrconfig -repair on this node afterwards:

1. Create a backup of the OCR. The ocrconfig command must be run as root, as shown in Example 3-50.

Example 3-50 Creating a backup of OCR

root@dallas1:/> /orabin/crs/bin/ocrconfig -export /oradata/OCR-before-move.bck -s online

Example 3-51 shows how to restore OCR information from a backup.

Example 3-51 Restoring a backup of OCR

ocrconfig -import /oradata/OCR-before-move.bck

2. Move the OCR as shown in Example 3-52 on page 148.

Note: Make sure that the mknod, chown, chmod, and chdev commands are run on all nodes in the cluster.


Example 3-52 Moving the OCR location

root@dallas1:/> /orabin/crs/bin/ocrconfig -replace ocr /dev/OCR1
root@dallas1:/> /orabin/crs/bin/ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       3292
         Available space (kbytes) :     258828
         ID                       :  187234612
         Device/File Name         : /dev/OCR1
                                    Device/File integrity check succeeded
         Device/File Name         : /oradata/crs/OCR2
                                    Device/File integrity check succeeded

Cluster registry integrity check succeeded

Example 3-53 shows how to move the OCR mirror.

Example 3-53 Moving OCR mirror location

root@dallas1:/> /orabin/crs/bin/ocrconfig -replace ocrmirror /dev/OCR2
root@dallas1:/> /orabin/crs/bin/ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       3292
         Available space (kbytes) :     258828
         ID                       :  187234612
         Device/File Name         : /dev/OCR1
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/OCR2
                                    Device/File integrity check succeeded

Cluster registry integrity check succeeded

3.8.3 Moving CRS voting disks

Compared to OCR, voting disks are used to store dynamic cluster information; thus, we do not recommend changing these disks while CRS is running. You must first shut down Oracle Clusterware on all nodes:

1. Shut down Oracle Clusterware on all nodes, as shown in Example 3-54.

Example 3-54 Shutting down Oracle Clusterware

root@dallas2:/orabin/crs/bin> crsctl stop crs
Stopping resources. This could take several minutes.
Successfully stopped CRS resources.
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.

2. The current voting disks are listed using the crsctl command, as shown in Example 3-55 on page 149.


Example 3-55 Listing current voting disks

root@dallas1:/> /orabin/crs/bin/crsctl query css votedisk
 0.     0    /oradata/crs/votedisk1
 1.     0    /oradata/crs/votedisk2
 2.     0    /oradata/crs/votedisk3

located 3 votedisk(s).

Even though Oracle Clusterware is shut down, we still need to use the -force option when deleting and adding voting disks. Example 3-56 shows how to delete and add voting disks with the crsctl command.

Example 3-56 Deleting and adding voting disks

root@dallas1:/> /orabin/crs/bin/crsctl delete css votedisk /oradata/crs/votedisk1 -force
successful deletion of votedisk /oradata/crs/votedisk1.
root@dallas1:/> /orabin/crs/bin/crsctl add css votedisk /dev/crs_votedisk1 -force
Now formatting voting disk: /dev/crs_votedisk1
successful addition of votedisk /dev/crs_votedisk1.
root@dallas1:/> /orabin/crs/bin/crsctl query css votedisk
 0.     0    /dev/crs_votedisk1
 1.     0    /oradata/crs/votedisk2
 2.     0    /oradata/crs/votedisk3

located 3 votedisk(s).

3. Repeat this for all voting disks. Example 3-57 shows how we remove the remaining two voting disks.

Example 3-57 Removing the remaining voting disks

root@dallas1:/> /orabin/crs/bin/crsctl delete css votedisk /oradata/crs/votedisk2 -force
successful deletion of votedisk /oradata/crs/votedisk2.
root@dallas1:/> /orabin/crs/bin/crsctl add css votedisk /dev/crs_votedisk2 -force
Now formatting voting disk: /dev/crs_votedisk2
successful addition of votedisk /dev/crs_votedisk2.
root@dallas1:/> /orabin/crs/bin/crsctl delete css votedisk /oradata/crs/votedisk3 -force
successful deletion of votedisk /oradata/crs/votedisk3.
root@dallas1:/> /orabin/crs/bin/crsctl add css votedisk /dev/crs_votedisk3 -force
Now formatting voting disk: /dev/crs_votedisk3
successful addition of votedisk /dev/crs_votedisk3.
root@dallas1:/> /orabin/crs/bin/crsctl query css votedisk
 0.     0    /dev/crs_votedisk1
 1.     0    /dev/crs_votedisk2
 2.     0    /dev/crs_votedisk3

located 3 votedisk(s).

Note: During our testing, Oracle Clusterware rebooted the nodes (due to our own mistake) while we were adding a voting disk, which left a voting disk entry without a name. Use /orabin/crs/bin/crsctl delete css votedisk ... to forcefully remove such an entry.


4. Finally, start Oracle Clusterware on all nodes.
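For example, as root on each node:

/orabin/crs/bin/crsctl start crs
/orabin/crs/bin/crsctl check crs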

For more information, refer to Chapter 3 of the Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide, 10g Release 2 (10.2), Part Number B14197-04.


Part 3 Disaster recovery and maintenance scenarios

Part 3 covers configurations and considerations for disaster recovery (DR) scenarios. We describe the architecture and the steps needed to implement a disaster-resilient Oracle RAC configuration using GPFS and storage replication. We also present several tools that help you use, protect, and maintain your environment more effectively, such as GPFS snapshots and storage pools (introduced in GPFS V3.1).


Chapter 4. Disaster recovery scenario using GPFS replication

This chapter describes the architecture and the steps that we take to set up a disaster recovery configuration for Oracle 10g RAC using GPFS mirroring.

GPFS mirroring is also known as replication and is independent of any other replication mechanism (storage-based or AIX Logical Volume Manager (LVM)). GPFS replication uses synchronous mirroring. This solution consists of production nodes and storage located in two sites, plus a third node located in a separate (third) site. The node in the third site keeps the GPFS cluster alive in case one of the production sites fails. It acts as a quorum buster for both the GPFS cluster and the GPFS file systems. It also participates in Oracle CRS voting, through a CRS voting disk defined on an NFS share exported by this third node. The third node is not connected to the SAN.

The advantages of this solution are:

- No outage or disruption occurs in the case of the total loss of a site (node and storage).

- No manual intervention is required. The disaster recovery failover works in unattended mode.

The disadvantages of this solution are:

- You must have three sites: two production sites with nodes and storage, plus a third site with only a (low-end) node that has internal disks only. The third node cannot be located in either of the production sites.

- Only applications using GPFS or Oracle 10g RAC are protected.

- Networking must be configured to support Oracle Virtual IP (VIP) address takeover.


4.1 Architectural considerations

In this section, we describe the architecture and design elements that you must consider when planning and implementing a disaster recovery configuration.

4.1.1 High availability: One storage

A failure of a single component (server, switch, adapter, cable, power, disk, and so on) can lead to an application outage. This single component is defined as a Single Point of Failure (SPOF). Identifying these components is critical for designing an environment that is resilient to the various failures. The components that are identified as SPOFs can be eliminated (doubled and managed) so that failures that do occur will not affect the application users. Defining a highly available architecture means that all SPOFs are eliminated.

Server level
EtherChannel and Multi-Path I/O (MPIO) provide high availability at the AIX level. Each new release of AIX also provides more features that contribute to continuous operations, for example, by reducing the need to reboot the server when upgrading the OS or performing system maintenance. However, the server itself remains a SPOF.

Storage level
A SAN can also provide high availability, because the storage subsystems are designed to be fully redundant and fault resilient. Failures of individual spindles (disks) are managed by the RAID algorithm through automatic replacement with hot spare disks. All Fibre Channel connections are at least doubled, and there are two separate controllers to manage host access to the data. There is no single point of failure.

Application level
On top of resilient hardware, Oracle 10g RAC provides a highly available database through Oracle Clusterware and the RAC software.

Figure 4-1 on page 155 shows a common architecture for high availability: two nodes connected to a storage device. Of course, the nodes belong to two different physical frames. We do not recommend using two logical partitions (LPARs) in the same frame, because the frame itself is a single point of failure. This solution is also called local high availability, because the two servers and the storage device are located in the same data center.


Figure 4-1 Typical high availability architecture: two nodes and one type of storage in one data center

This setup provides excellent high availability for your IT environment. But in case of a global disaster that affects the entire data center, all the hardware, servers, and storage are lost at the same time. A disaster can be a fire, a flood, a building collapse, or a power supply failure, but also a malicious attack or an act of terrorism.

To address these issues related to a single data center and thus reduce the risk related to a disaster, you must use a second data center. In this case, this is not called high availability, but disaster recovery.

In addition to having two data centers, a disaster recovery solution also requires two storage subsystems and a storage replication mechanism.

4.1.2 Disaster recovery: Two storage subsystems

To achieve disaster recovery, you must have two data centers. Figure 4-2 on page 156 shows a common disaster recovery architecture, which is also called multi-site architecture. There is one node and one storage device in each site. You must consider the location of the two data centers, and more importantly, the level of separation between the sites.


Figure 4-2 Typical disaster recovery architecture

Distance considerations
The distance between the sites provides better separation, but it also introduces latency in the communication between the sites (IP and SAN).

The distance between the sites impacts the system performance depending on the data throughput required by your application. If the throughput is high, a maximum of a few kilometers between the two sites must be considered. If the SAN is less heavily used, a distance of 20 to 40 km is not a problem. These distances are only indicative and vary depending on the quality of the SAN, the disk I/O load, the application, and so on.

A good compromise is to locate the two data centers in two different buildings of the same company. The distance is then less than a few kilometers, so it is not a concern. This is a good response to fire, but not the best for earthquakes or floods. This setup is called a campus-wide disaster recovery solution.

Another frequently used architecture involves two sites of the company in the same city, or nearby, located less than 20 km (12 miles) apart. The impact of the distance remains reasonable, and it is still practical to administer and manage both sites. This architecture is considered a metropolitan disaster recovery solution.

To fully address the earthquake risk, imagine a backup data center on another continent. Here, the major point is only the distance, and a completely different set of solutions applies (for example, asynchronous replication). We do not address this subject in this book.

Mirroring considerations
Because we have two storage units (one in each site) and the same set of nodes and applications accessing the same data, mirroring must be defined between the two storage units. Mirroring is what keeps the application running, with a full copy of the data, when only one site survives.

In campus-wide and metropolitan-wide solutions, data mirroring can be synchronous. Beyond 100 km (62 miles), or between two continents, data mirroring must be asynchronous. Even though technically possible, synchronous mirroring over thousands of kilometers is not a viable approach due to the poor I/O performance (which makes the application respond too slowly).

You can implement mirroring at the file system level (here GPFS), or at the storage level, by using Metro Mirror (synchronous) or Global Mirror (asynchronous). There are differences between these mirroring methods. We describe GPFS mirroring, also called replication, in this chapter. We describe how to use Metro Mirror for disaster recovery in Chapter 5, “Disaster recovery using PPRC over SAN” on page 185.


4.2 Configuration

To provide fault resilience and to survive the loss of a site (node and storage), GPFS needs three separate sites. There are two main sites (production), which are connected to the same SAN and LAN. Each production site has a complete copy of GPFS file systems’ data and metadata. The third site is not connected to the SAN and has only one node that is connected to the same LAN as the nodes in the production sites. This third site, although it requires little hardware, is essential in case of the loss of a production site. Its role is to keep GPFS alive, by participating in GPFS cluster quorum voting and by holding a third copy of the file systems’ descriptors, thus being able to arbitrate the file systems’ quorum in case an entire storage unit is lost.

4.2.1 SAN configuration for the two production sites

As required for high availability, all SPOF must be eliminated, including SAN elements. At least two SAN switches are required, and each node must have redundant paths to each SAN switch and to each storage device. Figure 4-3 on page 158 shows the minimum SAN cabling requirements between the two sites. Each node must be able to access all application disks from both storage units.

Data mirroring is done at the GPFS level, so all the disks must be visible from both nodes. GPFS mirrors data based on failure group information. In this case, a failure group is a set of disks that belongs to the same site. GPFS enforces the mirroring between the failure groups, guaranteeing that each site contains a good copy of the entire data, metadata, and file system descriptors.

Note: In this configuration, the application runs in both primary and secondary production sites, but not in the third site. In case of either production site failure, operation in the surviving site continues without any user intervention.


Figure 4-3 SAN and network connections through the two sites

4.2.2 GPFS node configuration using three nodes

The GPFS cluster consists of three nodes: two production nodes and one quorum buster. The two production nodes are directly connected to the SAN (FC connection) together with the storage subsystems. The application and database running on these nodes (site A and site B) perform disk I/O directly from each node through the Fibre Channel attachment.

The third node is not attached to the SAN and has only internal disks. It provides GPFS with an internal disk as a Network Shared Disk (NSD). This disk only holds a third copy of the file system descriptors, which is limited in volume and access. All the read and write requests on this NSD disk are done through the network and processed on this third node.

The I/O throughput for this node is negligible, because the disk attached to this node does not hold any data or metadata, and there is no application running on this node that accesses the GPFS file system (in fact, this node must not run any application). Therefore, this node does not affect the overall GPFS performance. This node is called the tiebreaker node, and the site is called the tiebreaker site.

This third (tiebreaker) site must be an independent site; it cannot be a node hosted in one of the two production sites. GPFS can survive the failure of only one site at a time. If two sites fail at the same time (for example, a main site and the third site), GPFS stops on the surviving site, even though this site might still hold a complete and valid copy of the data. Figure 4-4 shows the disaster-resilient architecture using GPFS replication.

Figure 4-4 GPFS replication using three nodes


Three GPFS nodes
In this example, nodes austin1 and austin2 are located in the two production sites (sites A and B), and gpfs_dr is the node in the third site. The GPFS host names are the names of the interconnect network designed for GPFS + RAC (see Chapter 2, “Basic RAC configuration with GPFS” on page 19). We do not recommend using the public (or administrative) network for GPFS; it is not designed for this purpose and is less reliable (no EtherChannel). A sample /etc/hosts file is shown in Example 4-1.

Example 4-1 /etc/hosts file for the GPFS nodes

# Public network
192.168.100.31   austin1   aus_lpar1
192.168.100.32   austin2   aus_lpar2
192.168.100.21   gpfs_dr

# Oracle RAC + GPFS interconnect network
10.1.100.31      austin1_interconnect
10.1.100.32      austin2_interconnect
10.1.100.21      gpfs_dr_interconnect

The two production nodes must have the manager attribute, but the third node is only a client. However, all three nodes must be quorum nodes.

We have prepared a node descriptor file, which is shown in Example 4-2. For more details about how to create a GPFS cluster, see 2.1.6, “GPFS configuration” on page 30.

Example 4-2 Node file for creating a GPFS cluster

root@austin1:/home/michel> cat gpfs_nodefile
austin1_interconnect:quorum-manager
austin2_interconnect:quorum-manager
gpfs_dr_interconnect:quorum
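Creating the cluster from this node file follows the same procedure as in Chapter 2. A minimal sketch is shown below; the configuration servers, remote command settings, and cluster name are simply the values that appear in the mmlsconfig and mmlscluster output later in this section:

mmcrcluster -N gpfs_nodefile -p austin1_interconnect -s austin2_interconnect \
            -r /usr/bin/rsh -R /usr/bin/rcp -C austin_cluster -A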

No GPFS tiebreaker disk
The tiebreaker disk is mainly designed for use with a two-node cluster. In this case, we have three nodes, and because the third node is not connected to the storage, tiebreaker disks do not make sense. If you migrate an existing cluster to this disaster recovery configuration and plan to reuse the existing GPFS cluster, make sure that it is not using tiebreaker disks, as shown in Example 4-3 on page 161.

Important: The /etc/hosts file must be the same on all the nodes and must remain unchanged after the GPFS cluster is created. IP name resolution is critical for any clustering environment. You must make sure that all nodes in this cluster resolve all IP labels (names) identically.


Example 4-3 GPFS configuration must not use tiebreaker disks

root@austin1:/home/michel> mmlsconfig

Configuration data for cluster austin_cluster.austin1_interconnect:
-------------------------------------------------------------------
clusterName austin_cluster.austin1_interconnect
clusterId 720967500852570099
clusterType lc
autoload yes
useDiskLease yes
maxFeatureLevelAllowed 906
tiebreakerDisks no
[gpfs_dr_interconnect]
unmountOnDiskFail yes
[austin1_interconnect]
takeOverSdrServ yes

GPFS cluster topology
Example 4-4 shows the GPFS cluster configuration.

Example 4-4 GPFS three node topology where each node must be part of the quorum

root@austin1:/home/michel> mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         austin_cluster.austin1_interconnect
  GPFS cluster id:           720967500852570099
  GPFS UID domain:           austin_cluster.austin1_interconnect
  Remote shell command:      /usr/bin/rsh
  Remote file copy command:  /usr/bin/rcp

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    austin1_interconnect
  Secondary server:  austin2_interconnect

 Node  Daemon node name      IP address   Admin node name       Designation
----------------------------------------------------------------------------------
   1   austin1_interconnect  10.1.100.31  austin1_interconnect  quorum-manager
   2   austin2_interconnect  10.1.100.32  austin2_interconnect  quorum-manager
   3   gpfs_dr_interconnect  10.1.100.21  gpfs_dr_interconnect  quorum


Third node special setup
Because the function of this node is to serve as a tiebreaker in GPFS quorum decisions, the third node does not require normal file system access or SAN connectivity. To ignore disk access errors on the tiebreaker node, enable the unmountOnDiskFail configuration parameter, as shown in Example 4-5. When enabled, this parameter forces the tiebreaker node to treat the lack of disk connectivity as a local error, resulting in a failure to mount the file system, rather than reporting this condition to the file system manager as a disk failure.

Example 4-5 Avoid propagating inappropriate error messages on the third node

root@gpfs_dr:/home/michel> mmchconfig unmountOnDiskFail=yes gpfs_dr_interconnect
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

Verify that the parameter has been set as shown in Example 4-6.

Example 4-6 GPFS configuration with unmountOnDiskFail option turned on

root@austin1:/home/michel> mmlsconfig
Configuration data for cluster austin_cluster.austin1_interconnect:
-------------------------------------------------------------------
clusterName austin_cluster.austin1_interconnect
clusterId 720967500852570099
clusterType lc
autoload yes
useDiskLease yes
maxFeatureLevelAllowed 906
tiebreakerDisks no
[gpfs_dr_interconnect]
unmountOnDiskFail yes
[austin1_interconnect]
takeOverSdrServ yes

4.2.3 Disk configuration using GPFS replication

On nodes austin1 and austin2, we have two free shared LUNs, one allocated on each SAN storage device. They must be equal in size and meet the space requirements for your database. Each LUN holds one copy of all data and metadata and one copy of the file system descriptor area.

In our example, there are only two LUNs, but in an actual configuration, you might have more LUNs. Just make sure that you have an even number of LUNs and that all LUNs are the same size. Also make sure that half of the LUNs are located in each storage subsystem (in different sites). By assigning the LUNs in each storage subsystem to a different failure group, you make sure that GPFS replication (mirroring) is consistent and useful in case one production site fails.

On node gpfs_dr, we have one free internal SCSI disk. The size of this disk is not that important; it contains only a copy of the file system descriptors (no data or metadata). This disk is in a separate failure group.

The LUNs in the storage subsystems in sites A and B and the internal SCSI disk belong to the same GPFS file system. If you plan to have more than one file system, each file system needs, in addition to an equal number of equal-size LUNs in sites A and B, one separate disk attached to the node in the third site.

Example 4-7 shows the disks that will be used for the new GPFS file system.

Example 4-7 Free LUNs on the main nodes and free internal disk on the third node

root@austin1:/home/michel> lspv
hdisk0          0022be2ab1cd11ac                    rootvg          active
...
hdisk16         none                                None
hdisk17         none                                None

root@gpfs_dr:/home/michel> lspv
hdisk0          00c6629e00bddee5                    rootvg          active
...
hdisk3          none                                None

Create the NSDs to be used later by GPFS, on one of the main nodes and on the third node. The command is mmcrnsd. You must issue this command on a node that can see the disk or the LUN.

Example 4-8 shows the disk descriptor file that we use for this scenario. For more information about this command, refer to 2.1.6, “GPFS configuration” on page 30.

Make sure that the failure group (1, 2, or 3) for each LUN reflects the actual site; there is one failure group on each site. The disks in our example are:

- hdisk16 is a LUN in site A storage (failure group 1)
- hdisk17 is a LUN (same size as hdisk16) in site B storage (failure group 2)
- hdisk3 is an internal disk in the node in site C (failure group 3)

For more information about the sites, see Figure 4-4 on page 159.

Example 4-8 GPFS disk file for NSD creation on the main nodes and third node

root@austin1:/home/michel> cat gpfs_disk_file
hdisk16:austin1_interconnect:austin2_interconnect:dataAndMetadata:1:dr_copy1:
hdisk17:austin2_interconnect:austin1_interconnect:dataAndMetadata:2:dr_copy2:
hdisk3:gpfs_dr_interconnect::descOnly:3:dr_desc:
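The NSDs are then created from this descriptor file, for example (run from a node that can see the LUNs):

mmcrnsd -F gpfs_disk_file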


Check the NSD created as shown in Example 4-9.

Example 4-9 List of NSD disks

root@austin1:/home/michel> mmlsnsd -m

 Disk name  NSD volume ID     Device        Node name             Remarks
----------------------------------------------------------------------------------
 dr_copy1   C0A8641F46F18EF0  /dev/hdisk16  austin1_interconnect  primary node
 dr_copy1   C0A8641F46F18EF0  /dev/hdisk16  austin2_interconnect  backup node
 dr_copy2   C0A8642046F18EF9  /dev/hdisk17  austin1_interconnect  backup node
 dr_copy2   C0A8642046F18EF9  /dev/hdisk17  austin2_interconnect  primary node
 dr_desc    C0A8641546F18FDE  /dev/hdisk3   gpfs_dr_interconnect  primary node

We are now ready to create the file system. To create the file system, we use the same disk descriptor file that was used for the mmcrnsd command. This file is modified by the mmcrnsd command, when the NSDs have been created, as shown in Example 4-10. At this point, start the GPFS daemon on all nodes in the cluster (mmstartup -a).

Example 4-10 Disk file for GPFS file system creation

root@austin1:/home/michel> cat gpfs_disk_file
# hdisk16:austin1_interconnect:austin2_interconnect:dataAndMetadata:1:dr_copy1:
dr_copy1:::dataAndMetadata:1::
# hdisk17:austin2_interconnect:austin1_interconnect:dataAndMetadata:2:dr_copy2:
dr_copy2:::dataAndMetadata:2::
# hdisk3:gpfs_dr_interconnect::descOnly:3:dr_desc:
dr_desc:::descOnly:3::

The NSD disks are accessible from any node in the cluster; thus, you can run the mmcrfs command on any node (see Example 4-11 on page 165). Because you want a fully replicated file system, make sure that you use the correct replication parameters: -m2 -M2 -r2 -R2.

Note: Disks dr_copy1 and dr_copy2 appear twice in the listing shown in Example 4-9, which is normal, because these disks are attached (via SAN) to both production nodes.


Example 4-11 GPFS-replicated file system creation

root@austin1:/home/michel> mmcrfs /disaster /dev/disaster -F gpfs_disk_file -n3 -m2 -M2 -r2 -R2 -A yes

GPFS: 6027-531 The following disks of disaster will be formatted on node austin1:
    dr_copy1: size 4194304 KB
    dr_copy2: size 4194304 KB
    dr_desc: size 71687000 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 208 GB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/disaster.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

The replication parameters that you use to enable and activate the replication are:

- -m: Number of copies of metadata (inodes, directories, and indirect blocks) for a file. Valid values are 1 and 2 (a value of 2 activates metadata replication by default). You cannot set this parameter to 2 if -M is not also set to 2.

- -M: Maximum number of copies of metadata (inodes, directories, and indirect blocks) for a file. Valid values are 1 and 2 (a value of 2 enables metadata replication).

- -r: Number of copies of each data block for a file. Valid values are 1 and 2 (a value of 2 activates data replication by default). You cannot set this parameter to 2 if -R is not also set to 2.

- -R: Maximum number of copies of data blocks for a file. Valid values are 1 and 2 (a value of 2 enables data replication).

Mount and check the parameters of the newly created file system, as shown in Example 4-12.

Example 4-12 Checking the new GPFS-replicated file system

root@austin1:/home/michel> mount
  node       mounted        mounted over    vfs       date        options
-------- --------------- --------------- ------ ------------ ---------------
         /dev/hd4        /               jfs2   Sep 16 02:06 rw,log=/dev/hd8
         /dev/hd2        /usr            jfs2   Sep 16 02:06 rw,log=/dev/hd8
         /dev/hd9var     /var            jfs2   Sep 16 02:06 rw,log=/dev/hd8
         /dev/hd3        /tmp            jfs2   Sep 16 02:06 rw,log=/dev/hd8
         /dev/hd1        /home           jfs2   Sep 16 02:06 rw,log=/dev/hd8
         /proc           /proc           procfs Sep 16 02:06 rw
         /dev/hd10opt    /opt            jfs2   Sep 16 02:06 rw,log=/dev/hd8
         /dev/disaster   /disaster       mmfs   Sep 19 18:24 rw,mtime,atime,dev=disaster

root@austin1:/home/michel> mmlsdisk disaster -L
disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability disk id pool    remarks
------------ -------- ------ ------- -------- ----- ------------- ------------ ------- ------- --------
dr_copy1     nsd         512       1 yes      yes   ready         up                 1 system  desc
dr_copy2     nsd         512       2 yes      yes   ready         up                 2 system  desc
dr_desc      nsd         512       3 no       no    ready         up                 3 system  desc

Number of quorum disks: 3
Read quorum value:      2
Write quorum value:     2

After successful creation, you can use this file system to store the Oracle data files. The file system is resilient to the loss of either one of the production sites.
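As a quick sanity check, the replication factors of the file system can be queried directly at any time (a one-line check; output not shown here):

mmlsfs disaster -m -M -r -R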

4.2.4 Oracle 10g RAC clusterware configuration using three voting disks

We have seen earlier that Oracle 10g RAC provides its own high availability mechanism. Is it the same for disaster recovery?

A disaster recovery configuration implies duplicated SAN storage in different sites. Because the Oracle data files are located on GPFS, they are protected by the file system layer against disaster (if one site is down). Now, what about Oracle Clusterware, which is also called CRS?

Oracle RAC is based on a concurrent (shared) storage architecture, and the main goal of the clustering layer is to prevent unauthorized storage access from nodes that are not considered “safe”. From this perspective, Oracle Clusterware Cluster Ready Services (CRS) manages node failure similarly to GPFS. It uses a voting disk to act as a tiebreaker in case of a node failure. The CRS voting disk might be a raw device or a file that is accessible to all nodes in the cluster. The voting disk (or the access to the disk) is vital for Oracle Clusterware. If the voting disk is lost to any node, even temporarily, it triggers the reboot of the respective node.

Oracle Clusterware cannot survive without a valid voting disk. You can define up to 32 voting disks, all of which contain the same information. For the cluster to be up and running, more than half of the declared voting disks must be accessible. Assume that half of the voting disks are located on a storage unit in site A and the other half in site B. We can see immediately that if one of the storage units (through a site failure) is lost, the voting disks quorum cannot be fulfilled, and all nodes are rebooted by CRS. After the reboot, CRS can reconfigure itself with only the surviving voting disks and can restart all the instances. However, to avoid any disruption, we must have an odd number of voting disks (usually three copies are enough) and have (at least) the third copy on a third site.

As discussed in 1.3.1, “RAC with GPFS” on page 11, we do not recommend using GPFS for storing the CRS voting disks. You must use NFS-shared or SAN-attached storage (as raw devices). It is also possible to use a combination of NFS-shared and SAN-attached raw devices.

Three sites are necessary
With three copies of the voting disk, one copy per site, we can be sure that there are at least two good copies in case of a site failure. Oracle Clusterware remains up and does not reboot any node in the surviving sites. The issue here is that the third node is not connected to the SAN, so it cannot hold a voting disk on SAN storage. That is why Oracle has extended support for a third voting disk on an NFS share, which allows the use of a third site, with no SAN, connected only to the network (accessible from all cluster nodes). This site exports via NFS a normal file, which is used to hold the third voting disk, and this makes the difference when a disaster occurs. This third node does not run any instance of the RAC database. Its role is only that of a tiebreaker, similar to its role for GPFS. And of course, the same third site and node can be used both for the Oracle third voting disk via NFS and for a copy of the GPFS file system descriptors.


Figure 4-5 shows the architecture suitable for a third voting disk on NFS.

Figure 4-5 Third voting disk on NFS

Minimum required releases
The minimum required releases are:

- AIX 5.3 ML4
- Oracle Clusterware 10.2.0.2

Preparing the third node as NFS server
The steps to prepare the third node are:

1. First, create the oracle user and group (in our case, oracle and dba) with the same IDs as on the other nodes.

2. Make sure that the system and network settings and parameters match those of the other nodes.

3. NFS must be running and automatically started at system boot.

4. Create a directory for the CRS voting disk with oracle.dba ownership, and export it via NFS. Nodes austin1 and austin2 are in the two production sites, and gpfs_dr is the node on the third site. Allow the client and root access for austin1 and austin2 (shown in Example 4-13 on page 168).


Example 4-13 NFS directory export with access granted to the primary nodes

smit nfs

                        Add a Directory to Exports List

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

[TOP]                                                     [Entry Fields]
* Pathname of directory to export                   [/voting_disk]           /
  Anonymous UID                                     [-2]
  Public filesystem?                                 no                      +
* Export directory now, system restart or both       both                    +
  Pathname of alternate exports file                []
  Allow access by NFS versions                      []                       +
  External name of directory (NFS V4 access only)   []
  Referral locations (NFS V4 access only)           []
  Replica locations                                 []
  Ensure primary hostname in replica list            yes                     +
  Allow delegations?                                 no                      +
* Security method 1                                 [sys,krb5p,krb5i,krb5,>  +
* Mode to export directory                           read-write              +
  Hostname list. If exported read-mostly            []
  Hosts & netgroups allowed client access           [austin1,austin2]
  Hosts allowed root access                         [austin1,austin2]
  Security method 2                                 []                       +
  Mode to export directory                          []                       +
  Hostname list. If exported read-mostly            []
  Hosts & netgroups allowed client access           []
  Hosts allowed root access                         []
  Security method 3                                 []                       +
  Mode to export directory                          []                       +
  Hostname list. If exported read-mostly            []
  Hosts & netgroups allowed client access           []
  Hosts allowed root access                         []
  Security method 4                                 []                       +
  Mode to export directory                          []                       +
  Hostname list. If exported read-mostly            []
  Hosts & netgroups allowed client access           []
  Hosts allowed root access                         []
  Security method 5                                 []                       +
  Mode to export directory                          []                       +
[MORE...3]

F1=Help             F2=Refresh          F3=Cancel           F4=List
F5=Reset            F6=Command          F7=Edit             F8=Image
F9=Shell            F10=Exit            Enter=Do

The result is shown in Example 4-14 on page 169.


Example 4-14 Configuration of NFS server on the third node

root@gpfs_dr:/> ls -l /voting_disk
drwxrwxrwx   2 oracle   dba   256 Sep 21 13:56 voting_disk/
root@gpfs_dr:/> exportfs -a
/voting_disk -sec=sys:krb5p:krb5i:krb5:dh,rw,access=austin1:austin2,root=austin1:austin2
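If you prefer to edit /etc/exports directly instead of using SMIT, an entry along the following lines, followed by exportfs -a, produces an equivalent export. This is only a sketch matching the options shown in Example 4-14:

/voting_disk -rw,access=austin1:austin2,root=austin1:austin2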

Mounting the directory on nodes in production sites
These nodes are NFS clients for the voting disk. Create a mount point with oracle.dba ownership, as shown in Example 4-15.

Example 4-15 Mount point for NFS voting disk

root@austin2:/> mkdir /voting_disk
root@austin2:/> chown oracle.dba /voting_disk

Configure the buffer size, timeout, protocol, and security method, as shown in Example 4-16 on page 170. Make sure that this directory is mounted automatically after a reboot.


Example 4-16 SMIT add an NFS for mounting window

Add a File System for Mounting

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

[TOP]                                                     [Entry Fields]
* Pathname of mount point                           [/voting_disk]           /
* Pathname of remote directory                      [/voting_disk]
* Host where remote directory resides               [gpfs_dr]
  Mount type name                                   []
* Security method                                   [sys]                    +
* Mount now, add entry to /etc/filesystems or both?  both                    +
* /etc/filesystems entry will mount the directory    yes                     +
    on system restart.
* Mode for this NFS file system                      read-write              +
* Attempt mount in foreground or background          background              +
  Number of times to attempt mount                  []                       #
  Buffer size for read                              [32768]                  #
  Buffer size for writes                            [32768]                  #
  NFS timeout. In tenths of a second                [600]                    #
  NFS version for this NFS filesystem                3                       +
  Transport protocol to use                          tcp                     +
  Internet port number for server                   []                       #
* Allow execution of setuid and setgid programs      yes                     +
    in this file system?
* Allow device access via this mount?                yes                     +
* Server supports long device numbers?               yes                     +
* Mount file system soft or hard                     hard                    +
  Minimum time, in seconds, for holding             [3]                      #
    attribute cache after file modification
  Allow keyboard interrupts on hard mounts?          yes                     +
  Maximum time, in seconds, for holding             [60]                     #
    attribute cache after file modification
  Minimum time, in seconds, for holding             [30]                     #
    attribute cache after directory modification
  Maximum time, in seconds, for holding             [60]                     #
    attribute cache after directory modification
  Minimum & maximum time, in seconds, for           []                       #
    holding attribute cache after any modification
[MORE...6]

F1=Help             F2=Refresh          F3=Cancel           F4=List
F5=Reset            F6=Command          F7=Edit             F8=Image
F9=Shell            F10=Exit            Enter=Do

Check the /etc/filesystems file for the new entry, as shown in Example 4-17 on page 171. Also, add the noac option in the /etc/filesystems file.


Example 4-17 Valid /etc/filesystems file

root@austin1:/home/michel> cat /etc/filesystems
...
/voting_disk:
        dev             = /voting_disk
        vfs             = nfs
        nodename        = gpfs_dr
        mount           = true
        options         = rw,bg,hard,intr,rsize=32768,wsize=32768,timeo=600,vers=3,proto=tcp,noac,sec=sys
        account         = false
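For reference, the equivalent one-time mount command uses the same options (a sketch; the permanent entry in /etc/filesystems is still required so that the directory is mounted after a reboot):

mount -o rw,bg,hard,intr,rsize=32768,wsize=32768,timeo=600,vers=3,proto=tcp,noac,sec=sys \
      gpfs_dr:/voting_disk /voting_disk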

Make sure that the file system is mounted, as shown in Example 4-18.

Example 4-18 Valid result of mount command

root@austin1:/home/michel> mount
  node       mounted        mounted over    vfs       date        options
-------- --------------- --------------- ------ ------------ ---------------
         /dev/hd4        /               jfs2   Sep 21 14:52 rw,log=/dev/hd8
         /dev/hd2        /usr            jfs2   Sep 21 14:52 rw,log=/dev/hd8
         /dev/hd9var     /var            jfs2   Sep 21 14:53 rw,log=/dev/hd8
         /dev/hd3        /tmp            jfs2   Sep 21 14:53 rw,log=/dev/hd8
         /dev/hd1        /home           jfs2   Sep 21 14:53 rw,log=/dev/hd8
         /proc           /proc           procfs Sep 21 14:53 rw
         /dev/hd10opt    /opt            jfs2   Sep 21 14:53 rw,log=/dev/hd8
         /dev/disaster   /disaster       mmfs   Sep 21 14:54 rw,mtime,atime,dev=disaster
gpfs_dr  /voting_disk    /voting_disk    nfs3   Sep 21 16:10 rw,bg,hard,intr,rsize=32768,wsize=32768,timeo=600,vers=3,proto=tcp,noac,sec=sys
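The options that are actually in effect for the NFS mount, in particular noac, can also be double-checked with:

nfsstat -m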

Adding the NFS voting disk to Oracle Clusterware
We assume that Oracle Clusterware is running and configured with two nodes and three voting disks on SAN. Two of the voting disks are located on one storage unit, and the remaining voting disk is on the other storage unit. Even with these three voting disks, the cluster is not disaster resilient, because if the storage unit that holds two voting disks fails, Oracle reboots all nodes. To add the NFS voting disk to Oracle Clusterware, follow these steps:

1. Because the online modification of the CRS voting disks configuration is not supported, we stop our Oracle 10g RAC database and CRS on all the nodes, as shown in Example 4-19.

Example 4-19 Stopping Oracle Clusterware

root@austin1:/orabin/crs/bin> crsctl stop crs
Stopping resources. This could take several minutes.
Successfully stopped CRS resources.
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.

2. Check to see if CRS is really stopped, as shown in Example 4-20 on page 172.


Example 4-20 Check that the CRS stack is stopped

root@austin1:/orabin/crs/bin> crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM

The initial configuration was made with three voting disks, which are located on two different storage devices. The voting disks are shown in Example 4-21.

Example 4-21 Initial list of the configured voting disks on SAN

root@austin1:/orabin/crs/bin> crsctl query css votedisk
 0.     0    /dev/votedisk1
 1.     0    /dev/votedisk2
 2.     0    /dev/votedisk3

located 3 votedisk(s).

3. Delete one of the voting disks on the storage unit that holds two of them, as shown in Example 4-22.

Example 4-22 Removal of an unnecessary voting disk on SAN

root@austin1:/orabin/crs/bin> crsctl delete css votedisk /dev/votedisk3 -force
successful deletion of votedisk /dev/votedisk3.

4. We add the NFS-shared voting disk as shown in Example 4-23. Even though the NFS voting disk creates traffic over the IP network, this traffic is insignificant, and the existence of the voting disk is more important than its actual I/O.

Example 4-23 Adding the third NFS voting disk

root@austin1:/> crsctl add css votedisk /voting_disk/voting_disk3_for_DR -force

Now formatting voting disk: /voting_disk/voting_disk3_for_DR
successful addition of votedisk /voting_disk/voting_disk3_for_DR.

5. Next, we have to change the owner of the new voting disk (an NFS file in our case). This step, shown in Example 4-24, is extremely important, and if you skip it, CRS will not start.

Example 4-24 Change the owner of the newly created voting disk

root@austin2:/voting_disk> ll
-rw-r--r--   1 root     system     10306048 Sep 21 16:27 voting_disk3_for_DR

root@austin2:/voting_disk> chown oracle.dba voting_disk3_for_DR

root@austin2:/voting_disk> ll
-rw-r--r--   1 oracle   dba        10306048 Sep 21 16:27 voting_disk3_for_DR

6. Check the configuration. You must have an output similar to Example 4-25 on page 173.


Example 4-25 Final configuration: Two voting disks on SAN and one voting disk on NFS

root@austin1:/orabin/crs/bin> crsctl query css votedisk
 0.     0    /dev/votedisk1
 1.     0    /dev/votedisk2
 2.     0    /voting_disk/voting_disk3_for_DR

located 3 votedisk(s).

7. Restart Oracle Clusterware on all the nodes, which triggers the restart of instances as well.

Example 4-26 Restart Oracle Clusterware

root@austin1:/orabin/crs/bin> crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly

4.3 Testing and recovery

The tests that we describe in this section simulate a disaster. We tested node failures, the failure of one storage unit, and the failure of both storage units.

Node failure is simulated by halting the node (halt -q command), which also powers off the node. This method is different from a normal shutdown (shutdown -Fr), which stops all applications and processes, synchronizes the file systems, and then stops.

Storage failure is simulated by removing the host mapping at the storage level; thus, the host loses the disk connection immediately.

We can categorize the failure tests as follows:

- High availability tests: Loss of a node
- Disaster recovery tests: Loss of a storage unit, or of a node and its storage at the same time

The purpose of this series of tests is to verify that GPFS and Oracle 10g RAC behave as expected in a disaster recovery situation. The results are in line with the expectations, for both GPFS and Oracle, as long as the configuration explained in this chapter is complete.

The hardware architecture of the test platform is shown in Figure 4-4 on page 159. There are two production nodes: austin1 and austin2. A third one, gpfs_dr, is used as a tiebreaker node for GPFS and holds the third Oracle Clusterware voting disk that was exported via NFS to austin1 and austin2.


4.3.1 Failure of a GPFS node

In this test, we verify the GPFS node quorum rule (three quorum nodes and no tiebreaker disk).

We test the worst-case scenario by stopping the primary node, austin1. This node is the GPFS cluster manager and also the file system manager node for the GPFS file system (mount point /disaster), as shown in Example 4-27.

Example 4-27 Checking the file system manager node for /disaster file system

root@austin1:/> mmlsmgr
file system      manager node                [from 10.1.100.31 (austin1_interconnect)]
---------------- ------------------
disaster         10.1.100.31 (austin1_interconnect)

The script shown in Example 4-28 was run on austin1 and austin2, the main nodes, before the failure. Its goal is to estimate the outage time. We observed no outage.

Example 4-28 Script to check GPFS file system availability during failures

root@austin1:/home/michel> while true
> do
> print $(date) >> /disaster/test_date_austin1
> sleep 1
> done
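After the test, the timestamp file can be scanned for gaps. A small awk sketch such as the following (not part of our test scripts, and ignoring a possible midnight wrap-around) prints any pause longer than two seconds:

awk '{ split($4, t, ":"); s = t[1]*3600 + t[2]*60 + t[3];
       if (NR > 1 && s - prev > 2) print "gap of", s - prev, "seconds before:", $0;
       prev = s }' /disaster/test_date_austin1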

The node fails
The node austin1 is halted at 12:19:37. The process that writes the date into a file disappears together with the node. The date command output on austin1 is shown in Example 4-29.

Example 4-29 Node austin1 halted at 12:19:37

...
Thu Sep 20 12:19:30 CDT 2007
Thu Sep 20 12:19:31 CDT 2007
Thu Sep 20 12:19:32 CDT 2007
Thu Sep 20 12:19:33 CDT 2007
Thu Sep 20 12:19:34 CDT 2007
Thu Sep 20 12:19:35 CDT 2007
Thu Sep 20 12:19:36 CDT 2007
Thu Sep 20 12:19:37 CDT 2007
Thu Sep 20 12:19:3

On the other node, austin2, we can see in Example 4-30 on page 175 that there is no outage on the GPFS file system /disaster, which remains up and running despite the failure of one node. The third node (gpfs_dr) is important for maintaining the node quorum, thus keeping the GPFS file system active on the surviving nodes.

When the failing node has the cluster configuration manager role, failing this management role over to another node (one that has the management capability) takes less than 135 seconds. During this time, the GPFS file systems are frozen on all the nodes. I/O can still proceed in memory, as long as the page pool size is sufficient. After the memory buffers are filled, the application waits for the I/O to complete, just like normal I/O. If the failing node is not the cluster configuration manager, there is no freeze at all.


Example 4-30 Node austin2 node does not stop

...
Thu Sep 20 12:19:30 CDT 2007
Thu Sep 20 12:19:31 CDT 2007
Thu Sep 20 12:19:32 CDT 2007
Thu Sep 20 12:19:33 CDT 2007
Thu Sep 20 12:19:34 CDT 2007
Thu Sep 20 12:19:35 CDT 2007
Thu Sep 20 12:19:36 CDT 2007
Thu Sep 20 12:19:37 CDT 2007
Thu Sep 20 12:19:38 CDT 2007
Thu Sep 20 12:19:39 CDT 2007
Thu Sep 20 12:19:40 CDT 2007
Thu Sep 20 12:19:41 CDT 2007
Thu Sep 20 12:19:42 CDT 2007
Thu Sep 20 12:19:43 CDT 2007
Thu Sep 20 12:19:44 CDT 2007
...

The second node is aware of the austin1 failure, as displayed in austin2’s GPFS log shown in Example 4-31.

Example 4-31 Node austin2 GPFS log during node failure test

root@austin2:/var/adm/ras> cat mmfs.log.latest
...
Thu Sep 20 12:22:04 2007: GPFS: 6027-777 Recovering nodes: 10.1.100.31
Thu Sep 20 12:22:05 2007: GPFS: 6027-630 Node 10.1.100.32 (austin2_interconnect) appointed as manager for disaster.
Thu Sep 20 12:22:38 2007: GPFS: 6027-643 Node 10.1.100.32 (austin2_interconnect) completed take over for disaster.
Thu Sep 20 12:22:38 2007: GPFS: 6027-2706 Recovered 1 nodes.
...

Node austin2 assumes the role of file system manager (mmlsmgr command) and cluster configuration manager (mmfsadm dump cfgmgr command) as shown in Example 4-32.

Example 4-32 Node austin2 is the new manager node for /disaster GPFS file system after node failure

root@gpfs_dr:/home/michel> mmlsmgr
file system      manager node                [from 10.1.100.32 (austin2_interconnect)]
---------------- ------------------
disaster         10.1.100.32 (austin2_interconnect)

root@gpfs_dr:/home/michel> mmfsadm dump cfgmgr
nClusters 1

Cluster Configuration [0] "austin_cluster.austin1_interconnect": Type: 'LC' id 0A01641F46EB27F3
ccUseCount 1 unused since (never) contactListRefreshMethod 0
Domain , myAddr 0 10.1.100.21, authIsRequired false
UID domain 0xF1000004404EB650 (0xF1000004404EB650) Name "austin_cluster.austin1_interconnect" hold count 1 CredCacheHT 0x0 IDCacheHT 0x0
No of nodes: 3 total, 3 local, 3 core nodes.

Authorized keys list:
clusterName                          port cKeyGen nKeyGen cipherList

Cluster info list:
clusterName                          port cKeyGen nKeyGen cipherList
austin_cluster.austin1_interconnect  1191      -1      -1 EMPTY

node  primary admin --status--- join fail SGs  other ip addrs,
  no idx host name      ip address   func tr p rpc seqNo cnt mngd last failure
---- ----- -------------- ------------ ----- ----------- ------ ---- ---- -------------------
   3    0  gpfs_dr_interc 10.1.100.21  q-l  --   J up       1    0    0
   1    1  austin1_interc 10.1.100.31  qml  --   - down     1    1    0   2007-09-20 12:21:54
   2    2  austin2_interc 10.1.100.32  qml  --   J up       1    0    0

Current clock tick (seconds since boot): 163390.45 (resolution 0.010) = 2007-09-20 14:25:24

Groupleader 10.1.100.32 0x00000002 (other node)
Cluster configuration manager is 10.1.100.32 (other node); pendingOps 0
group quorum formation time 2007-09-19 18:23:58
gid 46f1af91:0a01641f elect <2:3> seq 2 pendingSeq 2, gpdIdle phase 0, joined
GroupLeader: joined 1 gid <2:3>
useDiskLease yes, leaseDuration 35 recoveryWait 35 dmsTimeout 23
lastFailedLeaseGranted: 155944.67
lastLeaseObtained 163390.32, 34.87 sec left (ok)
lastLeaseReplyReceived 163390.32 = 0.00 sec after request
stats: nTemporaryLeaseLoss 1 nSupendIO 0 nLeaseOverdue 4 nTakeover 0 nPinging 0
Summary of lease renewal round-trip times:
  Number of keys = 1, total count 2522
  Min 0 Max 0, Most common 0 (2522)
  Mean 0, Median 0, 99th 0
ccSANergyExport no

For more information about the cluster configuration manager and file system manager roles, refer to 2.1.6, “GPFS configuration” on page 30.

The third node (gpfs_dr) is also aware of the cluster changes, but it takes no special action, as shown in Example 4-33.

Example 4-33 Node gpfs_dr GPFS log during node failure test

root@gpfs_dr:/var/adm/ras> cat mmfs.log.latest
...
Thu Sep 20 12:21:54 2007: GPFS: 6027-777 Recovering nodes: 10.1.100.31
Thu Sep 20 12:21:55 2007: GPFS: 6027-2706 Recovered 1 nodes.
...


4.3.2 Recovery when the GPFS node is back

There is actually no special action for recovery. You fix the failed node and restart it. After GPFS starts, the node is reintegrated and Oracle can start. Note that the GPFS file system is never unmounted on the other nodes during the failure.

Even if the failing node had the cluster configuration manager role before its failure, this role is not transferred back automatically. The other node (austin2) continues to perform this role, thus avoiding an unnecessary fallback that might freeze file system activity for a short time.
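The exact reintegration commands depend on your configuration. The following is a minimal sketch, assuming that the failed node is austin1 and that the file system is /disaster (commands run as root; node and file system names are those of our test environment):

mmstartup -N austin1_interconnect   # start GPFS on the repaired node
mmgetstate -a -L                    # verify that the node rejoins the cluster and that quorum is met
mmlsmount disaster -L               # confirm where the /disaster file system is mounted
crsctl start crs                    # then start Oracle Clusterware (and the RAC instance) on the node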

4.3.3 Loss of one storage unit

This is the most important of the disaster recovery scenarios. The loss of a storage device tests the actual GPFS replication (mirroring) and its capacity to survive with only one copy of the data. Remember that even after such a failure, we still have two good copies of the file system descriptors (the third copy is on the gpfs_dr node's internal SCSI disk), which is mandatory to keep the file system active.

Austin1 has both management roles (listed in Example 4-34 and Example 4-35 on page 178). We test what happens if this node loses its access to the local storage.

Example 4-34 Node austin1 is the file system manager for /disaster

root@austin1:/var/mmfs/gen> mmlsmgr
file system      manager node [from 10.1.100.31 (austin1_interconnect)]
---------------- ------------------
disaster         10.1.100.31 (austin1_interconnect)

Note: If a GPFS node fails, the file systems remain active on the remaining nodes with no I/O disruption. However, disk I/O might be suspended for up to 135 seconds if the failing node has a management role (configuration manager, file system manager, or metanode) that must be migrated to a surviving node.


Example 4-35 Node austin1 is the cluster configuration manager

root@austin1:/var/mmfs/gen> mmfsadm dump cfgmgr
...
          node                primary      admin --status--- join fail SGs  -lease-renewals-- --heartbeats--  other
 no idx host name      ip address   func tr p rpc seqNo  cnt mngd sent      processed                         ip addrs
--- --- -------------- ------------ ---- -- - --- ------ ---- ---- -----------------
  3   2 gpfs_dr_interc 10.1.100.21  q-l  -- J up       2    0    0 03389.57  03390.17
  1   0 austin1_interc 10.1.100.31  qml  -- J up       1    0    2 00101.34
  2   1 austin2_interc 10.1.100.32  qml  -- J up       1    0    1 03404.27  03404.97

Current clock tick (seconds since boot): 3413.57 (resolution 0.010) = 2007-09-20 16:05:29

Groupleader 10.1.100.31 0x00000000 (this node)
Cluster configuration manager is 10.1.100.31 (this node); pendingOps 0
group quorum formation time 2007-09-20 15:10:16
...

It is important to check the GPFS (/disaster) replication before the test. In Example 4-36, you can see that the NSD disks dr_copy1 and dr_copy2 are both holding data, metadata, and file system descriptors. They are connected with dual Fibre Channel attachment to both austin1 and austin2. These disks are located on different storage units, situated in separate sites, so the risk of losing both disks is limited. The third NSD disk, dr_desc, is an internal SCSI disk in the third node. Accessed by the network only (no SAN), it contains a third copy of the file system descriptors. The replication settings have been defined at the file system level (see 4.2.3, “Disk configuration using GPFS replication” on page 162).

Example 4-36 GPFS replicated file system configuration before disk failure

root@austin1:/var/mmfs/gen> mmlsdisk disaster -L

disk     driver sector failure holds    holds                              storage
name     type   size   group   metadata data  status availability disk id  pool    remarks
-------- ------ ------ ------- -------- ----- ------ ------------ -------  ------- -------
dr_copy1 nsd    512    1       yes      yes   ready  up           1        system  desc
dr_copy2 nsd    512    2       yes      yes   ready  up           2        system  desc
dr_desc  nsd    512    3       no       no    ready  up           3        system  desc

Number of quorum disks: 3
Read quorum value: 2
Write quorum value: 2


The disk fails

As we previously mentioned, we simulate disk loss by removing LUN mapping at the storage subsystem level.

Now, the austin1 node has lost its NSD disk, because the LUN mapping to the host was removed at 16:19:19. The failing disk is hdisk16 to AIX, or dr_copy1 to GPFS, as revealed by the disk error messages in the AIX error report (errpt | egrep "ARRAY|mmfs"), which can be examined in detail by using the errpt -aj command, as shown in Example 4-37.

Example 4-37 Node austin1 AIX error report during disk failure

root@austin1:/> errpt | egrep "ARRAY|mmfs"
2E493F13   0920161907 P H hdisk16        ARRAY OPERATION ERROR
9C6C05FA   0920161907 P H mmfs           DISK FAILURE
2E493F13   0920162007 P H hdisk16        ARRAY OPERATION ERROR

root@austin1:/> errpt -aj 2E493F13
LABEL:          FCP_ARRAY_ERR2
IDENTIFIER:     2E493F13

Date/Time:       Thu Sep 20 16:19:19 CDT 2007
Sequence Number: 82
Machine Id:      00CC5D5C4C00
Node Id:         austin1
Class:           H
Type:            PERM
Resource Name:   hdisk16
Resource Class:  disk
Resource Type:   array
Location:        U7879.001.DQDKZNV-P1-C1-T1-W201300A0B811A662-LE000000000000

Description
ARRAY OPERATION ERROR

Probable Causes
ARRAY DASD DEVICE

Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

root@austin1:/> errpt -aj 9C6C05FA
LABEL:          MMFS_DISKFAIL
IDENTIFIER:     9C6C05FA

Date/Time:       Thu Sep 20 16:19:24 CDT 2007
Sequence Number: 83
Machine Id:      00CC5D5C4C00
Node Id:         austin1
Class:           H
Type:            PERM
Resource Name:   mmfs
Resource Class:  NONE
Resource Type:   NONE
Location:

Description
DISK FAILURE

Probable Causes
STORAGE SUBSYSTEM
DISK

Failure Causes
STORAGE SUBSYSTEM
DISK

Recommended Actions
CHECK POWER
RUN DIAGNOSTICS AGAINST THE FAILING DEVICE

Detail Data
EVENT CODE
15913921
VOLUME
disaster
RETURN CODE
22
PHYSICAL VOLUME
dr_copy1

Because GPFS replication is activated, there is no impact on, and no freeze of, the /disaster file system. The file system remains operational on all three nodes. Users and applications do not notice anything unusual regarding I/O. You only see the problem in the logs (the AIX error report shown in Example 4-37 on page 179 and the GPFS log shown in Example 4-38).

Example 4-38 GPFS log on austin1 node

root@austin1:/var/mmfs/gen> cat mmfslog
...
Thu Sep 20 16:19:24 2007: GPFS: 6027-680 Disk failure. Volume disaster. rc = 22. Physical volume dr_copy1.
...

Example 4-39 shows the status of the file system during the disk failure. A good copy of the data and the metadata is still accessible via the dr_copy2 disk, and the data is not lost during the failure. Also, because two valid copies of the file system descriptor still exist (dr_copy2 and dr_desc disks), the file system is still mounted and active.

Example 4-39 GPFS-replicated file system configuration during disk failure

root@austin1:/> mmlsdisk disaster -L
disk     driver sector failure holds    holds                       storage
name     type   size   group   metadata data  status availability   pool    remarks
-------- ------ ------ ------- -------- ----- ------ ------------   ------- -------
dr_copy1 nsd    512    1       yes      yes   ready  down           system  desc
dr_copy2 nsd    512    2       yes      yes   ready  up             system  desc
dr_desc  nsd    512    3       no       no    ready  up             system  desc


Refer to Figure 4-4 on page 159 for a reminder of this architecture.

4.3.4 Fallback after the GPFS disks are recovered

When the storage is back up and running, you must run administrative commands to recover to the original situation:

1. If the GPFS cluster configuration has changed during the failure, run the command shown in Example 4-40 to ensure that the configuration level is the same on all nodes.

Example 4-40 Synchronize the cluster configuration (if changed during the failure)

root@austin1:/> mmchcluster -p LATEST

mmchcluster: Command successfully completed

2. Then, run the command shown in Example 4-41, from any node, to tell GPFS to bring back (start) the disk that has been marked down since its failure.

Example 4-41 Reintegrating the failed disk

root@austin1:/> mmchdisk disaster start -a

GPFS: 6027-589 Scanning file system metadata, phase 1 ...
GPFS: 6027-552 Scan completed successfully.
GPFS: 6027-589 Scanning file system metadata, phase 2 ...
GPFS: 6027-552 Scan completed successfully.
GPFS: 6027-589 Scanning file system metadata, phase 3 ...
GPFS: 6027-552 Scan completed successfully.
GPFS: 6027-589 Scanning file system metadata, phase 4 ...
GPFS: 6027-552 Scan completed successfully.
GPFS: 6027-565 Scanning user file metadata ...
GPFS: 6027-552 Scan completed successfully.

As a result, the command in Example 4-42 shows that our disk is operational.

Example 4-42 GPFS-replicated file system configuration after disk failure

root@austin1:/> mmlsdisk disaster

disk     driver   sector failure holds    holds                      storage
name     type     size   group   metadata data  status availability  pool
-------- -------- ------ ------- -------- ----- ------ ------------  -------
dr_copy1 nsd      512    1       yes      yes   ready  up            system
dr_copy2 nsd      512    2       yes      yes   ready  up            system
dr_desc  nsd      512    3       no       no    ready  up            system

3. The last action is to replicate the data and metadata (synchronize the mirror). Be aware that this can be an I/O intensive action, depending on the size of your file system. Example 4-43 on page 182 shows how to resynchronize the file system.

Note: When using a GPFS cluster with three nodes in three sites and a replicated GPFS file system on two storage devices, the failure of one storage device has no impact on the I/O. There is no freeze, and there is no data loss. Everything is managed transparently by GPFS.


Example 4-43 Resynchronization of the GPFS mirror

root@austin1:/home/michel> mmrestripefs disaster -b

GPFS: 6027-589 Scanning file system metadata, phase 1 ...
 100 % complete on Mon Oct 1 11:13:20 2007
GPFS: 6027-552 Scan completed successfully.
GPFS: 6027-589 Scanning file system metadata, phase 2 ...
 100 % complete on Mon Oct 1 11:13:20 2007
GPFS: 6027-552 Scan completed successfully.
GPFS: 6027-589 Scanning file system metadata, phase 3 ...
 100 % complete on Mon Oct 1 11:13:20 2007
GPFS: 6027-552 Scan completed successfully.
GPFS: 6027-589 Scanning file system metadata, phase 4 ...
 100 % complete on Mon Oct 1 11:13:20 2007
GPFS: 6027-552 Scan completed successfully.
GPFS: 6027-565 Scanning user file metadata ...
GPFS: 6027-552 Scan completed successfully.

At this point, you have fully recovered from the storage failure. As you can see, the procedure is not difficult.

4.3.5 Site disaster (node and disk failure)

To simulate a disaster in Site A, both the node and storage device must stop suddenly and at the same time.

As in the previous examples, the node that is stopped is the cluster configuration manager and also the file system manager for the /disaster file system. In addition, the mapping of the dr_copy2 disk is removed (at the storage subsystem level) to simulate a storage device problem in Site A, and the austin1 node is halted at the same time, so Site A is no longer responding.

Although this scenario combines two simultaneous events, it does not differ from the node failure and disk failure cases that we discussed in 4.3.1, “Failure of a GPFS node” on page 174 and 4.3.3, “Loss of one storage unit” on page 177.

The GPFS log on a surviving node has captured both events, as shown in Example 4-44.

Example 4-44 GPFS log showing a disk failure (dr_copy2), and a node failure (austin1)

root@austin2:/var/mmfs/gen> cat mmfslog

Thu Sep 20 17:48:12 2007: GPFS: 6027-680 Disk failure. Volume disaster. rc = 22. Physical volume dr_copy2.
Thu Sep 20 17:49:13 2007: GPFS: 6027-777 Recovering nodes: 10.1.100.31
Thu Sep 20 17:49:13 2007: GPFS: 6027-630 Node 10.1.100.32 (austin2_interconnect) appointed as manager for disaster.
Thu Sep 20 17:49:37 2007: GPFS: 6027-643 Node 10.1.100.32 (austin2_interconnect) completed take over for disaster.
Thu Sep 20 17:49:37 2007: GPFS: 6027-2706 Recovered 1 nodes.

However, GPFS remains functional on both austin2 and gpfs_dr.


4.3.6 Recovery after the disaster

Recovering after a complete site disaster is similar to recovering from the separate node and disk failure cases that we tested previously. Refer to 4.3.2, “Recovery when the GPFS node is back” on page 177 and 4.3.4, “Fallback after the GPFS disks are recovered” on page 181.

4.3.7 Loss of one Oracle Clusterware voting disk

Oracle data and code files are stored on GPFS, so they are already protected and do not need extra care. Oracle Clusterware and Oracle 10g RAC manage the availability of the instances by failing over the sessions of a failed instance to another instance using Transparent Application Failover (TAF).

In this scenario, we test the failure of one of the CRS voting disks.

We want to determine whether Oracle 10g RAC can survive a site failure that includes a node and instance crash and a storage outage (including one of the voting disks). We have already tested the GPFS layer, and we know that it can survive a disaster. However, because the CRS voting disks are outside GPFS (two disks on the shared storage as LUNs, and one on an NFS file system mounted from a third node), we test this scenario separately.

CRS voting disk failure is simulated (in the same way that the previous tests were simulated) by removing the LUN mapping on the storage subsystem. As a result, RAC remains up and running with no loss of service. In the CRS log, we can see these lines shown in Example 4-45.

Example 4-45 CRS logs during the failure of one voting disk

/orabin/crs/log/austin1/alertaustin1.log

[crsd(704594)]CRS-1012:The OCR service started on node austin1.
2007-09-21 16:30:45.498
[evmd(630950)]CRS-1401:EVMD started on node austin1.
2007-09-21 16:30:46.946
[crsd(704594)]CRS-1201:CRSD started on node austin1.
2007-09-21 16:33:58.435
[cssd(635070)]CRS-1601:CSSD Reconfiguration complete. Active nodes are austin1 austin2 .
2007-09-21 17:04:08.000

Note: After a complete Site A disaster, GPFS remains available on Site B because:

• The node quorum is still matched due to the gpfs_dr third node on Site C.

• The file system is replicated (mirrored), so GPFS can use the copy#2 on the Site B storage unit.

• GPFS still has two good copies of the file system descriptors due to the tiebreaker node, gpfs_dr.

Note: We do not recommend that you store the voting disks on GPFS file systems, as stated in 1.3.1, “RAC with GPFS” on page 11. Use the raw hdisk (shared SAN) without any Logical Volume Manager or file system layer. One third voting disk is supported on NFS (see 4.2.4, “Oracle 10g RAC clusterware configuration using three voting disks” on page 166).


[cssd(635070)]CRS-1604:CSSD voting file is offline: /dev/votedisk2. Details in /orabin/crs/log/austin1/cssd/ocssd.log.
2007-09-21 17:04:13.035

/orabin/crs/log/austin1/cssd/ocssd.log
[    CSSD]2007-09-21 17:16:17.076 [1287] >ERROR: clssnmvReadBlocks: read failed 1 at offset 133 of /dev/votedisk2
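Independently of the log entries above, you can query CSS directly to see which voting disks Clusterware is configured to use and to verify its overall health during such a test. This is a generic sketch (run as root or as the CRS owner on any node), not part of the original test output:

crsctl query css votedisk   # lists the configured voting disk paths and their numbers
crsctl check crs            # verifies that the CSS, CRS, and EVM daemons are running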

4.3.8 Loss of a second Oracle Clusterware (CRS) voting disk

Next, we bring down a second voting disk. The CRS error log adds the lines shown in Example 4-46.

Example 4-46 CRS logs during the failure of two voting disks

/orabin/crs/log/austin2/alertaustin2.log

2007-09-21 17:48:09.659
[cssd(479262)]CRS-1606:CSSD Insufficient voting files available [1 of 3]. Details in /orabin/crs/log/austin2/cssd/ocssd.log.

Because the voting disk quorum is no longer met (more than half of the voting disks must be accessible), all Oracle 10g RAC instances are stopped, and CRS reboots the servers. When the nodes come back up, CRS reconfigures itself to use only the one remaining voting disk and restarts the instances. The database service is therefore up again, but it is no longer disaster resilient.
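Before the configuration is disaster resilient again, the lost voting disks must be re-added after the storage is restored. The following sketch reflects our assumption of how this is typically done on Oracle Clusterware 10g Release 2 (voting disks are normally added with CRS stopped, using the -force option); /dev/votedisk2 is taken from the logs above, the other path is hypothetical, and you must verify the exact procedure against the Oracle Clusterware documentation for your patch level:

crsctl stop crs                                  # stop Clusterware on all nodes first
crsctl add css votedisk /dev/votedisk2 -force    # re-add a voting disk that was lost
crsctl add css votedisk /dev/votedisk3 -force    # (second path is a placeholder)
crsctl query css votedisk                        # confirm that three voting disks are configured again
crsctl start crs                                 # restart Clusterware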

Note: Oracle 10g RAC can survive the loss of one of the three voting disks.

Note: Oracle 10g RAC cannot survive the loss of two out of the three voting disks.


Chapter 5. Disaster recovery using PPRC over SAN

This chapter describes a sample Oracle 10g RAC disaster recovery (DR) configuration with GPFS using storage replication that is provided by the storage subsystems. The storage replication mechanism uses SAN (Storage Area Network) as the transport infrastructure. The application itself is unaware of the mirroring, because the mirroring is done directly by the storage devices without involving GPFS or Logical Volume Manager (LVM) mirroring. In this scenario, we describe a solution that is based on synchronous mirroring, which is called Metro Mirror for IBM System Storage™ DS8000 (formerly Peer to Peer Remote Copy (PPRC)).

Advantages of this solution are:

• Only two sites are used, both of which contain nodes and storage

• Other applications that use the same storage can be protected as well

Disadvantages of this solution are:

• Manual operations are necessary in case of a disaster (complete loss of a site) in order to make the secondary copy available to the remaining node


5.1 Architecture

The diagram in Figure 5-1 presents the configuration that we propose for providing a disaster recovery solution for Oracle 10g RAC with GPFS.

Figure 5-1 DR configuration with two sites and Metro Mirror

In this configuration, both nodes A and B are part of the same GPFS cluster and Oracle RAC. Moreover, both nodes are active and can be used for submitting application workload. However, only storage in Site A is active and provides logical unit numbers (LUNs) for GPFS and RAC. Storage in Site B (secondary) provides replication for the LUNs in Site A. The LUNs in the secondary storage are unavailable to either node during normal operation.

During normal operation, the LUNs in the primary storage device are replicated synchronously to the secondary storage device. If the storage in Site A becomes unavailable, the system administrator must manually break the replication, activate the secondary copy, map the LUNs belonging to the secondary copy onto node B, and resume GPFS and Oracle operations.


Note: The configuration in Figure 5-1 requires two nodes. In normal configurations, both nodes are active at the same time and access the LUNs in Storage A. One of the benefits of this configuration is that it does not require additional (standby) hardware, and in case Site A fails, Site B can provide service with degraded performance (as opposed to requiring dedicated contingency hardware). Although reasonably simple, this configuration requires extensive effort for implementation and testing.

A more sophisticated configuration consists of two active nodes in Site A and two backup nodes (inactive) in Site B. However, this configuration adds an additional complexity level for the (manually initiated) failover and failback operations.



You must take extra precautions when performing recovery. During normal operation, the replicated LUNs in Storage B are not mapped to any of the nodes (because they have the same IDs as the LUNs in the primary storage). During the recovery process, you must prevent LUNs with the same IDs from becoming active at the same time, because duplicate LUNs confuse the applications (GPFS or Oracle). Thus, you must make sure that the replicated LUNs belonging to the primary (failing) storage are unmapped from both nodes before you resume the primary storage operation.

When the storage subsystem in Site A is restored to operational status, the system administrator must reinitiate the replication process to synchronize the copies. After the data has been synchronized, the secondary storage can either remain active (as the primary copy), or the original configuration can be restored manually.

5.2 Implementation

Metro Mirror for IBM System Storage DS8000 (formerly Peer to Peer Remote Copy (PPRC)) is a storage replication function that is completely platform and application independent. PPRC can provide replication between sites for all types of storage methods that are used for Oracle. It can be used for both stand-alone and RAC (clustered) databases. Database files can be plain files (JFS or JFS2), raw devices, ASM, or files in GPFS file systems.

In this test, we used a two-node RAC/GPFS cluster and two IBM System Storage DS8000 units. To simulate the two locations, the SAN uses two IBM 2109-F32 switches that are connected using long-wave, single-mode optical fiber (1300 nm LW GBICs).

We have configured LUNs, masking, and zoning to support our configuration. We do not describe the masking and zoning process in this book.

In this section, we describe how we establish replication between the two storage units, the actions that we must take when storage in Site A becomes unavailable, and the steps to perform when the primary storage is recovered.

5.2.1 Storage and PPRC configuration

We use the DS command-line interface (dscli), which is installed on both nodes, A and B, and we work on both nodes as required. First, we check the storage IDs of the two systems that we want to use for replication, as shown in Example 5-1 on page 188 (the ds_A.profile and ds_B.profile profiles are used to connect dscli to the management consoles of Storage A and Storage B).
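The contents of such a dscli profile typically look like the following sketch. The HMC host name, user, and password file path are placeholders; the devid value is the storage image ID reported by lssi, and comments start with #:

# /opt/ibm/dscli/profile/ds_A.profile  (illustrative sketch only)
# Management console for Storage A (placeholder host name)
hmc1: hmc_siteA.example.com
# DS CLI user; an encrypted password file created with the managepwfile command can be used
username: admin
pwfile: /opt/ibm/dscli/security/ds_A.pwfile
# Default storage image ID, as shown by lssi
devid: IBM.2107-75N0291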

Note: This configuration is based on synchronous replication. The distance between sites is a factor that affects the performance of your application.

Important: Metro Mirror (PPRC) is used to replicate all of the LUNs that are used for our configuration, which include:

• Oracle Cluster Repository
• CRS voting disks
• GPFS NSDs


Example 5-1 Checking the storage subsystems’ IDs

root@dallas1:/> dscli -cfg /opt/ibm/dscli/profile/ds_A.profile
dscli> lssi
Date/Time: November 22, 2007 4:43:39 PM EET IBM DSCLI Version: 5.1.720.139
Name ID               Storage Unit     Model WWNN             State  ESSNet
================================================================================
ds_A IBM.2107-75N0291 IBM.2107-75N0290 932   5005076306FFC1DE Online Enabled

# And for second storage:

root@dallas1:/> dscli -cfg /opt/ibm/dscli/profile/ds_B.profile
dscli> lssi
Date/Time: November 22, 2007 4:43:42 PM EET IBM DSCLI Version: 5.1.720.139
Name ID               Storage Unit     Model WWNN             State  ESSNet
=================================================================================
ds_B IBM.2107-7572791 IBM.2107-7572790 922   5005076303FFC46A Online Enabled

We assume that the LUN configuration has already been performed, and we use only one pair of LUNs to show our configuration.

Check the pair of LUNs that are going to be used for replication on both storage subsystems, as shown in Example 5-2.

Example 5-2 Checking the LUNs

dscli> lsfbvol 9070
Date/Time: November 22, 2007 4:47:25 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
Name   ID   accstate datastate configstate deviceMTM datatype extpool cap (2^30B) cap (10^9B) cap (blocks)
==================================================================================
vol1_A 9070 Online   Normal    Normal      2107-900  FB 512   P32     10.0        -           20971520

# On second storage:

dscli> lsfbvol 9070
Date/Time: November 22, 2007 4:46:21 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
Name   ID   accstate datastate configstate deviceMTM datatype extpool cap (2^30B) cap (10^9B) cap (blocks)
==================================================================================
vol1_B 9070 Online   Normal    Normal      2107-900  FB 512   P20     10.0        -           20971520

Check the PPRC links available between the two storage subsystems, as shown in Example 5-3 on page 189.

Note: In your environment, you must make sure that all LUNs belonging to your application are replicated, including OCR and CRS voting disks and GPFS tiebreaker disks. We recommend that you script the failover and failback process and test it thoroughly before deploying the production environment.


Example 5-3 Checking the PPRC links

# on Storage A:

dscli> lspprcpath 90
Date/Time: November 22, 2007 4:42:30 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
Src Tgt State   SS   Port  Attached Port Tgt WWNN
=========================================================
90  90  Success FF90 I0100 I0332         5005076303FFC46A
90  90  Success FF90 I0101 I0333         5005076303FFC46A
90  90  Success FF90 I0230 I0302         5005076303FFC46A
90  90  Success FF90 I0231 I0231         5005076303FFC46A

# and on Storage B:

dscli> lspprcpath 90
Date/Time: November 22, 2007 4:42:26 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
Src Tgt State   SS   Port  Attached Port Tgt WWNN
=========================================================
90  90  Success FF90 I0231 I0231         5005076306FFC1DE
90  90  Success FF90 I0302 I0230         5005076306FFC1DE
90  90  Success FF90 I0332 I0100         5005076306FFC1DE
90  90  Success FF90 I0333 I0101         5005076306FFC1DE

Create the PPRC relationship between storage in A and B (A → B), as shown in Example 5-4.

Example 5-4 Creating PPRC relationship A → B

# on Storage A:
dscli> mkpprc -remotedev IBM.2107-7572791 -type mmir 9070:9070
Date/Time: November 22, 2007 4:49:08 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
CMUC00153I mkpprc: Remote Mirror and Copy volume pair relationship 9070:9070 successfully created.

List the PPRC relationship as shown in Example 5-5. At this point, the two copies are not synchronized yet.

Example 5-5 PPRC relationship is not synchronized

dscli> lspprc -remotedev IBM.2107-7572791 -l 9070:9070
Date/Time: November 22, 2007 4:50:12 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
ID        State        Reason Type         Out Of Sync Tracks Tgt Read Src Cascade Tgt Cascade Date Suspended SourceLSS Timeout (secs) Critical Mode First Pass Status GMIR CG  PPRC CG
========================================================================================================================================================================================
9070:9070 Copy Pending -      Metro Mirror 10811              Disabled Disabled    Invalid     -              90        300            Disabled      Invalid           Disabled Disabled


After a while (depending on the distance and LUN size), the copies are synchronized, as shown in Example 5-6.

Example 5-6 Synchronized copies

dscli> lspprc -remotedev IBM.2107-7572791 -l 9070:9070
Date/Time: November 22, 2007 4:54:38 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
ID        State       Reason Type         Out Of Sync Tracks Tgt Read Src Cascade Tgt Cascade Date Suspended SourceLSS Timeout (secs) Critical Mode First Pass Status GMIR CG  PPRC CG
=======================================================================================================================================================================================
9070:9070 Full Duplex -      Metro Mirror 0                  Disabled Disabled    Invalid     -              90        300            Disabled      Invalid           Disabled Disabled

Example 5-7 lists the relationship as seen from storage B (to connect to the storage B console, use dscli -cfg /opt/ibm/dscli/profile/ds_B.profile):

Example 5-7 Checking PPRC relationship from storage B

dscli> lspprc -remotedev IBM.2107-75N0291 9070:9070
Date/Time: November 22, 2007 4:56:51 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
CMUC00234I lspprc: No Remote Mirror and Copy found.

As you can see, the PPRC relationship is only from A to B.

5.2.2 Recovering from a disaster

We have simulated a disaster by pausing the PPRC replica, as shown in Example 5-8. This is equivalent to the storage in Site A being lost.

This situation is the most complicated of all of the recovery situations, because the node in Site A is still available. Therefore, we must take extra precautions when reconfiguring the LUN mapping on both nodes.

Example 5-8 Pausing the PPRC replication

# on Storage A:
dscli> pausepprc -remotedev IBM.2107-7572791 9070:9070
Date/Time: November 22, 2007 5:02:08 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
CMUC00157I pausepprc: Remote Mirror and Copy volume pair 9070:9070 relationship successfully paused.

dscli> lspprc -remotedev IBM.2107-7572791 9070:9070
Date/Time: November 22, 2007 5:04:29 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
ID        State     Reason      Type         SourceLSS Timeout (secs) Critical Mode First Pass Status
=====================================================================================================
9070:9070 Suspended Host Source Metro Mirror 90        300            Disabled      Invalid

Connected to storage B, we activate the secondary copy using the command that is shown in Example 5-9.

Example 5-9 Reversing PPRC replicas

# on Storage B:
dscli> failoverpprc -remotedev IBM.2107-75N0291 -type mmir 9070:9070
Date/Time: November 22, 2007 5:03:21 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
CMUC00196I failoverpprc: Remote Mirror and Copy pair 9070:9070 successfully reversed.

dscli> lspprc -remotedev IBM.2107-75N0291 9070:9070
Date/Time: November 22, 2007 5:04:10 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
ID        State     Reason      Type         SourceLSS Timeout (secs) Critical Mode First Pass Status
=====================================================================================================
9070:9070 Suspended Host Source Metro Mirror 90        300            Disabled      Invalid

Next, make sure that Oracle and GPFS are stopped on both nodes. Then, unmap the LUNs belonging to Storage A from both nodes, A and B; if Storage A is unavailable, make sure that when it comes back up, the LUNs used for PPRC are not available to either node A or node B.

After the LUNs in Storage A have been unmapped, map the replicated LUNs (in this case, vol1_B) to both nodes (A and B). Start GPFS and check that it can see the NSDs. Also make sure that the OCR and CRS voting disks are available from Storage B.
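The node-side part of this failover can be scripted. The following ksh sketch summarizes the sequence under our assumptions; the database name placeholder, the hdisk numbers, and the oradata file system name are examples from our environment and must be adapted:

# Run as root; storage-level unmapping/mapping is done separately on the DS8000
srvctl stop database -d <dbname>        # stop the RAC database (name is a placeholder)
crsctl stop crs                         # stop Oracle Clusterware on each node
mmshutdown -a                           # stop GPFS on all nodes (run once, from one node)
# On each node, remove the hdisk definitions that point to the unreachable Storage A LUNs
for d in hdisk10 hdisk11 hdisk12 hdisk13; do rmdev -dl $d; done
# ... unmap the Storage A LUNs and map the Storage B LUNs at the storage subsystem level ...
cfgmgr                                  # rediscover the replicated LUNs on each node
mmstartup -a                            # restart GPFS (run once, from one node)
mmmount oradata -a                      # mount the file system on all nodes
mmlsdisk oradata -M                     # verify that the NSDs are found on the new hdisks
crsctl start crs                        # restart Oracle Clusterware, then the database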

Verify the NSDs’ availability and file system quorum as shown in Example 5-10. Make sure that all disks are accessed via the localhost (direct access according to RAC requirements).

Example 5-10 Checking disk availability

root@austin1:/> mmlsdisk oradata -L
disk         driver   sector failure holds    holds                                    storage
name         type     size   group   metadata data  status        availability disk id pool       remarks
------------ -------- ------ ------- -------- ----- ------------- ------------ ------- ---------- ---------
nsd01        nsd      512    1       yes      yes   ready         up           1       system     desc
nsd02        nsd      512    1       yes      yes   ready         up           2       system     desc
nsd03        nsd      512    2       yes      yes   ready         up           3       system     desc
nsd04        nsd      512    2       yes      yes   ready         up           4       system
Number of quorum disks: 3
Read quorum value: 2
Write quorum value: 2

root@austin1:/> mmlsdisk oradata -M

Note: Unmapping LUNs is storage-specific, and we do not discuss it in this publication. For storage subsystem operations, check with your storage/SAN administrator to make sure that you understand the consequences of any action that you might take.


Disk name    IO performed on node    Device            Availability
------------ ----------------------- ----------------- ------------
nsd01        localhost               /dev/hdisk10      up
nsd02        localhost               /dev/hdisk11      up
nsd03        localhost               /dev/hdisk12      up
nsd04        localhost               /dev/hdisk13      up
root@austin1:/>

Check the GPFS cluster quorum and node availability using the mmgetstate -a -L command, as shown in Example 5-11.

Example 5-11 Checking cluster quorum

root@austin1:/> mmgetstate -a -L

 Node number  Node name             Quorum  Nodes up  Total nodes  GPFS state  Remarks
------------------------------------------------------------------------------------
      1       austin1_interconnect    2*       2          3        active      quorum node
      2       austin2_interconnect    2*       2          3        active      quorum node
root@austin1:/>

When the GPFS file system is available, you can start Oracle Clusterware (CRS), then Oracle RAC, and resume operation.

5.2.3 Restoring the original configuration (primary storage in site A)

The failback operation requires more attention, because restoring the primary PPRC relationship (A → B) requires more steps than the failover operation.

The failback process requires the following steps:

1. Stop all operations (database, CRS, and GPFS).

2. On both AIX nodes, delete all disks (hdisk*) belonging to GPFS, Oracle OCR, and CRS.

3. Perform the PPRC steps that are required to restore the original A → B relationship. Make sure that the replicas are in sync before switching back to Site A.

4. Restore the original mapping on both storage subsystems.

5. Run cfgmgr on both nodes.

6. Make sure that disks are available to RAC and GPFS.

7. Start GPFS and RAC.

In this section, we describe only step 3, because the other steps have been discussed in other sections or materials.

Important: Restoring the original configuration is a disruptive action and requires planned downtime.

We recommend that you script all operations and check the procedures before putting your system in production.


Step 3: Perform the PPRC steps

Perform the PPRC steps that are required to restore the original A → B relationship. Make sure that the replicas are in sync before switching back to Site A. The steps are:

1. Delete the original A → B PPRC relationships:

On Storage A, run the command that is shown in Example 5-12.

Example 5-12 Deleting the original PPRC relationship on Storage A

dscli> rmpprc -remotedev IBM.2107-7572791 9070:9070
Date/Time: November 22, 2007 5:08:02 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
CMUC00160W rmpprc: Are you sure you want to delete the Remote Mirror and Copy volume pair relationship 9070:9070:? [y/n]:y
CMUC00155I rmpprc: Remote Mirror and Copy volume pair 9070:9070 relationship successfully withdrawn.

2. Repeat on Storage B, as shown in Example 5-13.

Example 5-13 Deleting original PPRC relationship on Storage B

dscli> rmpprc -remotedev IBM.2107-75N0291 9070:9070
Date/Time: November 22, 2007 5:10:16 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
CMUC00160W rmpprc: Are you sure you want to delete the Remote Mirror and Copy volume pair relationship 9070:9070:? [y/n]:y
CMUC00155I rmpprc: Remote Mirror and Copy volume pair 9070:9070 relationship successfully withdrawn.

3. Next, create a new (scratch) PPRC relationship, B → A.

On Storage B, run the command shown in Example 5-14.

Example 5-14 Creating B → A PPRC relationship

dscli> mkpprc -remotedev IBM.2107-75N0291 -type mmir 9070:9070
Date/Time: November 22, 2007 5:11:28 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
CMUC00153I mkpprc: Remote Mirror and Copy volume pair relationship 9070:9070 successfully created.

4. The synchronization process starts automatically when you create the relationship. Check for the synchronized copy by using the command shown in Example 5-15.

Example 5-15 Checking PPRC status

dscli> lspprc -remotedev IBM.2107-75N0291 9070:9070
Date/Time: November 22, 2007 5:11:39 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
ID        State        Reason Type         SourceLSS Timeout (secs) Critical Mode First Pass Status
===================================================================================================
9070:9070 Copy Pending -      Metro Mirror 90        300            Disabled      Invalid

5. After the synchronization, start the process of moving the primary copy back to Site A:

a. Pause the B → A relationship, as shown in Example 5-16 on page 194 (commands run on Storage B).


Example 5-16 Suspending the PPRC relationship

dscli> pausepprc -remotedev IBM.2107-75N0291 9070:9070
Date/Time: November 22, 2007 5:15:38 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
CMUC00157I pausepprc: Remote Mirror and Copy volume pair 9070:9070 relationship successfully paused.

dscli> lspprc -remotedev IBM.2107-75N0291 9070:9070
Date/Time: November 22, 2007 5:15:57 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
ID        State     Reason      Type         SourceLSS Timeout (secs) Critical Mode First Pass Status
=====================================================================================================
9070:9070 Suspended Host Source Metro Mirror 90        300            Disabled      Invalid

b. Fail over to Site A. On Storage A, execute the commands shown in Example 5-17.

Example 5-17 Fail over to Storage A (PPRC relationship is still suspended)

# On Storage A:
dscli> failoverpprc -remotedev IBM.2107-7572791 -type mmir 9070:9070
Date/Time: November 22, 2007 5:17:28 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
CMUC00196I failoverpprc: Remote Mirror and Copy pair 9070:9070 successfully reversed.

# As seen on Storage A:
dscli> lspprc -remotedev IBM.2107-7572791 9070:9070
Date/Time: November 22, 2007 5:17:39 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
ID        State     Reason      Type         SourceLSS Timeout (secs) Critical Mode First Pass Status
=====================================================================================================
9070:9070 Suspended Host Source Metro Mirror 90        300            Disabled      Invalid

c. Fail back A → B, as shown in Example 5-18 (commands run on Storage A). Check for the synchronized copy.

Example 5-18 Checking for synchronized copy

dscli> failbackpprc -remotedev IBM.2107-7572791 -type mmir 9070:9070
Date/Time: November 22, 2007 5:18:44 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
CMUC00197I failbackpprc: Remote Mirror and Copy pair 9070:9070 successfully failed back.

dscli> lspprc -remotedev IBM.2107-7572791 9070:9070
Date/Time: November 22, 2007 5:19:03 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
ID        State       Reason Type         SourceLSS Timeout (secs) Critical Mode First Pass Status
==================================================================================================
9070:9070 Full Duplex -      Metro Mirror 90        300            Disabled      Invalid

d. Check the relationship on Storage B, as shown in Example 5-19.

Example 5-19 Checking PPRC relationship on Storage B

dscli> lspprc -remotedev IBM.2107-75N0291 9070:9070
Date/Time: November 22, 2007 5:19:27 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-7572791
CMUC00234I lspprc: No Remote Mirror and Copy found.

6. At this point, redo the original LUN mapping, so that both AIX nodes see only the LUNs that belong to the primary PPRC copy, located in Storage A.

7. Next, run cfgmgr on both AIX nodes and make sure that all LUNs are available, including OCR and CRS voting disks.

8. Start CRS, GPFS, and Oracle RAC. A sketch of steps 6 through 8 follows this list.
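A ksh sketch of steps 6 through 8, run after the A → B relationship is re-established and the original LUN mapping is restored (the oradata file system name is from our environment, and the database name is a placeholder):

cfgmgr                                  # 6. rediscover the LUNs that Storage A serves again (on both nodes)
lsdev -Cc disk                          #    all expected hdisks should show as Available
mmstartup -a                            # 7. start GPFS (run once, from one node) ...
mmmount oradata -a                      #    ... and mount the file system on all nodes
mmlsdisk oradata -M                     #    verify NSD availability and the hdisk mapping
crsctl start crs                        # 8. start Oracle Clusterware on each node
srvctl start database -d <dbname>       #    then start the RAC database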


Chapter 6. Maintaining your environment

In this chapter, we discuss how Oracle and GPFS work together to make the system and database administrator’s tasks easier and safer. Maintaining a database environment includes tasks such as backups and database cloning for test and validation purposes. GPFS V3.1 provides facilities that simplify these tasks. We describe how these facilities are used, with examples that we tested in our environment. This chapter contains the following information:

• Database backups and cloning with GPFS snapshots

• GPFS storage pools and Oracle data partitioning:

– GPFS 3.1 storage pools

– GPFS 3.1 filesets

– GPFS policies and rules

– Oracle data partitioning


6.1 Database backups and cloning with GPFS snapshots

This section describes how to use the GPFS snapshot to clone a database for a test environment or for taking an offline backup.

6.1.1 Overview of GPFS snapshots

A GPFS snapshot is a space-efficient, logical copy of a GPFS file system at a single point in time. Physically, a snapshot contains only a copy of the file system data that has changed since the snapshot was created. It enables a backup or mirroring application to run concurrently with user updates and still obtain a consistent copy of the file system as it was at the time that the snapshot was created.

The GPFS snapshot is designed to be fast. GPFS performs a file system synchronization of all dirty data, blocks new requests, synchronizes any new dirty data again, creates the empty snapshot inode file, and then resumes operations. The slow part is waiting for all existing file system write requests to complete and to be synchronized to disk. A very busy file system can be blocked for several seconds while all dirty data is flushed.

The GPFS mmbackup utility uses snapshots to back up the contents of a GPFS file system at a point in time to a Tivoli® Storage Manager server. GPFS snapshots also provide an online backup capability that allows quick recovery from accidentally deleted files.

Snapshots are read-only, so changes are only made in active files and directories. Because snapshots are not a copy of the entire file system, they cannot be used as protection against disk subsystem failures.

When using GPFS snapshots with databases that perform Direct I/O to disk (Oracle uses this feature), there is a severe performance penalty while the snapshot exists and is being backed up. Every time that a write occurs, GPFS checks to make sure that the old block is copied-on-write to the snapshot file. This extra checking overhead can double or triple the normal I/O time.

The default name of the GPFS snapshots subdirectory is “.snapshots”.

GPFS snapshots commands

This section quickly describes useful commands to handle GPFS snapshots.

mmcrsnapshot

The mmcrsnapshot command creates a snapshot of an entire GPFS file system at a single point in time. The command syntax is:

mmcrsnapshot Device Directory

Where:

• Device is the device name of the file system for which the snapshot is to be created. File system names do not need to be fully qualified. Using oradata is just as acceptable as /dev/oradata.

• Directory is the subdirectory name where the snapshots are stored. This is a subdirectory of the root directory and must be a unique name within the root directory.

196 Deploying Oracle 10g RAC on AIX V5 with GPFS

Page 211: efrit.tistory.comefrit.tistory.com/attachment/cfile26.uf@135206104BA8089457FB44.pdf · iv Deploying Oracle 10g RAC on AIX V5 with GPFS 3.2.4 Installing Oracle RAC option using OUI

mmlssnapshot

The mmlssnapshot command displays GPFS snapshot information for the specified file system.

The syntax is:

mmlssnapshot Device [-d] [-Q]

Where:

• Device is the device name of the file system for which snapshot information is to be shown.

• -d displays the amount of storage used by the snapshot.

• -Q displays whether quotas were set to be automatically activated upon mounting the file system at the time that the snapshot was taken.

mmdelsnapshot

The mmdelsnapshot command deletes a GPFS snapshot. It has the following syntax:

mmdelsnapshot Device Directory

Where:

• Device is the device name of the file system for which the snapshot is to be deleted.

• Directory is the snapshot subdirectory to be deleted.

mmrestorefs

The mmrestorefs command restores a file system from a GPFS snapshot; a short usage sketch follows the parameter descriptions below. The syntax is:

mmrestorefs Device Directory [-c]

Where:

• Device is the device name of the file system for which the restore is to be run.

• Directory is the snapshot with which to restore the file system.

• -c continues to restore the file system in the event that errors occur.
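A minimal usage sketch, assuming the oradata file system and a snapshot named snap1 from our environment (remember that the file system must be unmounted on all nodes before the restore):

mmumount oradata -a          # unmount the file system everywhere
mmrestorefs oradata snap1    # roll the file system back to the snap1 snapshot
mmmount oradata -a           # mount it again on all nodes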

mmsnapdir

The mmsnapdir command creates and deletes invisible directories that connect to the snapshots of a GPFS file system and changes the name of the snapshots subdirectory. The syntax is:

mmsnapdir Device {[-r | -a] [-s SnapDirName]}
mmsnapdir Device [-q]

Where:

• Device is the device name of the file system.

• -a adds a snapshots subdirectory to all subdirectories in the file system.

• -q displays the current settings if it is issued without any other flags.

• -r reverses the effect of the -a option. All invisible snapshot directories are removed. The snapshot directory under the file system root directory is not affected.


GPFS snapshot command examples

Here are GPFS snapshot command examples from our test cluster.

In Example 6-1, we use the time command to check the elapsed time during the execution of the mmcrsnapshot command.

Example 6-1 mmcrsnapshot example

root@alamo1:/tmp/mah> time mmcrsnapshot oradata snap1
Writing dirty data to disk
Quiescing all file system operations
Writing dirty data to disk again
Creating snapshot.
Resuming operations.

real    0m0.64s
user    0m0.19s
sys     0m0.05s

root@alamo1:/tmp/mah> mmlssnapshot oradata -d
Snapshots in file system oradata: [data and metadata in KB]
Directory                SnapId    Status  Created                   Data     Metadata
snap1                    4         Valid   Wed Sep 26 15:13:10 2007  1536     1120

root@alamo1:/tmp/mah> ls -l /oradata/
total 33
dr-xr-xr-x   3 root     system         8192 Sep 26 15:13 .snapshots
drwxr-xr-x   2 oracle   dba            8192 Sep 25 15:10 ALAMO

root@alamo1:/tmp/mah> ls -l /oradata/.snapshots/
total 32
drwxr-xr-x   3 oracle   dba            8192 Sep 24 16:28 snap1

root@alamo1:/tmp/mah> ls -l /oradata/.snapshots/snap1/
total 32
drwxr-xr-x   2 oracle   dba            8192 Sep 24 16:44 ALAMO

The last few ls commands show that the default snapshots directory name is .snapshots and that this subdirectory exists in the root directory of the file system.

You can change the default subdirectory using the mmsnapdir command as shown in Example 6-2.

Example 6-2 mmsnapdir command example

root@alamo1:/tmp/mah> mmsnapdir oradata -s .oradatasnapshots
root@alamo1:/tmp/mah> ls -l /oradata/
total 33
dr-xr-xr-x   3 root     system         8192 Sep 26 15:13 .oradatasnapshots
drwxr-xr-x   2 oracle   dba            8192 Sep 25 15:10 ALAMO
root@alamo1:/tmp/mah> ls -l /oradata/.oradatasnapshots/
total 32
drwxr-xr-x   3 oracle   dba            8192 Sep 24 16:28 snap1
root@alamo1:/tmp/mah> ls -l /oradata/.oradatasnapshots/snap1/
total 32
drwxr-xr-x   2 oracle   dba            8192 Sep 25 15:10 ALAMO

Example 6-3 on page 199 shows how to delete a GPFS snapshot.


Example 6-3 Deleting GPFS snapshot with mmdelsnapshot command

root@alamo1:/tmp/mah> mmdelsnapshot oradata snap1
Deleting snapshot files...
Delete snapshot snap1 complete, err = 0
root@alamo1:/tmp/mah> mmlssnapshot oradata -d
Snapshots in file system oradata: [data and metadata in KB]
Directory                SnapId    Status  Created                   Data     Metadata

GPFS snapshots possible errors

The following list presents possible errors that you might encounter while working with GPFS snapshots:

• mmcrsnapshot is unable to create a snapshot, because there are already 31 existing snapshots in the file system.

• mmcrsnapshot cannot create a snapshot, because the specified snapshot subdirectory already exists in the file system.

• If there is a conflicting command running at the same time, such as another mmcrsnapshot or mmdelsnapshot, the second command waits.

• mmdelsnapshot cannot delete the snapshot, because the specified snapshot does not exist.

• mmrestorefs cannot restore the snapshot, because the file system is mounted on one or more nodes. The GPFS file system must be unmounted on all nodes before a restore is attempted.

• If mmrestorefs is interrupted, the file system can be left inconsistent. GPFS does not mount the file system until the restore command completes.

• Snapshot commands cannot be executed when GPFS is down on all of the nodes or none of the GPFS nodes are reachable.

6.1.2 GPFS snapshots and Oracle Database

Whenever administrators deploy a test or development Oracle environment, they must create a copy of the entire production database. For large databases and for systems with restricted maintenance windows, a conventional copy is usually not an option, because it takes too long.

Storage manufacturers provide functions in their disk systems, called flash copy or snapshots, that provide a point-in-time view of a specified volume at the storage level; these functions are not discussed in this document. A GPFS snapshot is a similar mechanism that is built into GPFS and is storage subsystem independent.

Note: For an overview of GPFS snapshots, refer to the chapter titled “Creating and maintaining snapshots of GPFS file system” in the GPFS V3.1 Advanced Administration Guide, SC23-5182.

A detailed description of snapshot commands is documented in the manual GPFS V3.1 Administration and Programming Reference, SA23-2221.


You might consider GPFS snapshots an ideal solution when they are used with an Oracle database for the following purposes:

• Cloning databases

• Performing fast backup and recovery of databases with short life spans (test or development systems)

• Reducing downtime when creating cold backups of databases

We present an overview of GPFS snapshots that are used with Oracle Database in Figure 6-1.

Figure 6-1 GPFS snapshots overview

In our test, the production database is located on the /oradata file system. Each horizontal block of the production file system (at the bottom of Figure 6-1) represents a single System Change Number (SCN), which is assigned to each transaction in the database. SCN numbers increase with time. They are used for consistency and recovery purposes.

At SCN+7, the administrator creates the first GPFS snapshot of the database file system. Because snapshots do not change with time (they are consistent and read-only), they can be used as the source for a database backup or database clone. Remember that in this scenario, the backup will not be consistent from a database point of view. After restoration from this backup, you must perform a recovery by using online redo logs. If redo logs are unavailable, recovery is impossible, which makes the backup unusable. After a copy of files within the /oradata/.snapshots/snap1 directory is complete, snapshot 1 can be deleted to save disk space.

At SCN+14, another snapshot is created. It might coexist with snapshot 1, but additional disk space is required. In this example, the user accidentally deletes data at time SCN+18. The database can be restored from snapshot 2, and the data reflects the state at SCN+14.


Creating a cold database backup with minimum downtime

In this scenario, we create a consistent backup of the Oracle database. A consistent database backup means that all transactions are written to the datafiles, so no database recovery needs to be performed during instance startup after a restore. To create a consistent backup, the database must be shut down before taking the file system snapshot. Database archivelog mode is not required. To perform a cold database backup, execute the following steps:

1. Shut down the database cleanly (shutdown or shutdown immediate only, because shutdown abort does not provide datafile consistency). From this time, the database is unavailable.

2. Take a snapshot of the GPFS file system that contains the database files. If the database files are spread across multiple GPFS file systems, snapshots must be created for all of these file systems.

3. Start and open the database. After the startup, the database is available again.

4. Copy all snapshot files to tape, or a different disk. No matter how long it takes to copy or how heavy the database workload is, the snapshot files remain consistent.

5. After completing the copy process, you can remove the snapshot.

When the restore process is necessary, backup files must be copied back to the original database location, not to the snapshots directory. After that operation, you can open the Oracle database, and no recovery process is performed while starting up the instance.

The database has to be stopped before creating a snapshot, which means all users have to be logged off, and all applications connected to the database must be shut down. Downtime caused by stopping and starting the database might be as long as several minutes, but taking a GPFS snapshot of the file system with Oracle datafiles takes only a fraction of a second. Overall downtime might be as long as 10 - 15 minutes (or longer, depending on how much time is required to start the remaining applications), but when considering this method for a large database, using GPFS snapshots can reduce system unavailability from hours to minutes.
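Under these assumptions (database name ALAMO, a single /oradata file system, a snapshot name and a backup target directory that are placeholders), the whole cold-backup flow can be summarized in a short ksh sketch; the individual steps are demonstrated in 6.1.3, “Examples”:

srvctl stop database -d ALAMO             # 1. clean shutdown of all RAC instances
mmcrsnapshot oradata snap_backup          # 2. snapshot the GPFS file system
srvctl start database -d ALAMO            # 3. the database is available again
cp -Rp /oradata/.snapshots/snap_backup/ALAMO /backup/ALAMO_cold   # 4. copy the frozen files (target is a placeholder)
mmdelsnapshot oradata snap_backup         # 5. remove the snapshot after the copy completes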

Cloning databases

Database “cloning” means creating a copy or multiple copies of a database, usually for testing and development purposes.

To create a database clone, follow these steps:

1. On the source database, back up the controlfile to the trace directory with the following command:

SQL> alter database backup controlfile to trace

2. Shut down the source database:

SQL> shutdown (or shutdown immediate)

3. Create a snapshot of the GPFS file system with database files.

4. Start and open the source database.

5. Create a new init.ora file for the cloned Oracle instance. The original (source) database init.ora can be used, but make sure to modify the file paths so that they reflect the new values of the target database. If the database uses spfile, create a pfile first.

6. Copy the controlfile trace file created in step 1 from the user dump directory and change the database name and file paths. Insert the SET keyword before the database name, so that the statement reads CREATE CONTROLFILE SET DATABASE (a sketch follows this list).

7. Copy the GPFS snapshot files to a new location (cloned database files).

8. Remove the GPFS snapshot (if necessary).


9. Export the new ORACLE_SID and start the new database in NOMOUNT mode. Recreate the control file using the file that you prepared in step 6.
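The edited controlfile script from step 6 typically ends up looking similar to the following sketch. The clone database name ALAMOCLN and the /oradata/ALAMOCLN paths are hypothetical values that we use only for illustration; your trace file contains the actual file list, limits, and character set for your database.

CREATE CONTROLFILE SET DATABASE "ALAMOCLN" RESETLOGS NOARCHIVELOG
    MAXLOGFILES 192
    MAXDATAFILES 1024
    MAXINSTANCES 32
LOGFILE
  GROUP 1 '/oradata/ALAMOCLN/redo01.log' SIZE 50M,
  GROUP 2 '/oradata/ALAMOCLN/redo02.log' SIZE 50M
DATAFILE
  '/oradata/ALAMOCLN/system01.dbf',
  '/oradata/ALAMOCLN/sysaux01.dbf',
  '/oradata/ALAMOCLN/undotbs01.dbf',
  '/oradata/ALAMOCLN/userd01.dbf'
CHARACTER SET WE8ISO8859P1;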

When cloning the database to a new host, the database name and paths do not have to be modified. The process is easier, because the control files do not have to be recreated, and init.ora parameters do not have to be changed. Only the source files from the snapshots directory have to be transferred to the target host.

6.1.3 Examples

We provide several examples in this section.

Cold database backup

The following example is a step-by-step procedure of how to create a cold (consistent) database backup. In this example, a clustered database is backed up.

The initial configuration of this example is:

� The database name is ALAMO, and it is a clustered two-node RAC database.
� There is one GPFS file system for Oracle data files (/oradata).
� All database files are located in the /oradata/ALAMO directory.

The steps are:

1. The first step is to shut down all database instances by using the command shown in Example 6-4.

Example 6-4 Shutting down ALAMO database instances

$ srvctl stop database -d ALAMO

2. Next, create a GPFS snapshot of the database file system. If the database spans multiple GPFS file systems, a snapshot has to be created on each of these file systems. We have created a snapshot by using the command shown in Example 6-5.

Example 6-5 Creating a snapshot of the GPFS file system

root@alamo1:/oradata> time mmcrsnapshot oradata snap1
Writing dirty data to disk
Quiescing all file system operations
Writing dirty data to disk again
Creating snapshot.
Resuming operations.

real    0m0.66s
user    0m0.18s
sys     0m0.05s

All database files (all /oradata file system) are frozen in the /oradata/.snapshots/snap1 directory, as presented in Example 6-6 on page 203. They will not change after the modification of the original files (/oradata).


Example 6-6 Contents of the GPFS snapshot directory

root@alamo1:/oradata> ls -l /oradata/.snapshots/snap1/ALAMO
total 41531904
-rw-r-----   1 oracle   dba        11616256 Sep 27 10:21 control01.ctl
-rw-r-----   1 oracle   dba        11616256 Sep 27 10:21 control02.ctl
-rw-r-----   1 oracle   dba        11616256 Sep 27 10:21 control03.ctl
-rw-r-----   1 oracle   dba            1536 Sep 24 17:11 orapwALAMO
-rw-r-----   1 oracle   dba        52429312 Sep 27 09:00 redo01.log
-rw-r-----   1 oracle   dba        52429312 Sep 27 10:21 redo02.log
-rw-r-----   1 oracle   dba        52429312 Sep 25 15:15 redo03.log
-rw-r-----   1 oracle   dba        52429312 Sep 27 10:21 redo04.log
-rw-r-----   1 oracle   dba            3584 Sep 25 15:15 spfileALAMO.ora
-rw-r-----   1 oracle   dba       146808832 Sep 27 10:21 sysaux01.dbf
-rw-r-----   1 oracle   dba       398467072 Sep 27 10:21 system01.dbf
-rw-r-----   1 oracle   dba        51388416 Sep 26 22:01 temp01.dbf
-rw-r-----   1 oracle   dba       990912512 Sep 27 10:21 undotbs01.dbf
-rw-r-----   1 oracle   dba       209723392 Sep 27 10:21 undotbs02.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 27 10:21 userd01.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 27 10:21 userd02.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 27 10:21 userd03.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 27 10:21 userd04.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 27 10:21 userx01.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 27 10:21 userx02.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 27 10:21 userx03.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 27 10:21 userx04.dbf

3. After creating the snapshot, you can start the database. Because taking the GPFS snapshot is very quick and the files will be backed up later, the overall downtime of the database is much shorter than it would be if the database were kept down until the backup process finished. Start the database as shown in Example 6-7.

Example 6-7 Starting ALAMO database instances

$ srvctl start database -d ALAMO

Even though Oracle opens and modifies the database files while the database is running, the snapshot contents do not change. Snapshot files can be backed up without stopping the database, and the consistency of the backup is not threatened. While the database is running, changes to the original files cause the affected blocks to be preserved in the snapshot (copy-on-write), which is reflected in the snapshot size shown in Example 6-8. We can see that (for now) only about 40 MB of data and 1.7 MB of metadata in the GPFS file system have changed, but these numbers will increase with time.

Example 6-8 mmlssnapshot command output

root@alamo1:/> mmlssnapshot oradata -d
Snapshots in file system oradata: [data and metadata in KB]
Directory                SnapId    Status  Created                   Data     Metadata
snap1                    5         Valid   Thu Sep 27 10:23:13 2007  40464    1760


Example 6-9 demonstrates how to back up snapshot files using the tar command.

Example 6-9 Backing up snapshot files with the tar and gzip commands

root@alamo1:/> cd /oradata/.snapshots/snap1/ALAMO
root@alamo1:/oradata/.snapshots/snap1/ALAMO> tar cfv - * | gzip > /backup/ALAMO.tar.gz

After all of the files are archived, the GPFS snapshot is no longer necessary and can be deleted to preserve disk space, as shown in Example 6-10.

Example 6-10 Deleting the GPFS snapshot

root@alamo1:/> mmdelsnapshot oradata snap1
Deleting snapshot files...
Delete snapshot snap1 complete, err = 0

root@alamo1:/> mmlssnapshot oradata -d
Snapshots in file system oradata: [data and metadata in KB]
Directory                SnapId    Status  Created                   Data     Metadata

Whenever restore is necessary, the .tar.gz file created in the previous step can be restored to the original database location.

Creating an inconsistent database backup with zero downtime

GPFS snapshots can be created regardless of whether the file system is being used (changed), and the snapshots are always consistent at the file system level. However, because the database backup is inconsistent, a recovery process is needed after the restore and before users can log in. This example demonstrates how to create a GPFS snapshot while the database is running and how to restore and recover the database.

The initial system configuration for this example was:

� The database ALAMO is in a two-node RAC configuration.
� All of the database files are on a single GPFS file system named /oradata.

The database is open all of the time, and the sample load is generated from a different machine on the network.

The steps are:

1. As a first step, a GPFS snapshot is created, as shown in Example 6-11.

Example 6-11 Creating a GPFS snapshot

root@alamo1:/> time mmcrsnapshot oradata snap1
Writing dirty data to disk
Quiescing all file system operations
Writing dirty data to disk again
Creating snapshot.
Resuming operations.

real    0m0.67s
user    0m0.20s
sys     0m0.04s


2. Snapshot data (the files in the /oradata/.snapshots/snap1 directory) was backed up with tar and gzip, as in the previous example. After this step, the snapshot was removed. Example 6-12 shows the detailed command.

Example 6-12 Deleting snapshot

root@alamo1:/> mmdelsnapshot oradata snap1
Deleting snapshot files...
Delete snapshot snap1 complete, err = 0

3. After a database corruption or the accidental deletion of a datafile, the database is stopped on both cluster nodes so that the files can be restored. Files are restored to the original database location by using the commands shown in Example 6-13.

Example 6-13 Restoring database files to the original location

{alamo1:oracle}/home/oracle -> cd /oradata/ALAMO
{alamo1:oracle}/oradata/ALAMO -> gunzip -c /home/oracle/ALAMO.tar.gz | tar xvf -

Because the snapshot was taken while the database was active, the restored database is inconsistent, and a recovery process is necessary before the database is usable. In this case, all of the database files were part of a single GPFS file system, which makes this scenario much easier, because consistency at the file system level is preserved.

4. In this case, the recovery process is handled automatically by the Oracle database. When starting the instance, after the MOUNT phase and before OPEN, a crash recovery is performed. Example 6-14 shows the alert.log entries for instance ALAMO1 that were logged after the MOUNT phase completed.

Example 6-14 Alert.log entries for the ALAMO1 instance

ALTER DATABASE MOUNT
Thu Sep 27 12:14:55 2007
This instance was first to mount
Setting recovery target incarnation to 1
Thu Sep 27 12:14:59 2007
Successful mount of redo thread 1, with mount id 275443199
Thu Sep 27 12:14:59 2007
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE MOUNT
Thu Sep 27 12:14:59 2007
ALTER DATABASE OPEN
This instance was first to open
Thu Sep 27 12:14:59 2007
Beginning crash recovery of 2 threads
 parallel recovery started with 3 processes
Thu Sep 27 12:15:00 2007
Started redo scan
Thu Sep 27 12:15:00 2007
Completed redo scan
 12723 redo blocks read, 4206 data blocks need recovery
Thu Sep 27 12:15:01 2007
Started redo application at
 Thread 1: logseq 802, block 3741
 Thread 2: logseq 4, block 93783, scn 3122006
Thu Sep 27 12:15:01 2007
Recovery of Online Redo Log: Thread 1 Group 2 Seq 802 Reading mem 0
  Mem# 0: /oradata/ALAMO/redo02.log
Thu Sep 27 12:15:01 2007
Recovery of Online Redo Log: Thread 2 Group 4 Seq 4 Reading mem 0
  Mem# 0: /oradata/ALAMO/redo04.log
Thu Sep 27 12:15:02 2007
Completed redo application
Thu Sep 27 12:15:03 2007
Completed crash recovery at
 Thread 1: logseq 802, block 16464, scn 3151958
 Thread 2: logseq 4, block 93783, scn 3142007
 4206 data blocks read, 4206 data blocks written, 12723 redo blocks read

In this scenario, the redo log files were part of the same file system as the rest of the database files. If the database spans multiple GPFS file systems, it is impossible to create several GPFS snapshots (one for each file system) at exactly the same time; the snapshots are therefore not consistent with one another, and recovery is not possible.

In this case, to guarantee consistency across several snapshots, the I/O must be frozen at the database level using the alter system suspend command, and after creating GPFS snapshots, resumed with the alter system resume command as presented in Example 6-15. In the case of Oracle RAC, alter system suspend and alter system resume are cluster-aware and global; therefore, the operations on all cluster nodes will be suspended or resumed accordingly.

Example 6-15 Suspending I/O operations in Oracle Database

SQL> alter system suspend;

System altered.

SQL> select database_status from v$instance;

DATABASE_STATUS
-----------------
SUSPENDED

SQL> alter system resume;

System altered.

SQL> select database_status from v$instance;

DATABASE_STATUS
-----------------
ACTIVE

Moreover, to be absolutely sure that the database can be recovered, run the database in archivelog mode and put all tablespaces (or the whole database) into backup mode before creating the snapshots. The Oracle Database Backup and Recovery Advanced User’s Guide, B14191-01, describes this procedure in detail.
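The following is a minimal sketch of that sequence for a database that spans two GPFS file systems; the file system names oradata1 and oradata2 are hypothetical, and the I/O is suspended only for the short time needed to create both snapshots:

SQL> alter system suspend;
root@alamo1:/> mmcrsnapshot oradata1 snap1
root@alamo1:/> mmcrsnapshot oradata2 snap1
SQL> alter system resume;

The snapshots of both file systems then reflect the same suspended state of the database and can be backed up as a set.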

6.2 GPFS storage pools and Oracle data partitioning

Both GPFS 3.1 and the Oracle database have functions that help maintain data in the Information Lifecycle Management process. Information Lifecycle Management (ILM) is a process for managing data throughout its life cycle, from creation until deletion, in a way that reduces the total cost of ownership by better managing the storage resources required for running and backing up or archiving the data. ILM is used to manage data placement (at creation time), data migration (moving data in the storage hierarchy) as its value changes, and data storage for disaster recovery or document retention.

GPFS Release 3.1 provides the following ILM tools:

� Storage pools
� Filesets
� Policies and rules

6.2.1 GPFS 3.1 storage pools

Storage pools are a logical organization of the underlying disks of a file system. They are a collection of disks with similar attributes that are managed as a group. Storage pools allow system administrators to manage file system storage based on performance, location, or reliability:

� The storage pool is an attribute of disk inside a file system. It is defined as a field in each disk descriptor when the file system is created or when disks are added to an existing file system.

� Each file system can have up to eight storage pools.

� Files are placed in a storage pool when they are created or moved to a storage pool, based on a policy.

� Storage pool names must be unique within a file system, but not across file systems.

� Storage pool names are case sensitive.

� If the disk descriptor does not include a storage pool name, the disk is assigned to the system storage pool.

� There are two types of storage pools: the system storage pool and user storage pools.

The system storage pool:

� The system storage pool is created when the file system is created.

� There is only one system storage pool per file system.

� It contains file system metadata and metadata associated with regular files.

� Disks used for system storage pools must be extremely reliable and highly available in order to keep the file system online.

� If there is no policy installed, only the system storage pool is used.

� The system storage pools must be monitored for available space.

� Each file system’s metadata is stored in the system storage pool.

The user storage pools:

� There can be more than one user storage pool in a file system.

� All file user data is stored in an assigned user storage pool.

� The file’s data can be moved to a different storage pool based on a policy.


Storage pool commands and options

We briefly describe the storage pool commands next.

mmlsfs

This command displays file system attributes. The syntax is:

mmlsfs [-P]

–P displays storage pools that are defined within the file system.

mmdf

This command queries the available file space on a GPFS file system. The syntax is:

mmdf [-P poolName]

-P poolName lists only the disks that belong to the requested storage pool.

mmlsattr

This command queries file attributes. The syntax is:

mmlsattr [-L] FileName

Where:

� –L displays additional file attributes.
� FileName is the name of the file to be queried.

mmchattr

This command changes the replication attributes, storage pool assignment, and I/O caching policy for one or more GPFS files. The syntax is:

mmchattr [-P PoolName] [-I {yes|defer}] Filename

Where:

� –P PoolName changes the file’s assigned storage pool to the specified user pool name.

� -I {yes | defer} specifies whether migration between pools is to be performed immediately (-I yes) or deferred until a later call to mmrestripefs or mmrestripefile (-I defer). By deferring the updates for more than one file, the data movement can be done in parallel. The default is yes. (A short usage sketch follows this list.)

� Filename is the name of the file to be changed.
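For illustration only, the following commands move one of the datafiles that is used later in this chapter into user storage pool pool2 and defer the actual data movement; the file name is taken from our test database and is otherwise arbitrary:

root@alamo1:/oradata/ALAMO> mmchattr -P pool2 -I defer data2006q1.dbf
root@alamo1:/oradata/ALAMO> mmrestripefile -p data2006q1.dbf

Deferring the migration with -I defer and restriping afterward allows the data movement for many files to be batched and performed in parallel.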

mmrestripefs

This command rebalances or restores the replication factor of all files in a file system. The syntax is:

mmrestripefs {-p} [-P PoolName]

Where:

� -p indicates mmrestripefs will repair the file placement within the storage pool.

� -P PoolName indicates mmrestripefs will repair only files that are assigned to the specified storage pool.


mmrestripefile

This command performs a repair operation over the specified list of files. The syntax is:

mmrestripefile {-m|-r|-p|-b} {[ -F FilenameFile] | Filename[ Filename...]}

Where:

� Filename is the name of one or more files to be restriped.

� -m migrates all critical data off any suspended disk in this file system.

� -r migrates all data off of the suspended disks and restores all replicated files in the file system to their designated degree of replication.

� -p repairs the file placement within the storage pool.

� -b rebalances all files across all disks that are not suspended.

6.2.2 GPFS 3.1 filesets

Filesets are subtrees of the GPFS file system namespace. They are used to organize data in the file system. Filesets provide a way of partitioning the file system to allow system administrators to perform operations at the fileset level rather than across the entire file system. For example, a system administrator can set quota limits for a particular fileset or specify a fileset in a policy rule for data placement or migration.
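As a brief sketch (the fileset name archive_fs and the junction path /oradata/archive are hypothetical), a fileset is created and then linked into the file system namespace before it can be used in quota settings or policy rules:

root@alamo1:/> mmcrfileset oradata archive_fs
root@alamo1:/> mmlinkfileset oradata archive_fs -J /oradata/archive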

6.2.3 GPFS policies and rules

GPFS has a policy engine that allows you to create rules to determine initial file placement when a file is created and how it is managed through its life cycle until disposal. For example, rules can be used to place a file in a specific storage pool, migrate from one storage pool to another, or delete a file based on specific attributes, such as owner, file size, file name, the time it was last modified, and so forth:

� GPFS supports file placement policies and a file management policy.

� The file placement policy is used to store newly created files in a specific storage pool.

� The file management policy is used to manage files during their life cycle, migrate data to another storage pool, or delete files.

� If the GPFS file system does not have an installed policy established, all data is placed into the system storage pool.

� You can only have one installed placement policy in effect at a time.

� Any newly created files are placed according to the currently installed placement policy.

� The mmapplypolicy command is used to migrate, delete, or exclude data.

� A policy file is limited to a size of 1 MB.

� GPFS verifies the basic syntax of all rules in a policy file.

� If a rule in a policy refers to a storage pool that does not exist, GPFS returns an error and the policy will not be installed.

Policy commandsWe present the GPFS policy commands and options next.


mmchpolicy

This GPFS policy command establishes policy rules for a GPFS file system. The syntax is:

mmchpolicy Device PolicyFileName [-t DescriptiveName ][-I {yes | test} ]

Where:

� Device is the device name of the file system for which policy information is to be established or changed.

� PolicyFileName is the name of the file containing the policy rules.

� -t DescriptiveName is the optional descriptive name to be associated with the policy rules.

� -I {yes | test} specifies whether to activate the rules in the policy file PolicyFileName. yes means that the policy rules are validated and immediately activated, which is the default; test means that the policy rules are validated, but not installed. (A usage example follows this list.)
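For example, assuming the policy rules are stored in a hypothetical file named /tmp/placement.policy, you might validate the rules first and then install them:

root@alamo1:/> mmchpolicy oradata /tmp/placement.policy -I test
root@alamo1:/> mmchpolicy oradata /tmp/placement.policy -t "oradata placement rules"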

mmlspolicy

This GPFS policy command displays policy information for the file system. The syntax is:

mmlspolicy Device [-L]

Where:

� Device is the device name of the file system for which policy information is to be displayed.

� -L shows the entire original policy file.

mmapplypolicy

This GPFS policy command deletes files or migrates file data between storage pools in accordance with policy rules. The syntax is:

mmapplypolicy {Device|Directory} [-P PolicyFile] [-I {yes|defer|test}] [-L n ] [-D yyyy-mm-dd[@hh:mm[:ss]]] [-s WorkDirectory]

Where:

� Device is the device name of the file system from which files are to be deleted or migrated.

� -P PolicyFile is the name of the policy rules file.

� Directory is the fully qualified path name of a GPFS file system subtree from which files are to be deleted or migrated.

� -I {yes | defer | test} determines which actions the mmapplypolicy command performs on files. yes means that all applicable MIGRATE and DELETE policy rules are run, and the data movement between pools is done during the processing of the mmapplypolicy command; this is the default action. defer means that all applicable MIGRATE and DELETE policy rules are run, but actual data movement between pools is deferred until the next mmrestripefs or mmrestripefile command. test means that all policy rules are evaluated, but the mmapplypolicy command only displays the actions that are performed if -I defer or -I yes is specified.

� -L n controls the level of information that is displayed by the mmapplypolicy command.

� -D yyyy-mm-dd[@hh:mm[:ss]] specifies a date and optionally a Coordinated Universal Time (UTC) as year-month-day at hour:minute:second.

� -s WorkDirectory is the directory to be used for temporary storage during the mmapplypolicy command processing. The default directory is /tmp.


The file attributes in policy rules are:

� NAME specifies the name of the file.

� GROUP_ID specifies the numeric group ID.

� USER_ID specifies the numeric user ID.

� FILESET_NAME specifies the fileset where the file’s path name is located or is to be created.

� MODIFICATION_TIME specifies an SQL time stamp value for the date and time that the file was last modified.

� ACCESS_TIME specifies an SQL time stamp value for the date and time that the file was last accessed.

� PATH_NAME specifies the fully qualified path name.

� POOL_NAME specifies the current location of the file data.

� FILE_SIZE is the size or length of the file in bytes.

� KB_ALLOCATED specifies the number of kilobytes of disk space that are allocated for the file data.

� CURRENT_TIMESTAMP determines the current date and time on the GPFS server.

Policy rules syntax

File placement, migration, deletion, and exclusion have the formats that are presented in Example 6-16.

Example 6-16 GPFS policy rules syntax

RULE ['rule_name'] SET POOL 'pool_name'
     [ REPLICATE(data-replication) ]
     [ FOR FILESET( 'fileset_name1', 'fileset_name2', ... ) ]
     [ WHERE SQL_expression ]

RULE ['rule_name'] [ WHEN time-boolean-expression ]
     MIGRATE [ FROM POOL 'pool_name_from' [ THRESHOLD(high-%[,low-%]) ] ]
     [ WEIGHT(weight_expression) ]
     TO POOL 'pool_name'
     [ LIMIT(occupancy-%) ]
     [ REPLICATE(data-replication) ]
     [ FOR FILESET( 'fileset_name1', 'fileset_name2', ... ) ]
     [ WHERE SQL_expression ]

RULE ['rule_name'] [ WHEN time-boolean-expression ]
     DELETE [ FROM POOL 'pool_name_from' [ THRESHOLD(high-%[,low-%]) ] ]
     [ WEIGHT(weight_expression) ]
     [ FOR FILESET( 'fileset_name1', 'fileset_name2', ... ) ]
     [ WHERE SQL_expression ]

RULE ['rule_name'] [ WHEN time-boolean-expression ]
     EXCLUDE [ FROM POOL 'pool_name_from' ]
     [ FOR FILESET( 'fileset_name1', 'fileset_name2', ... ) ]
     [ WHERE SQL_expression ]


Consider the storage pool feature as a way to separate certain types of Oracle files across physical disks or storage LUNs. The most typical example is to locate the database online redo logs on separate volumes to achieve better performance and to avoid contention with other database files. By using GPFS storage pools, redo log files can be placed on different physical volumes while they are still on the same file system, which greatly simplifies administration.
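As a sketch of this approach (the pool name pool1 and the file-name pattern are assumptions for this illustration), a placement policy along the following lines directs newly created online redo log files to a dedicated pool and leaves everything else in the system pool:

RULE 'redo_logs' SET POOL 'pool1' WHERE lower(NAME) LIKE 'redo%.log'
RULE 'default' SET POOL 'system'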

In this chapter, we discuss storage pools and policy rules in the context of data partitioning in an Oracle database test environment.

6.2.4 Oracle data partitioning

Data partitioning enables splitting large amounts of data into smaller chunks. Tables, as well as indexes, can be partitioned. Although all partitions of a specific table or index share the same logical attributes (columns and constraint definitions), the physical attributes, such as the table space where a table or index is located, can differ. Partitioning is transparent to applications and Data Manipulation Language (DML) statements. In addition to table or index partitioning, you can divide each partition into subpartitions for a finer level of granularity.

Partitioning is beneficial for:

� Achieving better performance by ignoring partitions that do not have the requested (in the WHERE clause) rows of tables, thus, fewer blocks are scanned

� Achieving better performance with parallel query where multiple processes scan many partitions at the same time

� Enabling ILM by locating partitions with less frequent access and separate tablespaces (and datafiles) on less expensive storage

� Improving manageability because each partition can be backed up or recovered independently when the partitions reside on different tablespaces

Oracle Database 10g offers the following partitioning methods:

� range
� hash
� list
� composite range-hash
� composite range-list

The Oracle Database Administrator’s Guide 10g Release 2, B14231-01, provides a detailed explanation of all partitioning methods.

In this chapter, we used range partitioning, because it is the closest match to an ILM strategy and the easiest way to demonstrate the described features.

Range partitioning is useful when data has logical ranges into which it can be distributed. Accounting data is an example of this type of data. Every accounting operation has an assigned date (a time stamp). By splitting this data into partitions, each period can reside on a different tablespace. In addition, with GPFS 3.1 storage pools, each of those tablespaces can have a different storage pool assigned while being located on the same file system.

Note: For details about GPFS policy-based data management implementations (storage pools, filesets, policies, and rules), refer to the GPFS V3.1 Advanced Administration Guide, SC23-5182.


Using these mechanisms, administrators can shift noncurrent (less frequently accessed) data to less expensive disk devices, which is one of the goals of ILM. The current, frequently read and modified data stays on high performance storage. As time passes, new table partitions are created in the Oracle database for new (current) periods, and the datafiles containing older data are shifted to slower (less expensive) storage pools. Figure 6-2 illustrates the way that data partitioning works.

[Figure 6-2 shows an Oracle range-partitioned table with quarterly partitions (Q1 2006 through Q4 2007) mapped through GPFS storage pools (SYSTEM, POOL1, and POOL2) onto a physical storage layer of enterprise (highest performance), midrange, and entry-level disks.]

Figure 6-2 Oracle Database partitioning and GPFS storage pools idea

With these mechanisms, it is possible to achieve high performance while reducing the cost of hardware.

The following sections describe GPFS storage pools and Oracle Database partitioning working together.

6.2.5 Storage pools and Oracle data partitioning example

Next, we show you an example of creating a GPFS with storage pools.

Creating a GPFS with storage pools

You can create storage pools in two ways:

� At file system creation time, you can create storage pools by using the mmcrfs command.

� If the file system exists, you can add new disks to the file system by using the mmadddisk command and specifying a new storage pool.



We have created one file system named oradata with storage pools by using the mmcrfs command. Example 6-17 lists the cluster configuration.

Example 6-17 GPFS cluster configuration

root@alamo1:/tmp/mah> mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         alamo1_interconnect
  GPFS cluster id:           720967595342276183
  GPFS UID domain:           alamo1_interconnect
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    alamo1
  Secondary server:  alamo2

 Node  Daemon node name     IP address   Admin node name  Designation
------------------------------------------------------------------------------
   1   alamo1_interconnect  10.1.100.53  alamo1           quorum-manager
   2   alamo2_interconnect  10.1.100.54  alamo2           quorum-manager

Example 6-18 shows creating a working collective that includes GPFS nodes so that we can ssh/scp across the cluster nodes.

Example 6-18 Creating a working collective

root@alamo1:/tmp/mah> mmcommon getNodeList|awk '{print $3}' > gpfs.nodes
root@alamo1:/tmp/mah> cat gpfs.nodes
alamo1_interconnect
alamo2_interconnect
root@alamo1:/tmp/mah> export WCOLL=gpfs.nodes
root@alamo1:/tmp/mah> mmdsh date
alamo1_interconnect:  Fri Sep 21 08:43:19 CDT 2007
alamo2_interconnect:  Fri Sep 21 08:43:19 CDT 2007

We created a disk descriptor file named disks.desc (Example 6-19 on page 215). We used this file to create Network Shared Disks (NSDs).


Example 6-19 Disk descriptor file

root@alamo1:/tmp/mah> cat disks.desc
hdisk8:alamo1_interconnect:alamo2_interconnect
hdisk10:alamo1_interconnect:alamo2_interconnect
hdisk11:alamo1_interconnect:alamo2_interconnect
hdisk13:alamo1_interconnect:alamo2_interconnect
hdisk16:alamo1_interconnect:alamo2_interconnect
hdisk17:alamo1_interconnect:alamo2_interconnect

Example 6-20 lists currently configured NSDs.

Example 6-20 Currently configured NSDs

root@alamo1:/tmp/mah> mmlsnsd

 File system   Disk name    Primary node             Backup node
 ---------------------------------------------------------------------------
 orabin        gpfs3nsd     (directly attached)
 orabin        gpfs8nsd     (directly attached)
 (free disk)   gpfs1nsd     (directly attached)
 (free disk)   gpfs6nsd     (directly attached)
 (free disk)   gpfs9nsd     (directly attached)

root@alamo1:/tmp/mah> mmcrnsd -F disks.desc
mmcrnsd: Processing disk hdisk8
mmcrnsd: Processing disk hdisk10
mmcrnsd: Processing disk hdisk11
mmcrnsd: Processing disk hdisk13
mmcrnsd: Processing disk hdisk16
mmcrnsd: Processing disk hdisk17
mmcrnsd: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

Example 6-21 displays the newly created NSDs.

Example 6-21 Newly created NSDs

root@alamo1:/tmp/mah> mmlsnsd -X

 Disk name   NSD volume ID      Device         Devtype  Node name  Remarks
 ----------------------------------------------------------------------------
 gpfs14nsd   C0A8643546F3CD2C   /dev/hdisk8    hdisk    alamo1     primary node
 gpfs14nsd   C0A8643546F3CD2C   /dev/hdisk8    hdisk    alamo2     backup node
 gpfs15nsd   C0A8643546F3CD2F   /dev/hdisk10   hdisk    alamo1     primary node
 gpfs15nsd   C0A8643546F3CD2F   /dev/hdisk10   hdisk    alamo2     backup node
 gpfs16nsd   C0A8643546F3CD31   /dev/hdisk11   hdisk    alamo1     primary node
 gpfs16nsd   C0A8643546F3CD31   /dev/hdisk11   hdisk    alamo2     backup node
 gpfs17nsd   C0A8643546F3CD33   /dev/hdisk13   hdisk    alamo1     primary node
 gpfs17nsd   C0A8643546F3CD33   /dev/hdisk13   hdisk    alamo2     backup node
 gpfs18nsd   C0A8643546F3CD35   /dev/hdisk16   hdisk    alamo1     primary node
 gpfs18nsd   C0A8643546F3CD35   /dev/hdisk16   hdisk    alamo2     backup node
 gpfs19nsd   C0A8643546F3CD38   /dev/hdisk17   hdisk    alamo1     primary node
 gpfs19nsd   C0A8643546F3CD38   /dev/hdisk17   hdisk    alamo2     backup node
 gpfs1nsd    C0A8643546F1A7F4   /dev/hdisk7    hdisk    alamo1     directly attached
 gpfs3nsd    C0A8643546F1A7F6   /dev/hdisk9    hdisk    alamo1     directly attached
 gpfs6nsd    C0A8643546F1A7F9   /dev/hdisk12   hdisk    alamo1     directly attached
 gpfs8nsd    C0A8643546F1A7FB   /dev/hdisk14   hdisk    alamo1     directly attached
 gpfs9nsd    C0A8643546F1A7FC   /dev/hdisk15   hdisk    alamo1     directly attached

In the next step, we edited the disk descriptor file, disks.desc, as shown in Example 6-22.

Example 6-22 Disk descriptor file contents

root@alamo1:/tmp/mah> cat disks.desc
# hdisk8:alamo1_interconnect:alamo2_interconnect
gpfs14nsd:::dataAndMetadata:1::
# hdisk10:alamo1_interconnect:alamo2_interconnect
gpfs15nsd:::dataAndMetadata:1::
# hdisk11:alamo1_interconnect:alamo2_interconnect
gpfs16nsd:::dataOnly:2::pool1
# hdisk13:alamo1_interconnect:alamo2_interconnect
gpfs17nsd:::dataOnly:2::pool1
# hdisk16:alamo1_interconnect:alamo2_interconnect
gpfs18nsd:::dataOnly:3::pool2
# hdisk17:alamo1_interconnect:alamo2_interconnect
gpfs19nsd:::dataOnly:3::pool2

In this example:

� gpfs14nsd and gpfs15nsd will be used for the system storage pool.
� gpfs16nsd and gpfs17nsd are for user storage pool1.
� gpfs18nsd and gpfs19nsd are for user storage pool2.

In the next step, we created the /oradata file system by using the mmcrfs command. Example 6-23 shows the log of the detailed output.

Example 6-23 Creating the GPFS file system with storage pools

root@alamo1:/tmp/mah> mmcrfs /oradata oradata -F disks.desc

GPFS: 6027-531 The following disks of oradata will be formatted on node alamo2:
    gpfs14nsd: size 10485760 KB
    gpfs15nsd: size 10485760 KB
    gpfs16nsd: size 10485760 KB
    gpfs17nsd: size 10485760 KB
    gpfs18nsd: size 10485760 KB
    gpfs19nsd: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 24 GB can be added to storage pool 'system'.
GPFS: 6027-535 Disks up to size 24 GB can be added to storage pool 'pool1'.
GPFS: 6027-535 Disks up to size 24 GB can be added to storage pool 'pool2'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/oradata.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

root@alamo1:/tmp/mah> mmlsdisk oradata -L
disk       driver sector failure holds    holds         availa  disk  storage
name       type     size   group metadata data  status  bility    id  pool     remarks
--------- ------ ------ ------- -------- ----- ------ ------ ---- ------- -------
gpfs14nsd  nsd      512       1 yes      yes   ready  up        1  system   desc
gpfs15nsd  nsd      512       1 yes      yes   ready  up        2  system
gpfs16nsd  nsd      512       2 no       yes   ready  up        3  pool1    desc
gpfs17nsd  nsd      512       2 no       yes   ready  up        4  pool1
gpfs18nsd  nsd      512       3 no       yes   ready  up        5  pool2    desc
gpfs19nsd  nsd      512       3 no       yes   ready  up        6  pool2
Number of quorum disks: 3
Read quorum value:      2
Write quorum value:     2

All of the disks are up, and the file system is ready. The file system was mounted by using the mmdsh mount /oradata command.

We installed an Oracle Clusterware and the Oracle database code files in another file system, /orabin, which is also shared between cluster nodes. We created a sample database to demonstrate partitioning, and we located the database files on the /oradata/ALAMO directory, where ALAMO is the database name.

Creating partitioned objects within Oracle database

We created a sample cluster database and named it ALAMO. We placed all of the database files on the same shared GPFS file system that is mounted in the /oradata directory.

To make use of storage pools, we had to create all of the tablespaces necessary for the table and index partitions first. In our scenario, we assume that both the data and index segments are located in the same tablespace. Example 6-24 shows the tablespace creation process.

Example 6-24 Creating tablespaces

SQL> create tablespace data2006q1 logging datafile '/oradata/ALAMO/data2006q1.dbf' size 1024M reuse extent management local segment space management auto;

Tablespace created.

SQL> create tablespace data2006q2 logging datafile '/oradata/ALAMO/data2006q2.dbf' size 1024M reuse extent management local segment space management auto;

Tablespace created.

SQL> create tablespace data2006q3 logging datafile '/oradata/ALAMO/data2006q3.dbf' size 1024M reuse extent management local segment space management auto;

Tablespace created.

SQL> create tablespace data2006q4 logging datafile '/oradata/ALAMO/data2006q4.dbf' size 1024M reuse extent management local segment space management auto;

Tablespace created.

SQL> create tablespace data2007q1 logging datafile '/oradata/ALAMO/data2007q1.dbf' size 1024M reuse extent management local segment space management auto;

Tablespace created.


SQL> create tablespace data2007q2 logging datafile '/oradata/ALAMO/data2007q2.dbf' size 1024M reuse extent management local segment space management auto;

Tablespace created.

SQL> create tablespace data2007q3 logging datafile '/oradata/ALAMO/data2007q3.dbf' size 1024M reuse extent management local segment space management auto;

Tablespace created.

SQL> create tablespace data2007q4 logging datafile '/oradata/ALAMO/data2007q4.dbf' size 1024M reuse extent management local segment space management auto;

Tablespace created.

We created a sample table, TRANSACTIONS, and within the table, we defined several partitions by using the range partitioning key.

The example below (Example 6-25) creates the TRANSACTIONS table with eight partitions. Each table partition corresponds to one quarter of the year, and the corresponding data is stored in separate tablespaces. Partition TRANS2007Q1 will contain the transactions for only the first quarter of year 2007.

Then, we created a range-partitioned global index on the TRANSACTIONS table. Each index partition is stored in a different tablespace, in the same manner as the table partitions.

Example 6-25 Creating partitioned table and global index

SQL> create table TRANSACTIONS (trans_number NUMBER NOT NULL, trans_date DATE NOT NULL, trans_type NUMBER NOT NULL, trans_status NUMBER NOT NULL)
  2  partition by range (trans_date)
  3  (
  4  partition TRANS2006Q1 values less than (to_date('2006-04-01','YYYY-MM-DD')) tablespace data2006q1,
  5  partition TRANS2006Q2 values less than (to_date('2006-07-01','YYYY-MM-DD')) tablespace data2006q2,
  6  partition TRANS2006Q3 values less than (to_date('2006-10-01','YYYY-MM-DD')) tablespace data2006q3,
  7  partition TRANS2006Q4 values less than (to_date('2007-01-01','YYYY-MM-DD')) tablespace data2006q4,
  8  partition TRANS2007Q1 values less than (to_date('2007-04-01','YYYY-MM-DD')) tablespace data2007q1,
  9  partition TRANS2007Q2 values less than (to_date('2007-07-01','YYYY-MM-DD')) tablespace data2007q2,
 10  partition TRANS2007Q3 values less than (to_date('2007-10-01','YYYY-MM-DD')) tablespace data2007q3,
 11  partition TRANS2007Q4 values less than (to_date('2008-01-01','YYYY-MM-DD')) tablespace data2007q4
 12  ) enable row movement;

Table created.

SQL> desc transactions;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 TRANS_NUMBER                              NOT NULL NUMBER
 TRANS_DATE                                NOT NULL DATE
 TRANS_TYPE                                NOT NULL NUMBER
 TRANS_STATUS                              NOT NULL NUMBER

SQL> create index TRANSACTIONS_DATE on TRANSACTIONS (trans_date)
  2  global partition by range (trans_date)
  3  (
  4  partition TRANS2006Q1 values less than (to_date('2006-04-01','YYYY-MM-DD')) tablespace data2006q1,
  5  partition TRANS2006Q2 values less than (to_date('2006-07-01','YYYY-MM-DD')) tablespace data2006q2,
  6  partition TRANS2006Q3 values less than (to_date('2006-10-01','YYYY-MM-DD')) tablespace data2006q3,
  7  partition TRANS2006Q4 values less than (to_date('2007-01-01','YYYY-MM-DD')) tablespace data2006q4,
  8  partition TRANS2007Q1 values less than (to_date('2007-04-01','YYYY-MM-DD')) tablespace data2007q1,
  9  partition TRANS2007Q2 values less than (to_date('2007-07-01','YYYY-MM-DD')) tablespace data2007q2,
 10  partition TRANS2007Q3 values less than (to_date('2007-10-01','YYYY-MM-DD')) tablespace data2007q3,
 11  partition TRANS2007Q4 values less than (MAXVALUE) tablespace data2007q4
 12  );

Index created.

In this table, a sample row with transaction date=2007-09-19 is stored in partition TRANS2007Q3. We specified the ENABLE ROW MOVEMENT clause to allow the migration of a row to a new partition if an update to a key value is made that places the row in a different partition.
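The following SQL*Plus sketch, with arbitrary sample values, shows this behavior on the TRANSACTIONS table: the inserted row lands in partition TRANS2007Q3, and the later update of the partitioning key moves the row to partition TRANS2007Q4 because row movement is enabled:

SQL> insert into transactions values (1, to_date('2007-09-19','YYYY-MM-DD'), 1, 0);
SQL> select count(*) from transactions partition (TRANS2007Q3);
SQL> update transactions set trans_date = to_date('2007-11-02','YYYY-MM-DD') where trans_number = 1;
SQL> commit;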

Assigning storage pools

All of the database files were stored in the system storage pool, because it is the default storage pool. The user storage pools (pool1 and pool2) are empty, as shown in Example 6-26.

Example 6-26 Storage pools’ free space

root@alamo1:/tmp/mah> mmdf oradata
disk                disk size  failure holds    holds           free KB        free KB
name                    in KB    group metadata data     in full blocks   in fragments
--------------- ------------- ------- -------- ----- --------------- ------------
Disks in storage pool: system
gpfs14nsd            10485760       1 yes      yes     5229312 ( 50%)    6160 ( 0%)
gpfs15nsd            10485760       1 yes      yes     5229568 ( 50%)    6424 ( 0%)
                -------------                       --------------- ------------
(pool total)         20971520                        10458880 ( 50%)   12584 ( 0%)

Disks in storage pool: pool1
gpfs16nsd            10485760       2 no       yes    10483456 (100%)     248 ( 0%)
gpfs17nsd            10485760       2 no       yes    10483456 (100%)     248 ( 0%)
                -------------                       --------------- ------------
(pool total)         20971520                        20966912 (100%)     496 ( 0%)

Disks in storage pool: pool2
gpfs18nsd            10485760       3 no       yes    10483456 (100%)     248 ( 0%)
gpfs19nsd            10485760       3 no       yes    10483456 (100%)     248 ( 0%)
                -------------                       --------------- ------------
(pool total)         20971520                        20966912 (100%)     496 ( 0%)

                =============                       =============== ============
(data)               62914560                        52392704 ( 83%)   13576 ( 0%)
(metadata)           20971520                        10458880 ( 50%)   12584 ( 0%)
                =============                       =============== ============
(total)              62914560                        52392704 ( 83%)   13576 ( 0%)

Inode Information
-----------------
Number of used inodes:            4062
Number of free inodes:           58402
Number of allocated inodes:      62464
Maximum number of inodes:        62464

root@alamo1:/oradata/ALAMO> ls -ltr data*.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:17 data2007q4.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2007q1.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2006q4.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2006q3.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2006q2.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2006q1.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2007q3.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2007q2.dbf

We decided to migrate all 2006 data to the storage pool pool2 and Q1 and Q2 2007 data to pool1. We created and tested the GPFS policy file, as shown in Example 6-27.

Example 6-27 Creating GPFS policy

root@alamo1:/tmp/mah> cat gpfs.policy
RULE 'migrate_oldest' MIGRATE FROM POOL 'system' TO POOL 'pool2' WHERE lower(NAME) LIKE '%.dbf' AND LOWER(SUBSTR(NAME,1,8))='data2006'
RULE 'migrate_old_q1' MIGRATE FROM POOL 'system' TO POOL 'pool1' WHERE lower(NAME) LIKE '%.dbf' AND LOWER(SUBSTR(NAME,1,10))='data2007q1'
RULE 'migrate_old_q2' MIGRATE FROM POOL 'system' TO POOL 'pool1' WHERE lower(NAME) LIKE '%.dbf' AND LOWER(SUBSTR(NAME,1,10))='data2007q2'
RULE 'DEFAULT' SET POOL 'system'

root@alamo1:/oradata/ALAMO> mmapplypolicy /dev/oradata -P /tmp/mah/gpfs.policy -I test
GPFS Current Data Pool Utilization in KB and %
pool1        4608      20971520   0.021973%
pool2        4608      20971520   0.021973%
system   10512640      20971520  50.128174%
Loaded policy rules from /tmp/mah/gpfs.policy.
Evaluating MIGRATE/DELETE/EXCLUDE rules with CURRENT_TIMESTAMP = 2007-09-21@14:32:39 UTC
parsed 1 Placement Rules, 3 Migrate/Delete/Exclude Rules
Directories scan: 23 files, 2 directories, 0 other objects, 0 'skipped' files and/or errors.
Inodes scan: 23 files, 0 'skipped' files and/or errors.
Summary of Rule Applicability and File Choices:
 Rule#  Hit_Cnt  Chosen  KB_Chosen  KB_Ill  Rule
     0        4       4    4194336       0  RULE 'migrate_oldest' MIGRATE FROM POOL 'system' TO POOL 'pool2' WHERE(.)
     1        1       1    1048584       0  RULE 'migrate_old_q1' MIGRATE FROM POOL 'system' TO POOL 'pool1' WHERE(.)
     2        1       1    1048584       0  RULE 'migrate_old_q2' MIGRATE FROM POOL 'system' TO POOL 'pool1' WHERE(.)
GPFS Policy Decisions and File Choice Totals:
 Chose to migrate 6291504KB: 6 of 6 candidates;
 Chose to delete 0KB: 0 of 0 candidates;
 0KB of chosen data is illplaced or illreplicated;
Predicted Data Pool Utilization in KB and %:
pool1     2101776      20971520  10.022049%
pool2     4198944      20971520  20.022125%
system    4221136      20971520  20.127945%

The tested GPFS policy file was executed and the datafiles were moved according to defined rules. Example 6-28 shows the output of the mmapplypolicy command.

Example 6-28 Applying the policy

root@alamo1:/oradata/ALAMO> mmapplypolicy /dev/oradata -P /tmp/mah/gpfs.policy -I yes
GPFS Current Data Pool Utilization in KB and %
pool1        4608      20971520   0.021973%
pool2        4608      20971520   0.021973%
system   10512640      20971520  50.128174%
Loaded policy rules from /tmp/mah/gpfs.policy.
Evaluating MIGRATE/DELETE/EXCLUDE rules with CURRENT_TIMESTAMP = 2007-09-21@14:33:04 UTC
parsed 1 Placement Rules, 3 Migrate/Delete/Exclude Rules
Directories scan: 23 files, 2 directories, 0 other objects, 0 'skipped' files and/or errors.
Inodes scan: 23 files, 0 'skipped' files and/or errors.
Summary of Rule Applicability and File Choices:
 Rule#  Hit_Cnt  Chosen  KB_Chosen  KB_Ill  Rule
     0        4       4    4194336       0  RULE 'migrate_oldest' MIGRATE FROM POOL 'system' TO POOL 'pool2' WHERE(.)
     1        1       1    1048584       0  RULE 'migrate_old_q1' MIGRATE FROM POOL 'system' TO POOL 'pool1' WHERE(.)
     2        1       1    1048584       0  RULE 'migrate_old_q2' MIGRATE FROM POOL 'system' TO POOL 'pool1' WHERE(.)
GPFS Policy Decisions and File Choice Totals:
 Chose to migrate 6291504KB: 6 of 6 candidates;
 Chose to delete 0KB: 0 of 0 candidates;
 0KB of chosen data is illplaced or illreplicated;
Predicted Data Pool Utilization in KB and %:
pool1     2101776      20971520  10.022049%
pool2     4198944      20971520  20.022125%
system    4221136      20971520  20.127945%
A total of 6 files have been migrated and/or deleted; 0 'skipped' files and/or errors.


After the data files were migrated to their assigned storage pools, the file system space looked similar to Example 6-29. The migration process is transparent to users and applications, and in this example, it was fully transparent to the database.

Example 6-29 Storage pools’ free space after applying the policy

root@alamo1:/oradata/ALAMO> mmdf oradata
disk                disk size  failure holds    holds           free KB        free KB
name                    in KB    group metadata data     in full blocks   in fragments
--------------- ------------- ------- -------- ----- --------------- ------------
Disks in storage pool: system
gpfs14nsd            10485760       1 yes      yes     8342784 ( 80%)    6184 ( 0%)
gpfs15nsd            10485760       1 yes      yes     8343552 ( 80%)    6448 ( 0%)
                -------------                       --------------- ------------
(pool total)         20971520                        16686336 ( 80%)   12632 ( 0%)

Disks in storage pool: pool1
gpfs16nsd            10485760       2 no       yes     9435136 ( 90%)     488 ( 0%)
gpfs17nsd            10485760       2 no       yes     9434368 ( 90%)     248 ( 0%)
                -------------                       --------------- ------------
(pool total)         20971520                        18869504 ( 90%)     736 ( 0%)

Disks in storage pool: pool2
gpfs18nsd            10485760       3 no       yes     8386816 ( 80%)     496 ( 0%)
gpfs19nsd            10485760       3 no       yes     8385280 ( 80%)     480 ( 0%)
                -------------                       --------------- ------------
(pool total)         20971520                        16772096 ( 80%)     976 ( 0%)

                =============                       =============== ============
(data)               62914560                        52327936 ( 83%)   14344 ( 0%)
(metadata)           20971520                        16686336 ( 80%)   12632 ( 0%)
                =============                       =============== ============
(total)              62914560                        52327936 ( 83%)   14344 ( 0%)

Inode Information
-----------------
Number of used inodes:            4062
Number of free inodes:           58402
Number of allocated inodes:      62464
Maximum number of inodes:        62464

root@alamo1:/oradata/ALAMO> ls -ltr data*.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:17 data2007q4.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2007q1.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2006q4.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2006q3.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2006q2.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2006q1.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2007q3.dbf
-rw-r-----   1 oracle   dba      1073750016 Sep 21 09:29 data2007q2.dbf

As seen in Example 6-29, the user storage pools pool1 and pool2 are now in use.


Management of partitions and storage pools

Whenever it is necessary to create new partitions, follow the process demonstrated in Example 6-30. Example 6-30 shows how to create a tablespace for new object partitions and how to create additional partitions within the partitioned table and index.

Example 6-30 Creating space for new data

SQL> create tablespace data2008q1 logging datafile '/oradata/ALAMO/data2008q1.dbf' size 1024M reuse extent management local segment space management auto;

Tablespace created.

SQL> alter table transactions add partition trans2008q1 values less than (to_date('2008-04-01','YYYY-MM-DD')) tablespace data2008q1;

Table altered.

SQL> alter index transactions_date split partition trans2007q4 at (to_date('2008-01-01','YYYY-MM-DD')) into
  2> (
  3> partition trans2007q4 tablespace data2007q4,
  4> partition trans2008q1 tablespace data2008q1
  5> );

Index altered.

After this operation, you can shift older partitions to other GPFS storage pools.
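For example, when 2008 data becomes current, a migration rule in the style of Example 6-27 can age the 2007 datafiles out of the system pool. The rule and the policy file name /tmp/age2007.policy below are assumptions for this sketch; test the policy before applying it:

RULE 'age_2007' MIGRATE FROM POOL 'system' TO POOL 'pool1' WHERE lower(NAME) LIKE 'data2007%.dbf'

root@alamo1:/> mmapplypolicy /dev/oradata -P /tmp/age2007.policy -I test
root@alamo1:/> mmapplypolicy /dev/oradata -P /tmp/age2007.policy -I yes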

We list several examples of useful GPFS commands in Example 6-31.

Example 6-31 Useful GPFS commands that are related to storage pools

root@alamo1:/oradata/ALAMO> mmlsattr -L data2006q3.dbf
file name:            data2006q3.dbf
metadata replication: 1 max 1
data replication:     1 max 1
flags:
storage pool name:    pool2
fileset name:         root
snapshot name:

root@alamo1:/oradata/ALAMO> mmlsattr -L data2007q1.dbf
file name:            data2007q1.dbf
metadata replication: 1 max 1
data replication:     1 max 1
flags:
storage pool name:    pool1
fileset name:         root
snapshot name:

root@alamo1:/oradata/ALAMO> mmlsattr -L data2007q3.dbf
file name:            data2007q3.dbf
metadata replication: 1 max 1
data replication:     1 max 1
flags:
storage pool name:    system
fileset name:         root
snapshot name:

root@alamo1:/oradata/ALAMO> mmdf oradata -P system
disk                disk size  failure holds    holds           free KB        free KB
name                    in KB    group metadata data     in full blocks   in fragments
--------------- ------------- -------- -------- ----- --------------- ------------
Disks in storage pool: system
gpfs14nsd            10485760        1 yes      yes    8374528 ( 80%)    6184 ( 0%)
gpfs15nsd            10485760        1 yes      yes    8375808 ( 80%)    6448 ( 0%)
                -------------                        --------------- ------------
(pool total)         20971520                         16750336 ( 80%)   12632 ( 0%)

root@alamo1:/oradata/ALAMO> mmdf oradata -P pool1
disk                disk size  failure holds    holds           free KB        free KB
name                    in KB    group metadata data     in full blocks   in fragments
--------------- ------------- -------- -------- ----- --------------- ------------
Disks in storage pool: pool1
gpfs16nsd            10485760        2 no       yes    9435136 ( 90%)     488 ( 0%)
gpfs17nsd            10485760        2 no       yes    9434368 ( 90%)     248 ( 0%)
                -------------                        --------------- ------------
(pool total)         20971520                         18869504 ( 90%)     736 ( 0%)

root@alamo1:/oradata/ALAMO> mmrestripefs oradata -b -P pool1
Scanning pool1 storage pool
GPFS: 6027-565 Scanning user file metadata ...
GPFS: 6027-552 Scan completed successfully.

In Example 6-31 on page 223, the mmlsattr outputs show that each file is in its assigned storage pool.

Note: For specific command syntax, refer to the section titled “GPFS commands” in GPFS V3.1 Administration and Programming Reference, SA23-2221.


Part 4 Virtualization scenarios

Part 4 contains information and examples about how you can use the IBM System p virtualization features. Although virtual resources do not always provide the performance that is required for a high volume production database, they are still useful for testing various configurations and scenarios, which is known as proof of concept (POC).

The flexibility of the IBM System p virtualization features provides a cost-effective solution for quick deployment of test environments to validate solutions before they are put into production.


Chapter 7. Highly available virtualized environments

IBM Virtualization on System p servers provides a flexible, cost-effective way to deploy complex IT environments with good resource separation and utilization and excellent management capabilities. However, the virtual I/O (VIO) server partition is considered a single point of failure, and virtualization is therefore considered by many administrators to be less highly available than other solutions.

For example, in the case of loss or temporary unavailability of the VIO server partition, all resources associated with this partition are unavailable, which causes the outage of all associated client partitions. Because Oracle RAC is designed to be highly available, architects and administrators do not want to compromise this configuration by introducing the VIO server as a single point of failure. In this chapter, we demonstrate that an Oracle RAC solution can be deployed with good availability in a System p environment utilizing virtual resources.

With careful design and planning and relying on the IBM System p exceptional virtualization capabilities, you can achieve high redundancy in the virtualized System p environment. For example, by using two VIO server partitions per server and the proper configuration of virtual devices (Multi-Path I/O (MPIO), Logical Volume Manager (LVM) mirroring, EtherChannel, and so on), you can mask the failure of hardware resources and even an entire VIO server.

These configurations allow you to shut down one of the VIO servers for maintenance purposes, software upgrade, or reconfiguration while the other VIO server provides network and disk connectivity for client partitions. These configurations are redundant, so in the case of the failure of one VIO server, network, or Fibre Channel (FC) adapter, there is no loss of service.

Disclaimer: The configuration examples using System p virtual resources (VIO Server, virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently NOT supported in all configurations. As of the release date of this book, virtual SCSI disks are only supported using Oracle's ASM, but not with GPFS. For the current IBM/Oracle cross-certification status, check the following URL:

http://www.oracle.com/technology/support/metalink/index.html

In this chapter, we demonstrate how to set up a dual VIO server configuration that will provide high availability for Oracle RAC. Remember that for a highly available Oracle RAC environment, you need to build two similar hardware configurations (two systems each with two VIO servers). Although you can install and run Oracle RAC on two logical partitions (LPARs) of the same server, we do not recommend that you use the same server, because the server itself represents a single point of failure.

This chapter provides the necessary information to create a resilient architecture for a two-node Oracle RAC cluster. We discuss the following topics:

� Configuration of the network and shared Ethernet adapters

� Storage configuration with MPIO

� Considerations when using System p virtualization with production RAC databases

We do not describe the installation and configuration of Oracle RAC here, because it is the same as installing RAC on two physical servers, which we have already described in this book.


7.1 Virtual networking environment

To achieve high availability for an Oracle RAC instance with Virtual Ethernet Adapters and a Shared Ethernet Adapter (SEA), you need to configure two VIO servers so that in case one VIO server fails, the other VIO server will handle the network traffic.

There are two common ways to provide high availability for a virtualized network:

� SEA failover

� A link aggregation adapter with one primary adapter and one backup adapter, known as a Network Interface Backup (NIB)

SEA is implemented at the VIO server level. When dealing with several client partitions running within the same system, SEA is configured only one time for the entire System p server and provides highly available network connectivity to every partition that utilizes virtual Ethernet.

When using SEA, a failover to a second VIO server can take as long as 30 seconds in case of an adapter failure, which can cause problems when SEA is used for Oracle RAC interconnect. Timeouts might be long enough to cause a “split brain” resolution in Oracle Clusterware and evict nodes from the cluster. Of course, you can still use SEA for administrative and Virtual IP address (VIP) networks and dedicate physical Ethernet adapters for interconnect. Mixing physical and virtual adapters is allowed and fully supported in System p virtualization.
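If you do choose SEA failover for the administrative and VIP networks, it is configured entirely on the two VIO servers. The following command is only a sketch under assumed adapter names (ent0 as the physical adapter, ent2 as the bridged virtual trunk adapter, and ent3 as a dedicated control-channel virtual adapter on its own VLAN); it is not the configuration that we used in our test environment:

$ mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1 -attr ha_mode=auto ctl_chan=ent3

The trunk adapter priorities defined on the HMC determine which SEA is primary; if that VIO server becomes unavailable, the SEA on the second VIO server takes over the bridging.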

The second possibility, a NIB, is implemented on every client partition and does not rely on the SEA failover mechanism. A NIB is implemented the same way as an EtherChannel adapter with a single primary adapter and a backup adapter.

In Figure 7-1 on page 230, the client uses two virtual Ethernet adapters to create an EtherChannel adapter (en3) that consists of one primary adapter (en1) and one backup adapter (en2). If the primary adapter becomes unavailable due to VIO server unavailability or the corresponding physical Ethernet adapter failure on a VIO server partition, the NIB switches to the backup adapter and routes the traffic through the second VIO server. This configuration allows and supports a total of two virtual adapters: one active virtual adapter and one standby virtual adapter.


Figure 7-1 Network configuration with redundant VIO servers

In this scenario, because there is no hardware link failure for virtual Ethernet adapters to trigger a failover to the other adapter, it is mandatory to use the ping-to-address feature of EtherChannel to detect network failures. When configuring virtual adapters for NIB, the two internal networks must be separated in the hypervisor layer by assigning two different PVIDs.

Virtual Ethernet uses the system processors for all communication functions instead of offloading the load to processors on network adapter cards. As a result, there is an increase in the system processor load that is generated by the virtual Ethernet traffic. This might be a good reason to consider using physical adapters for Oracle RAC interconnect.

The connection to the client partition that is shown in Figure 7-1 is still available in case of:

� A switch failure
� Failure of any Ethernet link
� Failure of the physical Ethernet adapter on the VIO server
� Virtual I/O server failure or maintenance

Note: There is a common behavior with both SEA failover and NIB: They do not check the reachability of the specified IP address through the backup-path as long as the primary path is active. They do not check, because the virtual Ethernet adapter is always connected, and there is no linkup event, such as there is with physical adapters. You do not know if you really have an operational backup until your primary path fails.


For detailed information about setting up Virtualization on IBM System p Servers, refer to Advanced POWER Virtualization on IBM System p5: Introduction and Configuration, SG24-7940.

7.1.1 Configuring EtherChannel with NIB

In this example, we set up an EtherChannel with NIB on the AIX partition. The configuration is consistent with the configuration that we presented in Figure 7-1 on page 230. We used these network interfaces to configure NIB:

� ent1: The virtual interface for the primary adapter, which goes through VIOS1
� ent2: The virtual interface for the backup adapter, which goes through VIOS2

Assuming both virtual interfaces are visible on the AIX partition, the easiest way to configure NIB is by using SMIT. Follow these steps to create an ent3 adapter, which will be the aggregated adapter with the ent1 and ent2 adapters:

1. Use the following SMIT fastpath: smitty etherchannel.

2. Select Add An EtherChannel / Link Aggregation.

3. From the list, choose a primary adapter for NIB, in this case, ent1.

4. The window in Example 7-1 appears.

Example 7-1 Configuring the NIB SMIT window

Add An EtherChannel / Link Aggregation

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                                        [Entry Fields]
  EtherChannel / Link Aggregation Adapters              ent1              +
  Enable Alternate Address                              no                +
  Alternate Address                                     []                +
  Enable Gigabit Ethernet Jumbo Frames                  no                +
  Mode                                                  standard          +
  Hash Mode                                             default           +
  Backup Adapter                                        ent2              +
  Automatically Recover to Main Channel                 yes               +
  Perform Lossless Failover After Ping Failure          yes               +
  Internet Address to Ping                              [10.10.10.1]
  Number of Retries                                     [2]               +#
  Retry Timeout (sec)                                   [1]               +#

F1=Help             F2=Refresh          F3=Cancel           F4=List
F5=Reset            F6=Command          F7=Edit             F8=Image
F9=Shell            F10=Exit            Enter=Do

5. Choose ent2 as the backup adapter.

6. Configure the IP address that will be pinged to determine whether the connection through the VIO server that bridges the primary virtual Ethernet adapter (ent1) is down. Make sure that the address exists and belongs to the same network that was created for the NIB. In this example, we assume that 10.10.10.1 is such an address; in this case, it is the address of a network gateway.


7. Minimize the number of retries and the duration of retry timeouts, as indicated in Example 7-1 on page 231, which shortens the time that is necessary to fail over to the backup adapter and VIO server.

The next step is to assign an IP address for the newly created NIB. In our test scenario, we configure it with address 10.10.10.2 (see Example 7-2).

Example 7-2 Network interface IP addresses on AIX partition

root@texas:/> ifconfig -a
en0: flags=1e080863,480<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),CHAIN>
        inet 192.168.100.58 netmask 0xffffff00 broadcast 192.168.100.255
        tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
en3: flags=1e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),CHAIN>
        inet 10.10.10.2 netmask 0xffffff00 broadcast 10.10.10.255
        tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
        inet6 ::1/0
        tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

After completing this part, a Network Interface Backup is configured and ready to use.
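The same NIB can also be created from the command line instead of SMIT. The following sketch assumes the adapter and address values used above (ent1 as the primary, ent2 as the backup, 10.10.10.1 as the address to ping, and 10.10.10.2/24 on the resulting en3 interface); adapt it to your environment:

# Create the EtherChannel (NIB) device
mkdev -c adapter -s pseudo -t ibm_ech -a adapter_names=ent1 -a backup_adapter=ent2 \
      -a netaddr=10.10.10.1 -a num_retries=2 -a retry_time=1

# Configure IP on the resulting interface (en3 in our case)
chdev -l en3 -a netaddr=10.10.10.2 -a netmask=255.255.255.0 -a state=up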

7.1.2 Testing NIB failover

For testing purposes, we initiate a ping command from the gateway server (address 10.10.10.1) to the AIX server (10.10.10.2, which is the address on the NIB adapter). At the same time, we initiate an FTP transfer, so that the link is heavily utilized and we can see whether the transfer survives the VIO server failure.

During the transfer, the first VIO server partition (which was handling the network traffic) is shut down with the Hardware Management Console (HMC).

We observed on the gateway machine:

� The FTP transfer stops for about two seconds and continues without any error messages.

� The ping command loses two packets, as shown in Example 7-3.


Example 7-3 Testing the ping command on the NIB interface

root@nim8810:/> ping 10.10.10.2
PING 10.10.10.2: (10.10.10.2): 56 data bytes
64 bytes from 10.10.10.2: icmp_seq=0 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=1 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=2 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=3 ttl=255 time=3 ms
64 bytes from 10.10.10.2: icmp_seq=4 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=5 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=6 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=7 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=8 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=9 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=10 ttl=255 time=5 ms
64 bytes from 10.10.10.2: icmp_seq=11 ttl=255 time=2 ms
64 bytes from 10.10.10.2: icmp_seq=14 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=15 ttl=255 time=0 ms
64 bytes from 10.10.10.2: icmp_seq=16 ttl=255 time=3 ms

After the FTP transfer completes, we verify the integrity of the transferred file on the AIX partition and see no loss of data.

At the same time, during the VIO server failure, AIX detected the failure of the primary interface that was used for the network interface backup and switched to the backup interface. Example 7-4 shows the output from the errpt command that indicates the failure of the network interface backup.

Example 7-4 Output from errpt command on AIX partition

root@texas:/> errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
9F7B0FA6   1004151707 I H ent3           PING TO REMOTE HOST FAILED

root@texas:/> errpt -a
---------------------------------------------------------------------------
LABEL:          ECH_PING_FAIL_PRMRY
IDENTIFIER:     9F7B0FA6

Date/Time:       Thu Oct  4 15:17:12 CDT 2007
Sequence Number: 222
Machine Id:      00C7CD9E4C00
Node Id:         texas
Class:           H
Type:            INFO
Resource Name:   ent3
Resource Class:  adapter
Resource Type:   ibm_ech
Location:

Description
PING TO REMOTE HOST FAILED

Probable Causes
CABLE
SWITCH
ADAPTER


Failure Causes
CABLES AND CONNECTIONS

Recommended Actions
        CHECK CABLE AND ITS CONNECTIONS
        IF ERROR PERSISTS, REPLACE ADAPTER CARD.

Detail Data
FAILING ADAPTER
PRIMARY
SWITCHING TO ADAPTER
ent2
Unable to reach remote host through primary adapter: switching over to backup adapter

7.2 Disk configuration

In this section, we propose an architecture that uses SAN disks with virtualization, which allows good I/O performance and also high availability. You can use this infrastructure for a production environment as well.

For deploying test or POC environments, you can deploy a configuration that is not based on a SAN. In fact, a carefully designed VIO server environment can simulate a virtual SAN. Refer to Chapter 8, “Deploying test environments using virtualized SAN” on page 241 for more details.

7.2.1 External storage LUNs for Oracle 10g RAC data files

Due to the characteristics of the VIO server implementation, you must configure concurrent access to the same storage devices from two or more client partitions considering the following aspects.

No reserve
To access the same set of LUNs on external storage from two VIO servers at the same time, the Small Computer System Interface (SCSI) disk reservation has to be disabled. Failing to disable the reservation prevents one of the VIO servers from accessing the disk. This configuration has to be enforced, and it does not depend on your choice to use GPFS, ASM, or direct hdisks for voting or OCR disks.

You must set this no reserve policy on both VIO servers and on all RAC nodes.

Make sure that the reserve policy is set to no_reserve, as shown in Example 7-5. Note that the default value is single_path.


Example 7-5 The reserve_policy must be set to no_reserve

root@alamo1:/> lsattr -El hdisk7
PCM             PCM/friend/vscsi Path Control Module        False
algorithm       fail_over        Algorithm                  True
hcheck_cmd      test_unit_rdy    Health Check Command       True
hcheck_interval 0                Health Check Interval      True
hcheck_mode     nonactive        Health Check Mode          True
max_transfer    0x40000          Maximum TRANSFER Size      True
pvid            none             Physical volume identifier False
queue_depth     3                Queue DEPTH                True
reserve_policy  no_reserve       Reserve Policy             True
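If the attribute is not already set, it can be changed with chdev. The following is only a sketch (the hdisk numbers are examples, and on the VIO server the exact attribute name can depend on the multipathing driver used for your storage; some drivers, for example RDAC, use reserve_lock=no instead):

# On each RAC node (AIX client partition)
chdev -l hdisk7 -a reserve_policy=no_reserve

# On each VIO server (padmin restricted shell), for the backing device
# (-perm only updates the ODM; the device must be reconfigured for it to take effect)
chdev -dev hdisk10 -attr reserve_policy=no_reserve -perm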

Whole disk VIO server mapping for data disks
One of the features of IBM Advanced POWER Virtualization is that you can assign a client partition either a logical volume (part of a volume group) or an entire disk (logical unit number (LUN)). This is a great advantage if you do not have many internal disks attached to your VIO server, and you want to share existing disks to define multiple virtual disks for the AIX partitions' rootvg usage. A logical volume at the VIO server level becomes a virtual hdisk at the AIX partition level.

For Oracle 10g RAC shared storage, the same LUNs on the SAN have to be accessed concurrently by two partitions through two VIO servers. Thus, you cannot use a part of a physical disk or logical volume (LV) as a shared disk, because the definition is local to the LVM of a VIO server. So, when defining the mappings of the Oracle 10g RAC data disks on the VIO server, map only an entire LUN to the client partition.
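As an illustration only (the device and virtual target device names are assumptions, not taken from our configuration), mapping the same SAN LUN as a whole disk from both VIO servers to the same client partition looks like this:

# On VIOS1 (padmin)
mkvdev -vdev hdisk10 -vadapter vhost0 -dev rac_data1

# On VIOS2 (padmin); the same LUN can appear there under a different hdisk number
mkvdev -vdev hdisk8 -vadapter vhost0 -dev rac_data1

The client partition then sees a single hdisk with two virtual SCSI paths, as described later in this section.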

For disks to be used as rootvg by the client partitions, you can map logical volumes in the VIO server if you do not want to dedicate an entire disk (LUN) for one partition. See 7.2.2, “Internal disk for client LPAR rootvg” on page 237 for more details.

No single point of failure disk configuration (general configuration)
The architecture proposed in Figure 7-2 uses a virtualized disk environment with no single point of failure (SPOF). The SAN provides performance and disk reliability. The VIO server allows partitions that do not have a SAN connection to share the Fibre Channel connections.

Note: All LUNs on external storage (that will be used to hold Oracle 10g RAC data files) must be mapped in the VIO server as a whole disk to the client partitions. The reserve policy must be set to no_reserve.


Figure 7-2 External storage disk configuration with MPIO and redundant VIO servers

The same LUN is accessed by two VIO servers and mapped (the entire disk) to the same AIX node. The AIX MPIO layer is aware that the two paths point to the same disk.

If any of the SAN switches, cables, or physical HBAs fail, or if a VIO server is stopped, there is still another path available to reach the disk. MPIO manages the load balancing and failover at the AIX level, which is transparent to Oracle. Thus, a dual HBA in each VIO server is not mandatory, because it does not add further protection at the node level.

If a VIO server is stopped, it results in the failure of one path to the storage. MPIO fails over to the surviving path without requiring administrative action, and when the path is back, MPIO reintegrates it automatically. The failure and failback are completely transparent; there is nothing to do. Only errors are stored in the AIX error report to keep track of the events.
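You can watch this behavior from the client partition with the lspath command. The following output is only illustrative (hdisk7 is an example device); each line shows the path status (Enabled, Failed, or Missing) through the two virtual SCSI client adapters:

root@alamo1:/> lspath -l hdisk7
Enabled hdisk7 vscsi0
Enabled hdisk7 vscsi1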

No single point of failure disk configuration using DS4xxx storage
The IBM System Storage DS4000 Series (which we used for testing purposes in this publication) has special requirements for the host connection, the host being the VIO server in our case. This storage must be attached using two Fibre Channel interfaces, each of which is in a different SAN zone. For a specified LUN, one Fibre Channel path is used normally (the preferred path), while the second FC path is used only in the case of a failure of the first one. In this configuration, there is no load balancing for a single LUN. To achieve static load balancing for an array, configure LUNs in pairs that use different preferred paths. The driver managing the DS4000 connections is RDAC; MPIO is not used at the VIO server level. However, the client partitions use MPIO (over the virtual SCSI adapters) to manage the two paths to the same disk. We show this architecture in Figure 7-3.


Figure 7-3 Disk configuration with redundant VIO servers for DS4000

7.2.2 Internal disk for client LPAR rootvg

The root volume group of the client partition can use a virtual SCSI disk mapped on an internal disk of the VIO server. Volume groups other than the volume groups that are used for Oracle 10g RAC data (which are accessed concurrently from two or more nodes) can also take advantage of this virtual architecture. Because these volume groups are accessed by only one server (no shared or concurrent access), you can also map a logical volume on the VIO server as an hdisk on the client partition.

One of the goals of virtualization is to utilize the resources in the best manner possible. For example, one 143 GB disk might be too large for a single rootvg; thus, you can efficiently use the space on this disk by allocating logical volumes on this disk at the VIO server level and mapping the LVs as virtual SCSI disks used for rootvg to each client LPAR. Another goal is to share resources between various client LPARs. A limited number of internal disks in the VIO server can be shared by a large number of client LPARs. With the LVM mirroring proposed next, you can further increase the level of high availability.

Mirrored rootvg
To remove all the SPOFs, we use two VIO servers, and the rootvg must be mirrored using LVM, which we display in Figure 7-4.

A failure of an internal disk in one VIO server, or a shutdown of the VIO server, results in the failure of one LV copy. However, rootvg is still alive. After the VIO server is rebooted (for example), you must resynchronize the mirror (syncvg command).
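A sketch of the client-side commands for this setup and for the resynchronization follows (hdisk0 and hdisk1 stand for the two virtual SCSI disks, one served by each VIO server; the names are assumptions):

# One-time setup on the client partition
extendvg rootvg hdisk1
mirrorvg rootvg hdisk1
bosboot -ad /dev/hdisk1
bootlist -m normal hdisk0 hdisk1

# After the failed VIO server is back, resynchronize the stale copies
syncvg -v rootvg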


Figure 7-4 Internal SCSI disk configuration for rootvg with LVM mirroring and redundant VIO servers

7.3 System configuration

In this final scenario (Figure 7-5 on page 239), we use two frames to avoid a single point of failure. There is one RAC node per frame. In this configuration, we define four VIO servers, two per frame, which is the minimum required configuration. Of course, these VIO servers can be shared with all the other partitions in the same frame (other LPARs that are not part of the RAC cluster). Actually, two VIO servers per frame is a good setup for all virtualization needs and also provides the required high availability.


Figure 7-5 Oracle RAC configuration with two virtualized System p servers


Chapter 8. Deploying test environments using virtualized SAN

The architecture that we propose in this chapter is based entirely on IBM System p virtualization features. Its target is to build a cost-effective infrastructure for deploying Oracle 10g RAC test configurations. Because all of the partitions that are used for this type of a configuration are located in the same frame, the configuration does not provide the highest possible availability and is unsuitable for disaster recovery.

The architecture that we propose is suitable for development and testing purposes. One of its goals is to use the least possible number of disk and network (physical) adapters, which we achieve by sharing the same physical disk for all of the rootvg and virtual Ethernet adapters, for example. Creating a virtual SAN resource further contributes to reducing the hardware and administrative costs that are required to deploy clusters. You can create lightweight partitions with almost no hardware. For example, if you have a virtual I/O (VIO) server and one free disk, you can create two partitions for running Oracle 10g RAC easily. Of course, this configuration cannot match the performance of a similar environment with dedicated physical resources.


Disclaimer: The configuration examples using System p virtual resources (VIO Server, virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently NOT supported in all configurations. As of the release date of this book, virtual SCSI disks are only supported using Oracle's ASM, but not with GPFS. For the current IBM/Oracle cross-certification status, check the following URL:

http://www.oracle.com/technology/support/metalink/index.html


8.1 Totally virtualized simple architecture

Figure 8-1 shows an example of this type of a test architecture. The configuration consists of:

� Three partitions (including the VIO server) are the minimum number required.

� Two separate networks are also required: one network for private interconnect and one network for public client access (which is also used for administrative purposes).

� The number of disks depends on the space required for your database. The shared disks that will be used for GPFS must be separate logical unit numbers (LUNs).

Logical volumes can be assigned as virtual SCSI disks, but they can only be used in one LPAR; they cannot be shared between two client LPARs.

Figure 8-1 Simple fully virtualized Oracle 10g RAC architecture for development or testing

This architecture is not highly available. There are several single points of failure. Usually, Oracle 10g RAC is used for providing high availability or even disaster recovery capabilities. However, the goal of this configuration is to deploy a real (but lightweight) RAC.


This configuration is also suitable for testing the application scalability in a RAC environment. You can expand the cluster easily by creating a new logical partition (LPAR) and adding this LPAR as a node to the existing RAC cluster. For details, refer to 3.3, “Adding a node to an existing RAC” on page 117.

8.1.1 Disk configuration

The disks that are used are only internal to the frame (no SAN connection to external storage), and they are all managed by the VIO server. The RAC nodes are using virtual Small Computer System Interface (SCSI) disks. There is no physical disk requirement in the RAC nodes.

You can create a GPFS file system with as little as one SCSI disk. This disk can hold the data and metadata and be used as a tiebreaker disk at the same time.
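A minimal sketch of such a one-disk GPFS 3.1 configuration follows (the node names, NSD name, device name, and mount point are assumptions for illustration; refer to the GPFS documentation for the exact disk descriptor format of your release):

# /tmp/disk.desc - one virtual SCSI disk holding data and metadata
hdisk1:::dataAndMetadata::nsd1:

mmcrcluster -N client1:quorum,client2:quorum -p client1 -s client2 -r /usr/bin/ssh -R /usr/bin/scp
mmcrnsd -F /tmp/disk.desc
mmchconfig tiebreakerDisks=nsd1
mmstartup -a
mmcrfs /oradata oradata -F /tmp/disk.desc -A yes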

8.1.2 Network configuration

The networks are also virtual. The Oracle 10g RAC private interconnect network is virtual between the two RAC nodes. It does not go through the VIO server, because there is no need to be connected to the outside world. The client (public) network is also virtual and uses a single physical interface as a Shared Ethernet Adapter (SEA) in the VIO server.

When creating the virtual adapters, you must make sure that the interface number (en#) is the same on all of the nodes for the same network (interconnect and public). The Oracle Clusterware configuration uses the interface number to define the networks, not the associated IP label (name).
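You can check and, if necessary, set the interfaces that Oracle Clusterware uses with the oifcfg utility. The following sketch assumes that en5 carries the 10.1.100.0 interconnect and en6 the 192.168.100.0 public network (adapt the interface names and subnets to your configuration):

$ oifcfg getif
$ oifcfg setif -global en5/10.1.100.0:cluster_interconnect
$ oifcfg setif -global en6/192.168.100.0:public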

The outbound client traffic for two RAC nodes shares the same physical interface. If this design becomes a bottleneck, you can use a link aggregation interface (EtherChannel) with two or more physical interfaces.


Note: This configuration provides no data protection, which is acceptable because this is a test environment.


8.1.3 Creating virtual adapters (VIO server and clients)

In this step, we create the virtual adapters using the Hardware Management Console (HMC) interface. We create a virtual SCSI server adapter for each client partition as shown in Figure 8-2. The steps are:

Figure 8-2 Virtual SCSI server adapters in the VIO server


1. To create a shared Ethernet adapter in the VIO server, you need to define (also in the VIO server) at least one virtual Ethernet adapter. You must select Access external network when you create a virtual Ethernet adapter in the VIO server. The Ethernet adapter in Slot 2 that is shown in Figure 8-3 has the Bridged attribute set to Yes, which means that this Ethernet adapter can be used as a shared Ethernet adapter.

Figure 8-3 Virtual Ethernet adapter in the VIO server


2. Create a virtual SCSI client adapter for the client1 partition as shown in Figure 8-4. Remember that the slot numbers for the SCSI client adapter must match the slot numbers of the server adapter (defined in the VIO server).

Figure 8-4 Virtual SCSI client adapter in the client1 partition
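A quick way to verify the slot pairing is sketched below (the adapter names and location code are examples only): on the client partition, lscfg shows the slot (the C number) of the virtual SCSI client adapter, and on the VIO server, lsmap shows the corresponding vhost adapter and its mappings.

root@client1:/> lscfg -l vscsi0
  vscsi0   U9117.570.10C5D5C-V10-C3   Virtual SCSI Client Adapter

$ lsmap -vadapter vhost0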


3. Create the virtual Ethernet adapters for the client1 LPAR as shown in Figure 8-5. We create two Ethernet adapters: one Ethernet adapter for the interconnect and one Ethernet adapter for the client network.

Figure 8-5 Virtual Ethernet adapter in the client1 partition


4. Create a virtual SCSI client adapter in the client2 partition as shown in Figure 8-6.

Figure 8-6 Virtual SCSI client adapter in the client2 partition


5. Create the virtual Ethernet adapters for the client2 LPAR as shown in Figure 8-7.

Figure 8-7 Virtual Ethernet adapter in the client2 partition

8.1.4 Configuring virtual resources in VIO server

Example 8-1 shows how to configure a VIO server with the adapters that were defined in the previous section.

Example 8-1 Configuring the VIO server

### In the VIO server, run the following commands

$ lspv
NAME            PVID                 VG              STATUS
hdisk0          0022be2abc04a1ca     rootvg          active
hdisk1          00cc5d5c6b8fd309     None
hdisk2          0022be2a80b97feb     None
hdisk3          0022be2abc247c91     None


### Create a volume group for client partitions

$ mkvg -vg client_rootvg hdisk1
client_rootvg

### Create logical volumes for client partitions

$ mklv -lv client1vg_lv client_rootvg 10G
$ mklv -lv client2vg_lv client_rootvg 10G

### Verify that logical volumes are properly created

$ lsvg -lv client_rootvg
client_rootvg:
LV NAME             TYPE       LPs   PPs   PVs  LV STATE      MOUNT POINT
client1vg_lv        jfs        160   160   1    closed/syncd  N/A
client2vg_lv        jfs        160   160   1    closed/syncd  N/A

### Verify virtual SCSI adapters

$ lsdev -virtual
name     status     description
ent3     Available  Virtual I/O Ethernet Adapter (l-lan)
vhost0   Available  Virtual SCSI Server Adapter
vhost1   Available  Virtual SCSI Server Adapter
vsa0     Available  LPAR Virtual Serial Adapter

### Create virtual target devices for mapping backing devices to virtual SCSI adapters. These devices will be used for local client rootvgs.

$ mkvdev -vdev client1vg_lv -vadapter vhost0 -dev clinet1_vg_vtd
clinet1_vg_vtd Available

$ mkvdev -vdev client2vg_lv -vadapter vhost1 -dev clinet2_vg_vtd
clinet2_vg_vtd Available

### Create virtual target devices for mapping backing devices to virtual SCSI adapters. These devices will be used for shared disks.

$ mkvdev -vdev hdisk2 -vadapter vhost0
vtscsi0 Available

$ mkvdev -vdev hdisk2 -vadapter vhost1
"hdisk2" is already being used as a backing device. Specify the -f flag
to force this device to be used anyway.

$ mkvdev -f -vdev hdisk2 -vadapter vhost1
vtscsi1 Available

### Verify mapping information between backing devices and virtual SCSI adapters

$ lsmap -all |more

SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost0          U9117.570.10C5D5C-V5-C3                      0x00000000

VTD                   clinet1_vg_vtd
LUN                   0x8100000000000000
Backing device        client1vg_lv
Physloc

VTD                   vtscsi0
LUN                   0x8200000000000000
Backing device        hdisk2
Physloc               U7879.001.DQDKZNP-P1-T14-L4-L0

SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost1          U9117.570.10C5D5C-V5-C4                      0x00000000

VTD                   clinet2_vg_vtd
LUN                   0x8100000000000000
Backing device        client2vg_lv
Physloc

VTD                   vtscsi1
LUN                   0x8200000000000000
Backing device        hdisk2
Physloc               U7879.001.DQDKZNP-P1-T14-L4-L0

### Choose a physical Ethernet adapter and a virtual Ethernet adapter to create a shared Ethernet adapter. Make sure that no IP address is assigned to the physical adapter at the time the shared Ethernet adapter is created.

$ lsdev -vpd | grep ent
  Model Implementation: Multiple Processor, PCI bus
  ent3    U9117.570.10C5D5C-V5-C2-T1   Virtual I/O Ethernet Adapter (l-lan)
  ent2    U7879.001.DQDKZNP-P1-C4-T1   10/100 Mbps Ethernet PCI Adapter II (1410ff01)
  ent0    U7879.001.DQDKZNV-P1-C5-T1   2-Port Gigabit Ethernet-SX PCI-X Adapter (14108802)
  ent1    U7879.001.DQDKZNV-P1-C5-T2   2-Port Gigabit Ethernet-SX PCI-X Adapter (14108802)
  Device Type: PowerPC-External-Interrupt-Presentation

### Create a shared Ethernet adapter.

$ mkvdev -sea ent0 -vadapter ent3 -default ent3 -defaultid 1
ent4 Available
en4
et4

### Check the virtual devices on both nodes defined in client partitions

root@client1:/> lspv
hdisk0          00c7cd9e76c83540                    rootvg          active
hdisk1          00c7cd9ece71f8d4                    None

root@client1:/> lsdev -Cc adapter
ent0   Available  Virtual I/O Ethernet Adapter (l-lan)
ent1   Available  Virtual I/O Ethernet Adapter (l-lan)
vsa0   Available  LPAR Virtual Serial Adapter
vscsi0 Available  Virtual SCSI Client Adapter

When the VIO server and the client partitions have been configured, proceed to install the operating system on the client1 and client2 LPARs (we have used the Network Installation Management (NIM) installation), configure networking, and configure GPFS.

Install Oracle Clusterware and database as described in Chapter 2, “Basic RAC configuration with GPFS” on page 19.


Part 5 Appendixes

This part contains helpful information about various aspects of installing, configuring, and maintaining your environment, but the information is either not directly related to the mainstream topic of this publication, such as the Secure Shell configuration and the GPFS 2.3 installation, or is described as an extension to other documents or manuals.


Appendix A. EtherChannel parameters on AIX

We use this procedure to configure an EtherChannel interface in our test environment:

1. Type smitty etherchannel at the command line.

2. Select Add an EtherChannel / Link Aggregation from the list and press Enter.

3. Select the Ethernet adapters that you want in your EtherChannel, and press Enter. If you are planning to use EtherChannel backup, do not select the adapter that you plan to use for the backup at this point.

The EtherChannel backup option is available in AIX 5.2 and later.

Enter the information in the fields according to the following guidelines:

� Parent Adapter: This field provides information about an EtherChannel's parent device (for example, when an EtherChannel belongs to a Shared Ethernet Adapter). This field displays a value of NONE if the EtherChannel is not contained within another adapter (the default). If the EtherChannel is contained within another adapter, this field displays the parent adapter’s name (for example, ent6). This field is informational only and cannot be modified. The parent adapter option is available in AIX 5.3 and later.

� EtherChannel / Link Aggregation Adapters: You see all of the primary adapters that you use in your EtherChannel. You selected these adapters in the previous step.

� Enable Alternate Address: This field is optional. Setting this to yes enables you to specify the MAC address that you want the EtherChannel to use. If you set this option to no, the EtherChannel uses the MAC address of the first adapter.

� Alternate Address: If you set Enable Alternate Address to yes, specify the MAC address that you want to use here. The address that you specify must start with 0x and be a 12-digit hexadecimal address (for example, 0x001122334455).

� Enable Gigabit Ethernet Jumbo Frames: This field is optional. In order to use this field, your switch must support jumbo frames, which will only work with a Standard Ethernet (en) interface, not an IEEE 802.3 (et) interface. Set this to yes to enable it.


Tip: The Available Network Adapters window displays all Ethernet adapters. If you select an Ethernet adapter that is already being used (has a defined interface), you get an error message. You first need to detach this interface if you want to use it.


� Mode: You can choose from the following modes:

– Standard: In this mode, the EtherChannel uses an algorithm to choose on which adapter it will send the packets out. The algorithm consists of taking a data value, dividing it by the number of adapters in the EtherChannel, and using the remainder (using the modulus operator) to identify the outgoing link.

– Round_robin: In this mode, the EtherChannel rotates through the adapters, giving each adapter one packet before repeating. The packets might be sent out in a slightly different order than they are given to the EtherChannel, but it makes the best use of its bandwidth. It is an invalid combination to select this mode with a Hash Mode other than default. If you choose the round-robin mode, leave the Hash Mode value as default.

– 8023ad: This option enables the use of the IEEE 802.3ad Link Aggregation Control Protocol (LACP) for automatic link aggregation. For more details about this feature, refer to the IEEE 802.3ad Link Aggregation configuration.

– Netif_backup: This option is available only in AIX 5.1 and AIX 4.3.3. In this mode, the EtherChannel activates only one adapter at a time. The intention is that the adapters are plugged into different Ethernet switches, each of which is capable of getting to any other machine on the subnet or network. When a problem is detected either with the direct connection (or optionally through the inability to ping a machine), the EtherChannel deactivates the current adapter and activates a backup adapter. This mode is the only mode that uses the Internet Address to Ping, Number of Retries, and Retry Timeout fields. Network Interface Backup Mode does not exist as an explicit mode in AIX 5.2 and later. To enable Network Interface Backup Mode in AIX 5.2 and later, you can configure multiple adapters in the primary EtherChannel and a backup adapter. For more information, see Configuring Network Interface Backup in AIX 5.2 and later.

� Hash mode: The Hash Mode value determines which data value is fed into this algorithm (see the Hash Mode attribute for an explanation of the different hash modes). For example, if the Hash Mode is default, it uses the packet’s destination IP address. If this is 10.10.10.11 and there are two adapters in the EtherChannel, (11 / 2) = 5 with remainder one, so the second adapter is used (the adapters are numbered starting from zero). The adapters are numbered in the order they are listed in the SMIT menu. This is the default operation mode.

Choose from the following hash modes; this field determines the data value that is used by the algorithm to determine the outgoing adapter:

– Default: The destination IP address of the packet is used to determine the outgoing adapter. For non-IP traffic (such as ARP), the last byte of the destination MAC address is used to perform the calculation. This mode guarantees packets are sent out over the EtherChannel in the order in which they are received, but it might not make full use of the bandwidth.

– src_port: The source UDP or TCP port value of the packet is used to determine the outgoing adapter. If the packet is not UDP or TCP traffic, the last byte of the destination IP address is used. If the packet is not IP traffic, the last byte of the destination MAC address is used.

– dst_port: The destination UDP or TCP port value of the packet is used to determine the outgoing adapter. If the packet is not UDP or TCP traffic, the last byte of the destination IP will be used. If the packet is not IP traffic, the last byte of the destination MAC address is used.


– src_dst_port: The source and destination UDP or TCP port values of the packet are used to determine the outgoing adapter (specifically, the source and destination ports are added and then divided by two before being fed into the algorithm). If the packet is not UDP or TCP traffic, the last byte of the destination IP is used. If the packet is not IP traffic, the last byte of the destination MAC address is used. This mode gives good packet distribution in most situations, both for clients and servers.

� Backup Adapter: This field is optional. Enter the adapter that you want to use as your EtherChannel backup.

� Internet Address to Ping: This field is optional and only takes effect if you are running Network Interface Backup mode, or if you have one or more adapters in the EtherChannel and a backup adapter. The EtherChannel pings the IP address or host name that you specify here. If the EtherChannel is unable to ping this address for the number of times specified in the Number of Retries field, and in the intervals specified in the Retry Timeout field, the EtherChannel switches adapters.

� Number of Retries: Enter the number of ping response failures that are allowed before the EtherChannel switches adapters. The default is three. This field is optional and valid only if you set an Internet Address to Ping.

� Retry Timeout: Enter the number of seconds between the EtherChannel’s ping attempts to the Internet Address to Ping. The default is one second. This field is optional and valid only if you have set an Internet Address to Ping.

4. Press Enter after changing the desired fields to create the EtherChannel. Configure IP over the newly created EtherChannel device by typing smitty chinet at the command line.

5. Select your new EtherChannel interface from the list. Fill in all of the required fields and press Enter.
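After the EtherChannel is created, you can also review and adjust its attributes from the command line. The following is only a sketch (ent3 and en3 are the device names from our example, and on some AIX levels you must first detach the interface before changing EtherChannel attributes):

lsattr -El ent3
ifconfig en3 detach
chdev -l ent3 -a netaddr=10.10.10.1 -a num_retries=2 -a retry_time=1
chdev -l en3 -a netaddr=10.10.10.2 -a netmask=255.255.255.0 -a state=up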

Note: It is an invalid combination to select a Hash Mode other than default with a Mode of round_robin.


Appendix B. Setting up trusted ssh in a cluster

This appendix describes how to set up ssh to accept trusted connections for the root user on the nodes within the cluster. In our example, we reuse the server keys that already exist on host alamo1, which were generated during the ssh installation. Host alamo1 is used for all the definitions, which are copied to the other nodes in the cluster. The setup relies on three types of files:

� SSH server keys

Located in /etc/ssh

� User keys

Located in the user’s home directory .ssh (~/.ssh)

� Authentication files

Store information of the trusted servers/users (also located in ~/.ssh):

• known_hosts

Stores the public keys for known hosts (sshd - the server) together with IP addresses

• authorized_keys

Stores the public keys for known (authorized) users


Note: The methods described allow for encrypted traffic between the cluster nodes without the need to enter a password or passphrase, which means that even though traffic is encrypted, any malicious hacker, who gains access to one of the cluster nodes, has access to all cluster nodes.

Important: As we change the server keys on some of the nodes in the cluster, be careful. Doing things in the wrong order might prevent you from logging on to the systems (especially if ssh is the only way to access the system over the network).


In this example, we reuse the existing keys on server alamo1. The steps for configuring root user access are:

1. Generate root user (client) key set.

2. Create the known_hosts file that contains the public key of the hosts (ssh servers) for user root connection.

3. Create the authorized_keys file storing the public key of the root@cluster_nodes.

4. Distribute keys: both server level (/etc/ssh) and user level (~/.ssh).

The detailed steps are:

1. Generate the root user (client) key set.

To generate the root user (client) keys, we use ssh-keygen, as shown in Example B-1.

Example: B-1 Generating and displaying client keys

root@alamo1:/> ssh-keygen -t rsa -q -f ~/.ssh/id_rsa -N ''
root@alamo1:/> ls -l ~/.ssh
total 32
-rw-r--r--   1 root     system          393 Sep 17 16:00 authorized_keys
-rw-------   1 root     system         1675 Sep 17 15:55 id_rsa
-rw-r--r--   1 root     system          393 Sep 17 15:55 id_rsa.pub
-rw-r--r--   1 root     system          489 Sep 17 15:59 known_hosts
root@alamo1:/> cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEArpdFZc+ynyxG8jS0wE0YBT9l6ztuvZ+p7GGoivP4PdBMHD+KdEoZ/w42A+kYREOV5/0TN4+8wfgYBCl8ZvcZg2zQ6/Pamh1nsGKaXPLEd4rllPxyPTsZi1rCmUcAx2+qN7Rktx4/WWqYZOdZQ54xHQqHk0uNnNkfENSRNSBhqsKqnCiob0ITjt8GvG15qyvg+1OxK6Q72P52DjmU8Tr1zPY9P8zVYFdWes5jLnPxW79UjPiMv3c5J2k0AVxLreQOLykXBsnaXH+PP0/76pK46mzYbz/weVLcsnWXRZTus2qRkWSlR9jJ8SfZVsRM0zG50pDgn0OVV6p+TqKGW6MdeQ== root@alamo1
root@alamo1:/.ssh>

2. Create the known_hosts file.

For now, we use a different name for the file, cluster_known_hosts, not known_hosts (the name the SSH client uses), because we do not want to use this file at this time. The cluster_known_hosts file contains the SSH server public keys, but the file has to be copied to all nodes and renamed to ~/.ssh/known_hosts before it becomes active.

Example: B-2 Server public key in known_hosts file

root@alamo1:/> cat /etc/ssh/ssh_host_rsa_key.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEArmEnYFkEbv6BF3rQZBPQzFX5HINv5Z3jXVFc6jDTJiW6iJ8s/zrPHcpBhY7+VDFreXhBCuCNFfhDfbQTu28e7BXaklcKeoG9lI2pdGThxmchTpAnW2vummmfkHwnG+TJHZJIL1lOP8F+XOB9jwFGV0oogSWqq662WLdhRqA6wbyu8DxWJyMQ0ZQZEeGUPaPVPHFd5fwm0eh7KKDCmyNXXCNf1bY9v0nyGchwNULEBOV/Y6BGnmYYTKcA4jr/VsRypHqKlbbGdpIZ295QIla4iWLeV9K0F/KbEp9lY/dY9L6Zz9qlqa+u/7jX1A+IexG7hxKTFsnj/tGHj+sUC9IaPw==
root@alamo1:/> cd .ssh
root@alamo1:/.ssh> pwd
/.ssh
root@alamo1:/.ssh> ssh-keyscan -t rsa alamo1 >> ~/.ssh/cluster_known_hosts
# austin1 SSH-1.99-OpenSSH_4.3
root@alamo1:/.ssh> cat ~/.ssh/cluster_known_hosts
alamo1 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEArmEnYFkEbv6BF3rQZBPQzFX5HINv5Z3jXVFc6jDTJiW6iJ8s/zrPHcpBhY7+VDFreXhBCuCNFfhDfbQTu28e7BXaklcKeoG9lI2pdGThxmchTpAnW2vummmfkHwnG+TJHZJIL1lOP8F+XOB9jwFGV0oogSWqq662WLdhRqA6wbyu8DxWJyMQ0ZQZEeGUPaPVPHFd5fwm0eh7KKDCmyNXXCNf1bY9v0nyGchwNULEBOV/Y6BGnmYYTKcA4jr/VsRypHqKlbbGdpIZ295QIla4iWLeV9K0F/KbEp9lY/dY9L6Zz9qlqa+u/7jX1A+IexG7hxKTFsnj/tGHj+sUC9IaPw==

a. Then, edit the file, adding the remaining nodes in the cluster. See Example B-3.

Example: B-3 The updated cluster_known_hosts

root@alamo1:/.ssh> ls -l
total 24
-rw-r--r--   1 root     system          489 Sep 17 15:59 cluster_known_hosts
-rw-------   1 root     system         1675 Sep 17 15:55 id_rsa
-rw-r--r--   1 root     system          393 Sep 17 15:55 id_rsa.pub
root@alamo1:/.ssh> cat cluster_known_hosts
alamo1,192.168.100.53,alamo1_interconnect,10.1.100.53,alamo2,192.168.100.54,alamo2_interconnect,10.1.100.54 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEArmEnYFkEbv6BF3rQZBPQzFX5HINv5Z3jXVFc6jDTJiW6iJ8s/zrPHcpBhY7+VDFreXhBCuCNFfhDfbQTu28e7BXaklcKeoG9lI2pdGThxmchTpAnW2vummmfkHwnG+TJHZJIL1lOP8F+XOB9jwFGV0oogSWqq662WLdhRqA6wbyu8DxWJyMQ0ZQZEeGUPaPVPHFd5fwm0eh7KKDCmyNXXCNf1bY9v0nyGchwNULEBOV/Y6BGnmYYTKcA4jr/VsRypHqKlbbGdpIZ295QIla4iWLeV9K0F/KbEp9lY/dY9L6Zz9qlqa+u/7jX1A+IexG7hxKTFsnj/tGHj+sUC9IaPw==
root@alamo1:/.ssh>

b. As seen in Example B-3, there are two nodes in the cluster, alamo1 and alamo2. Both nodes are accessible in the 192.168.100.x and 10.1.100.x subnets.

3. Create the authorized_keys.

Because we intend to use the same keys on all nodes, the authorized keys will be just the root user’s key. See Example B-4.

Example: B-4 Authorized users file

root@alamo1:/.ssh> cat id_rsa.pub >> ~/.ssh/authorized_keys
root@alamo1:/.ssh> cat ~/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEArpdFZc+ynyxG8jS0wE0YBT9l6ztuvZ+p7GGoivP4PdBMHD+KdEoZ/w42A+kYREOV5/0TN4+8wfgYBCl8ZvcZg2zQ6/Pamh1nsGKaXPLEd4rllPxyPTsZi1rCmUcAx2+qN7Rktx4/WWqYZOdZQ54xHQqHk0uNnNkfENSRNSBhqsKqnCiob0ITjt8GvG15qyvg+1OxK6Q72P52DjmU8Tr1zPY9P8zVYFdWes5jLnPxW79UjPiMv3c5J2k0AVxLreQOLykXBsnaXH+PP0/76pK46mzYbz/weVLcsnWXRZTus2qRkWSlR9jJ8SfZVsRM0zG50pDgn0OVV6p+TqKGW6MdeQ== root@alamo1
root@alamo1:/.ssh>

4. Distribute files to the remaining nodes in the cluster.

Copy the entire /etc/ssh directory to other nodes in the cluster (in this case, alamo2):

a. Make sure that file read/write modes are maintained when copied. After copying, restart sshd. Example B-5 shows the distribution of the server keys.

Example: B-5 Distribute server keys (/etc/ssh directory)

root@alamo1:/.ssh> scp -pr /etc/ssh/* root@alamo2:/etc/ssh
The authenticity of host 'alamo2 (192.168.100.54)' can't be established.
RSA key fingerprint is 9c:8d:7c:51:ce:f2:4d:06:93:64:07:0b:94:43:2f:1a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'alamo2,192.168.100.54' (RSA) to the list of known hosts.
root@alamo2's password:
moduli                  100%  130KB 129.7KB/s   00:00
ssh_config              100% 1354     1.3KB/s   00:00
ssh_host_dsa_key        100%  672     0.7KB/s   00:00
ssh_host_dsa_key.pub    100%  590     0.6KB/s   00:00
ssh_host_key            100%  963     0.9KB/s   00:00
ssh_host_key.pub        100%  627     0.6KB/s   00:00
ssh_host_rsa_key        100% 1675     1.6KB/s   00:00
ssh_host_rsa_key.pub    100%  382     0.4KB/s   00:00
ssh_prng_cmds           100% 2341     2.3KB/s   00:00
sshd.pid                100%    7     0.0KB/s   00:00
sshd_config             100% 2865     2.8KB/s   00:00
root@alamo1:/.ssh>
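The example above copies the keys to the single remaining node. For clusters with more than two nodes, the same copy can be driven from a node list; this is only a sketch, and alamo3 and alamo4 are hypothetical node names used for illustration:

# Sketch: push the shared /etc/ssh contents to every other node in the cluster.
for NODE in alamo2 alamo3 alamo4
do
    scp -pr /etc/ssh/* root@${NODE}:/etc/ssh
done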

b. After changing the server keys, sshd must be restarted, as shown in Example B-6.

Example: B-6 Restart sshd on the nodes that have had their keys changed

root@alamo2:/> stopsrc -s sshd
0513-044 The sshd Subsystem was requested to stop.
root@alamo2:/> startsrc -s sshd
0513-059 The sshd Subsystem has been started. Subsystem PID is 295030.
root@alamo2:/>

c. Now, distribute the root user keys, as shown in Example B-7.

Example: B-7 Distribute root user files

root@alamo1:/.ssh> scp -pr /.ssh/* root@alamo2:/.ssh/.
root@alamo2's password:
authorized_keys         100%  393     0.4KB/s   00:00
cluster_known_hosts     100%  489     0.5KB/s   00:00
id_rsa                  100% 1675     1.6KB/s   00:00
id_rsa.pub              100%  393     0.4KB/s   00:00
known_hosts             100%  403     0.4KB/s   00:00
root@alamo1:/.ssh>

d. Now, everything is in place, so rename the cluster_known_hosts files to known_hosts on all nodes, as shown in Example B-8.

Example: B-8 Rename cluster_known_hosts

root@alamo2:/.ssh> ls -l
total 40
-rw-r--r--    1 root     system          393 Sep 17 16:00 authorized_keys
-rw-r--r--    1 root     system          489 Sep 17 15:59 cluster_known_hosts
-rw-------    1 root     system         1675 Sep 17 15:55 id_rsa
-rw-r--r--    1 root     system          393 Sep 17 15:55 id_rsa.pub
-rw-r--r--    1 root     system          403 Sep 17 16:01 known_hosts
root@alamo2:/.ssh> mv cluster_known_hosts known_hosts
root@alamo2:/.ssh>
Connection to alamo2 closed.
root@alamo1:/.ssh> mv cluster_known_hosts known_hosts

e. Finally, verify that everything works. We use the command shown in Example B-9.

Example: B-9 Verification

root@alamo1:/.ssh> ssh alamo1 ssh alamo2 ssh alamo2 ssh alamo1 date
Mon Sep 17 16:10:46 CDT 2007
root@alamo1:/.ssh>
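The chained command above only exercises one path through the cluster. If you want to check every source and destination combination, including the interconnect labels, a nested loop such as the following sketch can be used (the node and interface names are the ones used in this example; adjust them for your cluster):

# Sketch: run a remote date command from every node to every node; any
# password prompt or host-key question indicates a node that is not set
# up correctly.
for SRC in alamo1 alamo2
do
    for DST in alamo1 alamo2 alamo1_interconnect alamo2_interconnect
    do
        echo "${SRC} -> ${DST}:"
        ssh ${SRC} ssh ${DST} date
    done
done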


Appendix C. Creating a GPFS 2.3 cluster

This appendix shows the procedure that we use to create a GPFS 2.3 cluster and the file systems that we use in 3.7, “GPFS upgrade from 2.3 to 3.1” on page 138. We set up the GPFS cluster with three tiebreaker disks. The steps are:

1. First, we get the disk information regarding the logical unit numbers (LUNs) that we use for this test, which is shown in Example C-1.

Example: C-1 Listing the DS4000 Series LUNs

root@dallas1:/etc/gpfs_config> fget_config -vA

---dar0---

User array name = 'Austin_DS4800'
dac0 ACTIVE dac1 ACTIVE

Disk     DAC   LUN Logical Drive
hdisk2   dac1    0 DALLAS_ocr1
hdisk3   dac0    1 DALLAS_ocr2
hdisk4   dac0    2 DALLAS_vote1
hdisk5   dac1    3 DALLAS_vote2
hdisk6   dac1    4 DALLAS_vote3
hdisk7   dac0    5 DALLAS_gptb1
hdisk8   dac0    6 DALLAS_gptb2
hdisk9   dac1    7 DALLAS_gptb3
hdisk10  dac1    8 DALLAS_gpDataA1
hdisk11  dac0    9 DALLAS_gpDataA2
hdisk12  dac0   10 DALLAS_gpDataB1
hdisk13  dac1   11 DALLAS_gpDataB2
hdisk14  dac1   12 DALLAS_gpOraHomeA
hdisk15  dac0   13 DALLAS_gpOraHomeB

2. We create the node and disk descriptor files in /etc/gpfs_config. They are listed in Example C-2 on page 264.


Example: C-2 GPFS descriptor files

root@dallas1:/etc/gpfs_config> for F in *
> do
> echo "File $F contains:"
> cat $F
> echo
> done
File gpfs_disks_orabin contains:
hdisk14:::dataAndMetadata:1:nsd05
hdisk15:::dataAndMetadata:2:nsd06

File gpfs_disks_oradata contains:
hdisk10:::dataAndMetadata:1:nsd01
hdisk11:::dataAndMetadata:1:nsd02
hdisk12:::dataAndMetadata:2:nsd03
hdisk13:::dataAndMetadata:2:nsd04

File gpfs_disks_tb contains:
hdisk7:::::nsd_tb1
hdisk8:::::nsd_tb2
hdisk9:::::nsd_tb3

File gpfs_nodes contains:
dallas1_interconnect:quorum-manager
dallas2_interconnect:quorum-manager

3. We create the cluster using the gpfs_nodes descriptor file, as shown in Example C-3.

Example: C-3 Creating the cluster

root@dallas1:/> mmcrcluster -n /etc/gpfs_config/gpfs_nodes -p dallas1_interconnect -s dallas2_interconnect -r /usr/bin/ssh -R /usr/bin/scp -C dallas_cluster -A
Thu Sep 13 13:34:04 CDT 2007: 6027-1664 mmcrcluster: Processing node dallas1_interconnect
Thu Sep 13 13:34:06 CDT 2007: 6027-1664 mmcrcluster: Processing node dallas2_interconnect
mmcrcluster: Command successfully completed
mmcrcluster: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.
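Before continuing, it can be worth confirming the cluster definition that was just created; a minimal check (output not shown here):

# Sketch: display the cluster name, configuration servers, remote shell
# settings, and member nodes defined by mmcrcluster.
mmlscluster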

4. We create the Network Shared Disks (NSDs) using the descriptor file gpfs_disks_tb for the tiebreaker disks, gpfs_disks_oradata for the /oradata file system disks, and gpfs_disks_orabin for the /orabin file system, as shown in Example C-4 on page 265.


Example: C-4 Creating the NSDs

root@dallas1:/> mmcrnsd
mmcrnsd: 6027-1268 Missing arguments
Usage: mmcrnsd -F DescFile [-v {yes | no}]
root@dallas1:/> mmcrnsd -F /etc/gpfs_config/gpfs_disks_tb
mmcrnsd: Processing disk hdisk7
mmcrnsd: Processing disk hdisk8
mmcrnsd: Processing disk hdisk9
mmcrnsd: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.
root@dallas1:/> mmcrnsd -F /etc/gpfs_config/gpfs_disks_oradata
mmcrnsd: Processing disk hdisk10
mmcrnsd: Processing disk hdisk11
mmcrnsd: Processing disk hdisk12
mmcrnsd: Processing disk hdisk13
mmcrnsd: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.
root@dallas1:/> mmcrnsd -F /etc/gpfs_config/gpfs_disks_orabin
mmcrnsd: Processing disk hdisk14
mmcrnsd: Processing disk hdisk15
mmcrnsd: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.
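To confirm which NSDs now exist and how they map to the local hdisks, a quick check such as the following sketch can be run (output not shown):

# Sketch: list all NSDs known to the cluster together with their device mapping.
mmlsnsd -m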

5. Add the tiebreaker NSDs to the cluster configuration, as shown in Example C-5.

Example: C-5 Adding tiebreaker disks to the cluster

root@dallas1:/> mmchconfig
mmchconfig: 6027-1268 Missing arguments
Usage: mmchconfig Attribute=value[,Attribute=value...] [-i | -I] [-n NodeFile | NodeName[,NodeName,...]]
root@dallas1:/> mmchconfig tiebreakerDisks="nsd_tb1;nsd_tb2;nsd_tb3"
Verifying GPFS is stopped on all nodes ...
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.
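A quick way to verify that the tiebreaker setting was recorded in the cluster configuration is a check such as this minimal sketch:

# Sketch: the tiebreakerDisks attribute should now appear in the
# cluster configuration data.
mmlsconfig | grep -i tiebreaker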

6. The cluster is now ready, and we start GPFS on all nodes (Example C-6).

Example: C-6 Cluster startup

root@dallas1:/> mmstartup -a
Thu Sep 13 13:48:14 CDT 2007: 6027-1642 mmstartup: Starting GPFS ...
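Before creating the file systems, it is worth confirming that the GPFS daemon is up on all nodes; a minimal check:

# Sketch: both nodes should report the "active" GPFS state.
mmgetstate -a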

7. The /orabin file system is then created, as shown in Example C-7 on page 266. The block size is set to 256k; the file system is created with a maximum of 80k inodes.


Example: C-7 Creating the /orabin file system (GPFS)

root@dallas1:/> mmcrfs /orabin orabin -F /etc/gpfs_config/gpfs_disks_orabin -A yes -B 256k -m 2 -M 2 -r 2 -R 2 -n 4 -N 80k

GPFS: 6027-531 The following disks of orabin will be formatted on node dallas1:
    nsd05: size 10485760 KB
    nsd06: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Flushing Allocation Maps
GPFS: 6027-535 Disks up to size 27 GB can be added to this file system.
GPFS: 6027-572 Completed creation of file system /dev/orabin.
mmcrfs: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.
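If you want to double-check the attributes that were applied (block size, default and maximum replication, estimated number of nodes), they can be listed afterwards; a minimal sketch:

# Sketch: list the attributes of the newly created file system.
mmlsfs orabin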

8. Create the /oradata file system as shown in Example C-8.

Example: C-8 Creating the /oradata file system (GPFS)

root@dallas1:/> mmcrfs /oradata oradata -F /etc/gpfs_config/gpfs_disks_oradata -A yes -B 256k -m 2 -M 2 -r 2 -R 2 -n 4

GPFS: 6027-531 The following disks of oradata will be formatted on node dallas2:
    nsd01: size 10485760 KB
    nsd02: size 10485760 KB
    nsd03: size 10485760 KB
    nsd04: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Flushing Allocation Maps
GPFS: 6027-535 Disks up to size 70 GB can be added to this file system.
GPFS: 6027-572 Completed creation of file system /dev/oradata.
mmcrfs: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.

9. Finally, the GPFS cluster is restarted and the file systems are checked, as shown in Example C-9 on page 267.


Example: C-9 Restarting GPFS and verifying file systems

root@dallas1:/> mmshutdown -a
Thu Sep 13 13:56:20 CDT 2007: 6027-1341 mmshutdown: Starting force unmount of GPFS file systems
Thu Sep 13 13:56:25 CDT 2007: 6027-1344 mmshutdown: Shutting down GPFS daemons
dallas1_interconnect: Shutting down!
dallas2_interconnect: Shutting down!
dallas1_interconnect: 'shutdown' command about to kill process 344180
dallas2_interconnect: 'shutdown' command about to kill process 249916
Thu Sep 13 13:56:31 CDT 2007: 6027-1345 mmshutdown: Finished
root@dallas1:/> mmstartup -a
Thu Sep 13 13:56:36 CDT 2007: 6027-1642 mmstartup: Starting GPFS ...
root@dallas1:/> cd /oradata
root@dallas1:/oradata> df -k .
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/oradata     41943040  41851392    1%       14     1% /oradata
root@dallas1:/oradata> cd /orabin
root@dallas1:/orabin> df -k .
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/orabin      20971520  20925440    1%       10     1% /orabin
root@dallas1:/orabin> df
Filesystem    512-blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4          131072     98680   25%     1741    14% /
/dev/hd2         2490368     30360   99%    30823    84% /usr
/dev/hd9var       131072    113008   14%      424     4% /var
/dev/hd3          131072    130200    1%       22     1% /tmp
/dev/hd1          131072    130360    1%        5     1% /home
/proc                  -         -    -         -     -  /proc
/dev/hd10opt      262144     99576   63%     2417    18% /opt
/dev/orabin     41943040  41850880    1%       10     1% /orabin
/dev/oradata    83886080  83702784    1%       14     1% /oradata
root@dallas1:/> chown oracle.dba /oradata /orabin
root@dallas1:/> ls -ld /oradata /orabin
drwxr-xr-x    2 oracle   dba            8192 Sep 13 13:49 /orabin
drwxr-xr-x    2 oracle   dba            8192 Sep 13 13:50 /oradata


Appendix D. Oracle 10g database installation

This appendix presents the Oracle 10g database code installation steps using the Oracle Universal Installer (OUI). We do not create a database, because database creation is beyond the scope of this document. A graphical user interface (GUI) is needed to run OUI. From a GUI terminal, run the installer as the oracle user from the installation directory.

You are asked if rootpre.sh has been run, as shown in Figure 2-3 on page 51. Make sure that you execute Disk1/rootpre/rootpre.sh as the root user on each node, for example as sketched below.
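A minimal sketch of running the script on every node from one place follows; /oracle_install is a hypothetical staging directory, and austin1 and austin2 are the node names used later in this appendix, so substitute your own values:

# Sketch: run rootpre.sh as root on each node before starting the installer.
for NODE in austin1 austin2
do
    ssh ${NODE} /oracle_install/Disk1/rootpre/rootpre.sh
done

The steps are: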


1. At the OUI Welcome window, click Next as shown in Figure D-1.

Figure D-1 Welcome window for database installation


2. There are three types of installations: Enterprise Edition, Standard Edition, and Custom. Select Custom to avoid creating a database (see Figure D-2). Click Next.

Figure D-2 Select the installation type of oracle database


3. Specify the ORACLE_HOME name and destination directory for the database installation as shown in Figure D-3. Click Next.

Figure D-3 Specifying oracle home directory


4. Specify the cluster nodes on which the Oracle database code will be installed (see Figure D-4). If CRS is correctly installed and is up and running, you can select both nodes. If you are unable to select both nodes, correct the CRS configuration and retry the installation process. Click Next.

Figure D-4 Specify cluster installation mode


5. Choose the database components that you are installing from the Available Product Components, as shown in Figure D-5. Click Next.

Figure D-5 Select database components


6. Figure D-6 shows the Product-Specific Prerequisite Checks. The installer verifies that your environment meets the requirements. Click Next.

Figure D-6 Product-specific prerequisite checks


7. Specify the Privileged Operating System Groups. Because the dba group is chosen by default, click Next as shown in Figure D-7.

Figure D-7 Select privileged operating system groups


8. Choose Install database Software only to avoid creating a database at this phase, and then click Next as shown in Figure D-8.

Figure D-8 Select to install database software only


9. On the Summary window shown in Figure D-9, check that the RAC database software and the other selected options are shown and click Install.

Figure D-9 Summary of Oracle database 10g installation selections


10. The Oracle Universal Installer proceeds with the installation on the first node, and then copies the code automatically onto the other selected nodes, as shown in Figure D-10.

Figure D-10 Installing Oracle 10g database


11. Check on which nodes the root.sh script will be run, as shown in Figure D-11.

Figure D-11 Executing configuration scripts

12. As the root user, execute root.sh on each node, as shown in Example 8-2.

Example 8-2 Running configuration scripts on the database

root@austin1:/orabin/ora102> root.sh
Running Oracle10 root.sh script...

The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME=  /orabin/ora102

Enter the full pathname of the local bin directory: [/usr/local/bin]:
Creating /usr/local/bin directory...
   Copying dbhome to /usr/local/bin ...
   Copying oraenv to /usr/local/bin ...
   Copying coraenv to /usr/local/bin ...

Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root.sh script.
Now product-specific root actions will be performed.

root@austin2:/orabin/ora102> root.sh
Running Oracle10 root.sh script...

The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME=  /orabin/ora102

Enter the full pathname of the local bin directory: [/usr/local/bin]:
Creating /usr/local/bin directory...
   Copying dbhome to /usr/local/bin ...
   Copying oraenv to /usr/local/bin ...
   Copying coraenv to /usr/local/bin ...

Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root.sh script.
Now product-specific root actions will be performed.

If there are no problems, the End of Installation window is displayed, as shown in Figure D-12.

Figure D-12 End of database installation

Before you proceed to database creation, we recommend that you apply the latest recommended patch set to the database code.
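One way to confirm afterwards which release and interim patches are recorded in the Oracle home inventory is the OPatch utility shipped with the database code; a minimal sketch, assuming ORACLE_HOME is set in the oracle user's environment:

# Sketch: list the installed products and patches for this Oracle home.
$ORACLE_HOME/OPatch/opatch lsinventory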


Appendix E. How to cleanly remove CRS

This appendix describes how to remove the Oracle CRS software. We found this procedure especially useful if you plan to migrate your cluster from an HACMP-based RAC cluster to a CRS-only RAC cluster. The steps are:

1. Run the following scripts on both nodes:

$ORA_CRS_HOME/install/rootdelete.sh
$ORA_CRS_HOME/install/rootdeinstall.sh

2. If the previous scripts ran successfully, stop the nodeapps on both nodes:

srvctl stop nodeapps -n <nodename>
srvctl stop nodeapps -n austin1
srvctl stop nodeapps -n austin2
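If the CRS stack is still responding, a status check along these lines can confirm that the node applications are really down before you remove the init scripts (austin1 and austin2 are the node names used in this example):

# Sketch: both commands should report that the node applications are not running.
srvctl status nodeapps -n austin1
srvctl status nodeapps -n austin2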

3. In order to prevent CRS from starting when a node starts, run the following commands:

rm /etc/init.cssd
rm /etc/init.crs
rm /etc/init.crsd
rm /etc/init.evmd
rm /etc/rc.d/rc2.d/K96init.crs
rm /etc/rc.d/rc2.d/S96init.crs
rm -Rf /etc/oracle/scls_scr
rm -Rf /etc/oracle/oprocd
rm /etc/oracle/ocr.loc
rm /etc/inittab.crs
cp /etc/inittab.orig /etc/inittab

4. Stop the EVM, CRS, and CSS processes if they are still active:

ps -ef | grep crs
kill <crs pid>
ps -ef | grep evm
kill <evm pid>
ps -ef | grep css
kill <css pid>

5. Deinstall CRS Home with Oracle Universal Installer.


6. Remove the CRS install location if it is not deleted by Oracle Universal Installer:

rm -Rf <CRS Install Location>

7. Clean out all the OCR and Voting Files with dd:

# dd if=/dev/zero of=/dev/votedisk1 bs=8192 count=2560
# dd if=/dev/zero of=/dev/ocrdisk1 bs=8192 count=12800

For details, see also the Oracle Metalink document Removing a Node from a 10g RAC Cluster, Doc ID Note:269320.1, at:

http://metalink.oracle.com

Note: You need an Oracle Metalink ID to access this document.


Abbreviations and acronyms

ACL Access Control List

ACL access control list

AIO Asynchronous I/O

AIX Advanced Interactive Executive

ARP Address Resolution Protocol

ASM Automatic Storage Management

CDT class descriptor table

CLVM Concurrent Logical Volume Manager

CRS Cluster Ready Services

CSS Cluster Synchronization Services

DAC Disk Array Controller

DASD Direct Access Storage Device

DB database

DIO Direct I/O

DLPAR Dynamic LPAR

DLPAR dynamic LPAR

DMAPI Data Management API

DML data manipulation language

DNS Domain Name Services

DNS Domain Name System

DR Disaster Recovery

DR definite response

DR disaster recovery

DS directory services

EMC electromagnetic compatibility

ESS IBM TotalStorage Enterprise Storage Server®

EVM Event management

FAN Fast Application Notification

FC Fibre Channel

FOR file-owning region

FTP File Transfer Protocol

GB gigabyte

GC graphics context

GCD Global Cache Directory

GCS Global Cache Service

GPFS General Parallel File System


GRD Global Resource Directory

GUI Graphical User Interface

HACMP High-Availability Cluster Multi-Processing

HBA Host Bus Adapter

HBA host bus adapter

HMC Hardware Management Console

I/O input/output

IB InfiniBand

IBM International Business Machines Corporation

ID identifier

IEEE Institute of Electrical and Electronics Engineers

ILM Information Lifecycle Management

IP Internet Protocol

ITSO International Technical Support Organization

JFS Journaled File System

JFS journaled file system

KB kilobyte

LACP Link Aggregation Control Protocol

LAN local area network

LP licensed program

LPAR Logical Partition

LPAR logical partition

LUN Logical Unit Number

LUN logical unit number

LV Logical Volume

LV logical volume

LVCB Logical Volume Control Block

LVM Logical Volume Manager

MAC Media Access Control

MAC Medium Access Control

MB megabyte

MPIO Multi-Path I/O

NFS Network File System

NIB Network Interface Backup

NIM Network Installation Management

NIM Network Installation Manager

NIS Network Information Service


NIS Network Information Services

NS network services

NSD Network Shared Disk

OCR Oracle Cluster Registry

OLAP online analytical processing

OLTP On-Line Transaction Processing

OLTP online transaction processing

OS operating system

OUI Oracle Universal Installer

PCI Peripheral Component Interconnect

PCI-X Peripheral Component Interconnect-X

PCM Path Control Module

PID process identifier

PM project manager

POSIX Portable Operating System Interface

PP physical partition

PPRC Peer-to-Peer Remote Copy

PTF Program Temporary Fix

PTF program temporary fix

PV physical volume

PVID Physical Volume Identifier

RAC Real Application Clusters

RAID Redundant Array of Independent Disks

RAM Random Access Memory

RAM random access memory

RDAC Redundant Disk Array Controller

REM ring error monitor

RISC reduced instruction set computer

RMAN Recovery Manager

RSA Rivest (Ron), Shamir (Adi), and Adleman (Leonard)

RSCT Reliable Scalable Clustering Technology

SAN Storage Area Network

SAN storage area network

SAN system area network

SCN System Change Number

SCSI Small Computer System Interface

SDD Subsystem Device Driver

SDDPCM Subsystem Device Driver Path Control Module

SEA Shared Ethernet Adapter

SET Secure Electronic Transaction

SGA System Global Area

SL standard label

SMIT System Management Interface Tool

SMP symmetric multiprocessing

SMT simultaneous multithreading

SPOF Single Point of Failure

SQL Structured Query Language

SS start-stop

SVC SAN Volume Controller

SW special weight

TAF Transparent Application Failover

TB terabyte

TCP Transmission Control Protocol

TCP/IP Transmission Control Protocol/Internet Protocol

TRUE task-related user exit

UDP User Datagram Protocol

UID AIX windows User Interface Definition

UTC Universal Time Coordinated

VG Volume Group

VG volume group

VIO Virtual I/O

VIOS Virtual I/O Server

VIP Virtual IP

VIPA Virtual IP Address

VMS Voice Message Service

WAN wide area network

WWNN worldwide node name


Related publications

The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this book.

IBM Redbooks publications

For information about ordering these publications, see “How to get IBM Redbooks publications” on page 287. Note that some of the documents referenced here might be available in softcopy only:

- Advanced POWER Virtualization on IBM System p5: Introduction and Configuration, SG24-7940

Other publications

These publications are also relevant as further information sources:

- GPFS V3.1 Concepts, Planning, and Installation Guide, GA76-0413

- GPFS V3.1 Advanced Administration Guide, SC23-5182

- GPFS V3.1 Administration and Programming Reference, SA23-2221

- GPFS V3.1 Problem Determination Guide, GA76-0415-00

Online resources

These Web sites are also relevant as further information sources:

- Oracle articles about changing OCR and CRS voting disks to raw devices:

  http://www.oracle.com/technology/pub/articles/vallath-nodes.html
  http://www.oracle.com/technology/pub/articles/chan_sing2rac_install.html

- Oracle Knowledge Base (Metalink)

http://metalink.oracle.com

How to get IBM Redbooks publications

You can search for, view, or download IBM Redbooks publications, Redpapers, Technotes, draft publications and Additional materials, as well as order hardcopy IBM Redbooks publications, at this Web site:

ibm.com/redbooks

Note: You need an Oracle Metalink ID to access the Knowledge Base.


Help from IBM

IBM Support and downloads

ibm.com/support

IBM Global Services

ibm.com/services


Index

Symbols. 104

Aaffected node 140–141, 143, 165, 215–216, 264–266AIX 5.2

Configuring Network Interface Backup 256explicit mode 256Network Interface Backup Mode 256

Allocation Map 165, 216, 266alter database 109, 113–114, 126, 128, 201

default temporary tablespace 127alter tablespace

example offline 136example online 136sysaux offline 136Temporary 127users offline 136users online 136

Alternate Address 231, 255asynchronous process 140–141, 143, 165, 215–216, 264–266aust in1_interconnect 160, 163–164, 174

Bbackup adapter 229, 231–232, 234, 256–257backup process 203

Cchannel ORA_DISK_1 125–126chown oracle.dba 172, 267client partition 227, 229–230, 235, 237, 244, 250

root volume group 237virtual SCSI server adapter 244volume group 250

Cloning database 201cluster configuration

data 140–141, 143, 162, 165, 215–216manager 174manager role 177

cluster configuration data 140–141, 143clusterType lc 142, 161–162CMUC00155I rmpprc 193commands

cfgmgr 192chown 169date 174

control file 109, 125, 133CRS installation 19, 130CRS voting

disc 144, 153, 166–167, 183, 187–188, 191, 194disk device 146


disks configuration 171CRS voting disc 144, 148crs_stat 114, 120, 122, 133css votedisk 144, 149, 172

Ddata block 111, 165, 205data file 104, 106, 202, 222

same process 136Data Metadata 198data partitioning 213database backup 201–202database file 105–106, 187, 200–203, 205datafile copy 126dd 123, 133–136dd command 133–137

different options 134deletion 211destination UDP 256–257device name 197, 210device subtype 134Device/File Name 145, 148disaster recovery

data storage 207Metro Mirror 156

disaster recovery (DR) 151, 153, 155–156, 166, 173, 186, 207, 241–242Disk descriptor file 215disk descriptor file 214, 216, 263disk failure 162, 178–183

file system 180Disk file 163dscli 187–188dscli > lspprc 190DSCLI commands

failbackpprc 194failoverpprc 194lsfbvol 188lspprc 191, 194lspprcpath 189mkpprc 189, 193pausepprc 190, 194rmpprc 193

DSCLi commandslssi 188

Eenhanced concurrent mode (ECM) 123entry field 168, 170, 231etc/oratab file 280EtherChannel 20, 154, 160, 227, 229–231, 255–257EtherChannel interface 255, 257exclusion 211


FFailover 106, 115–116, 186, 188, 192, 229–232, 236Fibre Channel

attachment 158, 178connection 154, 236

Fibre Channel (FC) 228, 235–236file system 104–107, 123, 129, 156, 158–159, 162, 196–200, 264–266

Completed creation 266device name 197replicated files 209system storage pool 207

fileset name 223Filesets 207filesets 209free disc 215, 241free inodes 220, 222

GGeneral Parallel File System (GPFS) 19, 103–107, 163, 165, 195–199Global Cache Directory (GCD) 111GPFS 3.1

release 207storage pool 212

GPFS cluster 104–107, 138, 153, 157–158, 160–161, 174, 181, 263, 266GPFS code 106, 140, 142GPFS commands

mmapplypolicy 209–210, 220mmbackup 196mmchattr 208mmchpolicy 210mmcommon 214mmcrnsd 215mmcrsnapshot 196mmdelsnapshot 197, 199, 204mmdf 208, 219mmdsh 214mmfsadm dump cfgmgr 175mmgetstate 192mmlsattr 208, 223mmlscluster 214mmlsdisk 191mmlsfs 208mmlsnsnapshot 197mmlspolicy 210mmrestorefs 197, 199mmrestripefile 209mmrestripefs 208mmsnapdir 197–198

GPFS filesystem 105, 199system layer 183

GPFS file system 105–106, 139, 162, 164, 166, 174–175, 178, 181, 183, 196, 199, 201–202

layer 183namespace 209subtree 210

GPFS policy 209, 212, 220–221command 221file 221

GPFS replication 162GPFS snapshot 151, 196–199, 201, 204

file system 197GPFS snapshots commands 196GPFS V3.1

Administration 199, 224Concept 104, 140

GPFSMIG 114–115graphical user interface (GUI) 269

HHash Mode 256–257host alamo1 259Hostname list 168

IIBM DSCLI (ID) 188, 190–191, 193IEEE 802.3ad

Link Aggregation configuration 256Information Lifecycle Management (ILM) 206, 212Inode File 165, 196, 216, 266interface number 243Internet Address 231, 256–257IP address 230–231, 251, 256–257IP label 160IP traffic 256–257

Jjumbo frame 255

LLink Aggregation 229, 231, 255–256

Adapter 255Control Protocol 256

link aggregationinterface 243

logical volumecontrol block 134first block 123, 134manager 183

logical volume (LV) 123, 134–135, 235, 237, 242, 250long wave (LW) 187LPAR 20, 242, 247, 249–250, 252LUN mapping 179, 183LUN size 144, 190LUNs 107, 144–145, 162, 183, 186–188, 191, 194, 234–236LVM mirroring 227, 237–238

MMetro Mirror 185migration 211mklv 123, 133–135mmcrcluster 264


mmcrnsd command 164mmlssnapshot oradata 198, 203ms 233

NNetwork Shared Disk (NSD) 158, 163–165, 178NFS version 168node austin1 165, 174, 183November 22 188NS directory 120, 130NSDs 105, 141, 164, 187, 191, 264–265

OOCR location 120Oracle 10g database 279

Check 280Oracle Cluster

Registry 120, 130, 145, 148Repository 187

Oracle Clusterware10.2.0.2 167configuration 117control 114full installation 106home directory 117new instance 108, 114split brain resolution 229

Oracle clusterware 103, 106–108, 114, 117, 154, 166, 217, 229, 243, 252Oracle code 104, 106, 108Oracle commands

srvctl 202Oracle CRS

installation 101software 283voting 153

Oracle data partitioning 206, 212Oracle Database

consistent backup 201I/O operations 206partitioned objects 217

oracle database 103, 109, 125, 131, 150, 199–201, 205–206, 271, 273, 278oracle home

directory 132, 272Oracle instance 116, 125, 128, 135, 137, 201Oracle Inventory 104–106, 121Oracle Metalink

doc Id 108Id 108, 284

Oracle RAC 20, 103, 106, 108, 110, 114, 160, 166, 186, 192, 194, 206, 227–228, 230

code 106configuration 227high availability 228–229instance 229solution 227

Oracle Universal Installer (OUI) 106–108, 117–118, 269, 279, 283–284

oracle user 105, 107, 132, 139, 167, 269oracle/ora102/lib >

cd 131ln 131

ORACLE_HOME name 272

Ppartitioning method

composite range-hash 212composite range-list 212hash 212list 212range 212

partitioning methods 212Path Control Module (PCM) 235Peer to Peer Remote Copy (PPRC) 185, 187–190placement 211Policies 207policies 209Policy commands 209policy file 209policy rule 209–211, 220

File Attributes 211pool total 219, 222, 224PPRC

Failover 194PPRC recovery 190PPRC relation 189–190, 192–194previous CRS 129–130primary adapter 229, 231, 234

link aggregation adapter 229

RRAC data

dictionary view 108, 122RAC environment 103–104, 127, 137, 243

log files 127RAC w 133random access memory (RAM) 20raw device 103–104, 106, 123, 166, 187

data files 125raw partition 103–104, 106, 123, 144Redbooks Web site 287

Contact us xiiRemote Mirror 189–191, 193–194remotedev Ibm 191, 193RMAN commands

copy datafile 126restore controlfile 125switch database 126

root user 139, 260, 262, 269rsh 28rules 207, 209

SSAN connection 235, 243select failover_type 116select file_name 112, 127


select instance_number 116, 122separate tablespaces 212, 218Shared Ethernet Adapter (SEA) 229, 243, 251, 255single point 154, 157, 227, 235–236, 238, 242Single Point of Failure (SPOF) 154, 235size 10485760 KB 216, 266size 50M 113snapshot files 201spfile 108, 110, 122, 125SQL commands

alter database 205alter system 206create index 219create table 218create tablespace 217drop tablespace 127select database_status 206

sqlplus 109, 116ssh 28startup nomount 125Storage Area Network (SAN) 154, 185, 234–236storage B 187, 190–191, 193Storage pool

file placement 208GPFS file system 213, 216newly created files 209

storage pool 151, 165, 206–207, 210, 212storage pools 207, 213storage subsystem 107, 154–155, 158, 162, 185, 187–188, 191–192, 199

original mapping 192System Change Number (SCN) 200, 205System storage pool 207

Ttablespaces 206, 212, 217tar cfv 204TCP traffic 256–257test environment 20, 104, 121, 123, 135, 243, 255

raw devices 123third node 153, 158–160, 162–163

free internal disk 163inappropriate error messages 162internal SCSI disk 178NFS server 169

Thu Sep 20 174, 180Thu Sep 27 125, 203, 205Transparent Application Failover (TAF) 106, 183ttl 233

Uuser storage pools 222

VVIO server 227, 244–246, 249, 252

Configuring virtual resources 249Virtual I/O Server

partition 227

unavailability 229virtual I/O server

external storage 234Virtual I/O Server (VIOS) 229–232, 243virtual IO server

dual HBA 236Voting disc 144, 148, 166, 171–172, 183voting disc

NFS clients 169




SG24-7541-00 ISBN 0738485837

INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION

BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE

IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.

For more information: ibm.com/redbooks


Deploying Oracle 10g RAC on AIX V5 with GPFS

Understand clustering layers that help harden your configuration

Learn System p virtualization and advanced GPFS features

Deploy disaster recovery and test scenarios

This IBM Redbooks publication helps you architect, install, tailor, and configure Oracle 10g RAC on System p™ clusters running AIX®. We describe the architecture and how to design, plan, and implement a highly available infrastructure for Oracle database using IBM General Parallel File System (GPFS) V3.1.

This book gives a broad understanding of how Oracle 10g RAC can use and benefit from the virtualization facilities embedded in the System p architecture, and how to efficiently use the tremendous computing power and availability characteristics of the POWER5 hardware and the AIX 5L operating system.

This book also helps you design and create a solution to migrate your existing Oracle 9i RAC configurations to Oracle 10g RAC, simplifying configurations and making them easier to administer and more resilient to failures.

This book also describes how to quickly deploy Oracle 10g RAC test environments and how to use some of the built-in disaster recovery capabilities of IBM GPFS and storage subsystems to make your cluster resilient to various failures.

This book is intended for anyone planning to architect, install, tailor, and configure Oracle 10g RAC on System p™ clusters running AIX and GPFS.

Back cover