CRS & RAC Troubleshooting
Krishnadev Telikicherla
Cluster & Parallel Storage Technology, Oracle Corporation

Topics:

• Defining the Issue
• Creating a Timeline
• Hang or Slowdown
• Performance Issues
• Gathering Data
• Testcases
• Rediscovery
• Engaging Oracle Support
• Examples

Defining the Issue: Layers

• What layers are involved in the issue:
  • Oracle Clusterware
    • CRS daemon
    • CSS daemon
    • HangCheckTimer (Linux) / Oprocd (non-Linux)
    • EVM
    • OCR
    • Voting disk
  • General RDBMS
  • Operating System
  • Hardware

Defining the Issue: Cause vs. Effects

• Causes:
  - Resource issues
  - Oracle issues
  - OS issues
• Effects:
  - Hangs/Spins
  - Instance Crashes and Evictions
  - Node Reboots and Evictions
  - Oracle Errors (ORA-600, ORA-7445, ORA-29740)

Defining the Issue: Description

• When describing the problem while creating the SR via Metalink, use phrases that will help identify known issues in bugs or Metalink content.
• In the body of the SR, be as detailed as possible about the environment.
• Nobody knows the system better than you.
• Talk to the sys-admin as well regarding OS/network-related issues.

Creating a Timeline

• A timeline helps identify the times to concentrate on when reviewing files.
• Support can build a timeline by reviewing the files once they are provided, but doing so slows resolution.
• Timelines should order causes and effects and include all participating nodes.
• Include specific times, e.g.:
  - At 3:00am PST we noticed that node2 was hanging.
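One way to assemble such a timeline is to merge the observations from every node into a single list ordered by time. The following is a minimal Python sketch, not a prescribed tool: the node names, timestamps and event texts are made up for illustration, and it assumes all timestamps use the same time zone and that the node clocks are synchronized.

```python
from datetime import datetime

# Hypothetical observations per node; names, times and messages are
# illustrative only and do not come from a real incident.
observations = {
    "node1": [
        ("2006-03-14 03:02:10", "instance eviction noticed"),
        ("2006-03-14 03:05:00", "node rebooted"),
    ],
    "node2": [
        ("2006-03-14 03:00:00", "node2 noticed hanging"),
        ("2006-03-14 03:01:30", "application sessions stop responding"),
    ],
}

def build_timeline(per_node):
    """Merge per-node events into one list ordered by timestamp."""
    merged = []
    for node, events in per_node.items():
        for stamp, text in events:
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            merged.append((when, node, text))
    return sorted(merged)

timeline = build_timeline(observations)
for when, node, text in timeline:
    print(f"{when:%Y-%m-%d %H:%M:%S}  {node}  {text}")
```

The resulting list can be pasted into the SR so that cause and effect across nodes are visible at a glance.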

Hang or Slowdown

• Differentiate between a database hang and a database slowdown
• Identify the extent of a hang

Is it a Hang or a Slowdown?

• Check:
  - System states, to see if there is any change over a short period of time
  - V$SESSION_WAIT where wait_time=0
  - Overall machine load, including CPU, memory, swap, I/O
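The "any change over a short period of time" check can be sketched as a comparison of two snapshots. The Python below is only a heuristic illustration: it assumes each snapshot was built from V$SESSION_WAIT rows with wait_time=0, keyed by SID and holding the EVENT and SEQ# columns (SEQ# is expected to increase with each new wait). The session ids and values are invented sample data, and the function name is hypothetical.

```python
def stuck_sessions(first, second):
    """Return sids whose wait (event and seq) is identical in both snapshots."""
    return sorted(sid for sid, wait in first.items() if second.get(sid) == wait)

# Made-up sample snapshots taken a short time apart: sid -> (event, seq).
snap_t0 = {
    101: ("enq: TX - row lock contention", 57),
    102: ("db file sequential read", 900),
    103: ("gc buffer busy", 12),
}
snap_t1 = {
    101: ("enq: TX - row lock contention", 57),
    102: ("db file sequential read", 1450),
    103: ("gc buffer busy", 12),
}

stuck = stuck_sessions(snap_t0, snap_t1)
print("sessions with no progress between snapshots:", stuck)
```

Sessions that keep advancing, such as 102 in the sample, point to a slowdown; sessions that never move point to a hang and are candidates for closer inspection.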

Is it a Hang or a Slowdown?

• Single or multiprocess hang:
  - Usually characterized by a particular job hanging or not completing
  - Essentially the same as in single instance unless it’s internode parallel query
• Instance hang: A single instance is unusable.
• Multi-instance or full database hang: Entire database is hung or not responding.

Performance

• Single process or statement
• Instance
• Multi-Instance

Single Process or Single Statement

• Find the wait event
• 10046 level 12
  - oradebug setorapid <Oracle PID>
  - oradebug event 10046 trace name context forever, level 12
  - oradebug tracefile_name
• Explain plan
• 10053 if plan problems are found
• V$SESSTAT
• Truss/trace/dbx/pstack if OS-related problems are suspected
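The 10046 sequence above can be generated for a given process so it is typed the same way every time. This is a small Python sketch under stated assumptions: the function name is hypothetical, the Oracle PID 23 is a placeholder, and the output is meant to be pasted into a SQL*Plus session connected as SYSDBA rather than executed by the script itself.

```python
def oradebug_10046_commands(oracle_pid, level=12):
    """Return the oradebug commands, in order, for tracing one Oracle process."""
    return [
        f"oradebug setorapid {oracle_pid}",
        f"oradebug event 10046 trace name context forever, level {level}",
        "oradebug tracefile_name",
    ]

# 23 is a placeholder Oracle PID, not a value from a real system.
commands = oradebug_10046_commands(23)
print("\n".join(commands))
```

The last command reports the trace file name, which is the file to upload with the SR.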

Instance Slowdown

• Statspack / AWR
• OS performance statistics: CPU, memory, and I/O
• Characteristics:
  - Related to a particular job?
  - Certain time of day?
  - What’s changed?

Multi-Instance Slowdowns

• AWR from each node can be of use:
  - AWR collects instance-specific data
  - Examine and correlate the reports

Multi-Instance Slowdowns

• In cases of extreme slowdowns:
  - Systemstates on all nodes
  - V$SESSION_WAIT
  - Alert logs and any trace files
  - Process states, or stack traces if determined and applicable

Debugging Techniques

• v$session_wait
• System states from all nodes
• 10046 level 12 trace of the hung process
• ORADEBUG
• Lock layer and DLM tracing
• Get any traces:
  - DLM traces
  - Background processes, alert logs, and init.ora
  - User traces

Debugging and Diagnostics

• Performance issues or hangs:
  - Identify the resource being requested.
  - Identify who holds the resource.
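Once each waiter has been matched to the holder of the resource it requests, the holders at the end of each chain are the ones to investigate first. The Python sketch below illustrates that idea only; it assumes every waiting session waits on exactly one holder, the session ids are invented, and the function name is hypothetical. Sessions that are part of a cycle (a deadlock) are not reported as roots.

```python
def root_blockers(waits_on):
    """Follow waiter -> holder links and return the sessions ending each chain."""
    roots = set()
    for start in waits_on:
        seen = set()
        current = start
        # Walk up the chain until reaching a session that waits on nobody,
        # or until a session repeats (which means the chain is a cycle).
        while current in waits_on and current not in seen:
            seen.add(current)
            current = waits_on[current]
        if current not in waits_on:
            roots.add(current)
    return sorted(roots)

# Made-up sample: waiter sid -> holder sid.
sample = {201: 150, 202: 201, 203: 150, 204: 99}
blockers = root_blockers(sample)
print("root blockers:", blockers)
```

In the sample, session 202 waits on 201, which in turn waits on 150, so 150 is the session whose process state or stack is most worth collecting.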

ORADEBUG and Tools

• Hang analyze:
  - hanganalyze <level>
• Note: 301137.1 - OS Watcher User Guide
• Note: 135714.1 - Script to Collect RAC Diagnostic Information (diagcollection.pl)

Gathering Data: Best Practices

• Single most important step
• There is never too much data, but including lots of useless data increases both the download time and the time needed to process the data.
• Always err on the side of getting too much data, but be aware of the impact on the resolution time.
• Too little data increases resolution time more than too much data.
• Always include a readme.txt file that explains the contents of the provided files.
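The readme.txt can be produced from the upload directory itself so that no file goes undescribed. The following Python sketch is one possible approach, not an Oracle-supplied tool: the function name, the file names and the descriptions are all hypothetical, and the demonstration runs in a temporary directory.

```python
import os
import tempfile

def write_readme(directory, descriptions):
    """Write readme.txt listing every file in `directory` with its description."""
    lines = ["Contents of this upload:", ""]
    for name in sorted(os.listdir(directory)):
        if name == "readme.txt":
            continue
        # Flag files nobody described so they can be explained or dropped.
        note = descriptions.get(name, "NO DESCRIPTION PROVIDED")
        lines.append(f"{name}: {note}")
    path = os.path.join(directory, "readme.txt")
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return path

# Demonstration with hypothetical file names in a temporary directory.
demo_dir = tempfile.mkdtemp()
for name in ("alert_node1.log", "systemstate_node2.trc"):
    open(os.path.join(demo_dir, name), "w").close()

readme_path = write_readme(
    demo_dir, {"alert_node1.log": "alert log from node1 covering the hang"}
)
with open(readme_path) as handle:
    readme_text = handle.read()
print(readme_text)
```

Files flagged as undescribed are a prompt either to explain them or to leave them out, which keeps the upload from filling with useless data.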

Gathering Data: Processes

• Always get stacks from processes that seem to be spinning, hanging or unresponsive:
  - oradebug
  - gdb
  - pstack
• ps and top info can be very useful when trying to determine whether a process exhibits issues such as memory leaks, spinning or hanging.

Gathering Data: RAC

• For instance evictions, please review Metalink note 219361.1
• See Metalink note 203226.1: RAC Survival Kit: Real Application Clusters Troubleshooting and Information
• See Metalink note 289690.1: Data Gathering for Troubleshooting RAC and CRS issues

Gathering Data: Tools

• RDA: system and Oracle configuration information
• racdiag: modifiable SQL script for gathering RAC data. See Metalink note 135714.1, “Script to Collect RAC Diagnostic Information”
• OSW: OS Watcher gathers top, slabinfo, netstat and ps data over programmable intervals. See Metalink note 301137.1, “OS Watcher User Guide”

Gathering Data: CRS 10.2.0.x (continued)

• CRS and other resource issues:
  - ORA_CRS_HOME
    • log/<hostname>/cssd/oclsmon
    • log/<hostname>/cssd
    • log/<hostname>/client
    • log/<hostname>/crsd
    • log/<hostname>/evmd
    • log/<hostname>/racg
  - ORACLE_HOME (rdbms)
    • racg/dump
    • ORACLE_BASE/<db_name>/hdump
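The ORA_CRS_HOME log locations listed above can be expanded into full paths for a given node before collecting them. This Python sketch is illustrative only: the CRS home `/u01/crs` and the host name `node1` are placeholder values, the function names are hypothetical, and Unix-style paths are assumed.

```python
import os
import posixpath

# Sub-directories of $ORA_CRS_HOME/log/<hostname> named on the slide.
CRS_LOG_SUBDIRS = ["cssd/oclsmon", "cssd", "client", "crsd", "evmd", "racg"]

def crs_log_dirs(crs_home, hostname):
    """Return the full path of each CRS log directory for one node."""
    return [posixpath.join(crs_home, "log", hostname, sub) for sub in CRS_LOG_SUBDIRS]

def missing_dirs(paths):
    """Return the paths that do not exist on this machine."""
    return [path for path in paths if not os.path.isdir(path)]

# Placeholder CRS home and host name, not values from a real cluster.
paths = crs_log_dirs("/u01/crs", "node1")
for path in paths:
    print(path)
print("not found here:", len(missing_dirs(paths)))
```

Listing the directories that are missing is a quick sanity check that ORA_CRS_HOME and the host name were set correctly before any files are copied.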

Gathering Data: Tools (continued)

• Starting with 10.2.0.1, $ORA_CRS_HOME/bin/diagcollection.pl collects all RAC-relevant files (run as root)

    oracle10@stnsp010> ./diagcollection.pl
    Production Copyright 2004, 2005, Oracle. All rights reserved
    Cluster Ready Services (CRS) diagnostic collection tool
    diagcollection
        --collect
            [--crs] For collecting crs diag information
            [--oh]  For collecting oracle home diag information
            [--ob]  For collecting oracle base diag information
            [--all] Default. For collecting all diag information
        NOTE: 1. You can also do the following
                 ./diagcollection.pl --collect --crs --oh
              2. ORA_CRS_HOME, ORACLE_HOME and ORACLE_BASE env variables
                 need to be set.
        --clean        cleans up the diagnosability
                       information gathered by this script
        --coreanalyze  extracts information from core files
                       and stores it in a text file

Testcases

• Not always feasible
• If provided, can greatly influence resolution time
• When providing a testcase:
  - Include a readme file
  - Try to strip the testcase down to the minimal elements that are needed to reproduce the problem
• If at all possible, always try to build a testcase
• Testcases are your friends!

Rediscovery

• Expensive for a support organization
• Issue rediscovery is not always obvious
• Use Metalink to identify possible causes for issues as well as workarounds and patch availability
• Communicate new issues between DBAs

Engaging Oracle Support

• Try to be responsive to all TARs when they are set to CUS status. Delays inherently cause two problems:
  1. The issue loses momentum
  2. A new engineer may have to take over the issue

Examples

• 10.2.0.2 HP-UX/Itanium ServiceGuard, CRS, CFS and RAC
• Delays in reconfiguration

Examples

• 10.2.0.2 Linux CRS, RAC and ASM
• ORA-600[2103] and one instance crashed

Questions?