ASM Troubleshooting Overview



  • 8/13/2019 ASM Troubleshooting Overview


    ASM Troubleshooting

    Yahoo! August 2009

    Kevin Moore, Technical Lead, Advanced Customer Services


    Advanced Customer Services

    ASM L & L Topics

    1. ASM init.ora Parameters

    2. ASM Alert log messages

    3. Yahoo! Alert Log & messages

    4. ASM data Gathering

    5. Troubleshooting Scenarios

    6. Instance Events

    7. Instance Tracing

    8. ASM Rebalancing operations

    9. ASM Extent Management

    10. Performance Considerations

    11. ASM Templates

    12. Background Processes

    13. ASM Views

    14. ASMCMD Commands

    15. New 11g Commands

    16. ASM MySupport Documents


    ASM initSID.ora

    ##############################################################################

    # Copyright (c) 1991, 2001, 2002 by Oracle Corporation

    ##############################################################################

    ###########################################

    # Cluster Database

    ###########################################

    cluster_database=true

    ###########################################

    # Miscellaneous

    ###########################################

    diagnostic_dest=/home/oracle

    instance_type=asm

    ###########################################

    # Pools

    ###########################################

    large_pool_size=12M

    asm_diskgroups='DATA'

    +ASM2.instance_number=2

    +ASM1.instance_number=1
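The file above mixes entries that apply to every instance (`cluster_database`, `asm_diskgroups`) with entries prefixed by an instance SID (`+ASM1.instance_number`). As a rough illustration of how those prefixes resolve, here is a minimal Python sketch; `effective_params` is a hypothetical helper, and it assumes a simple one-`name=value`-per-line pfile (no `#` inside quoted values, no continuation lines, no `IFILE` includes).

```python
def effective_params(pfile_text: str, sid: str) -> dict[str, str]:
    """Return the settings that apply to instance `sid` (simplified pfile parsing)."""
    shared: dict[str, str] = {}
    specific: dict[str, str] = {}
    for raw in pfile_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and banner lines
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        prefix, dot, name = key.rpartition(".")
        if not dot or prefix == "*":
            shared[name.lower()] = value      # applies to every instance
        elif prefix == sid:
            specific[name.lower()] = value    # applies to this SID only
    return {**shared, **specific}             # SID-specific entries win
```

For the file above, `effective_params(text, "+ASM2")["instance_number"]` yields `'2'`, while both instances see the same `asm_diskgroups`.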


    ASM Alert Log

    Mon Aug 24 15:14:10 2009

    Starting ORACLE instance (normal)

    LICENSE_MAX_SESSION = 0

    LICENSE_SESSIONS_WARNING = 0

    Interface type 1 eth0 192.168.1.0 configured from OCR for use as a cluster interconnect

    Interface type 1 eth1 67.0.0.0 configured from OCR for use as a public interface

    Picked latch-free SCN scheme 2

    Using LOG_ARCHIVE_DEST_1 parameter default value as /home/oracle/oracle/product/11.1.0/rdbms/dbs/arch

    Autotune of undo retention is turned on.

    LICENSE_MAX_USERS = 0

    SYS auditing is disabled

    Starting up ORACLE RDBMS Version: 11.1.0.6.0.

    Using parameter settings in server-side pfile /home/oracle/oracle/product/11.1.0/rdbms/dbs/init+ASM1.ora

    System parameters with non-default values:

    large_pool_size = 12M

    instance_type = "asm"

    cluster_database = TRUE

    instance_number = 1

    asm_diskgroups = "DATA"

    diagnostic_dest = "/home/oracle"

    Cluster communication is configured to use the following interface(s) for this instance

    192.168.1.161

    cluster interconnect IPC version:Oracle UDP/IP (generic)

    IPC Vendor 1 proto 2
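The "System parameters with non-default values" block is a quick way to compare what each ASM instance actually started with. A minimal sketch, assuming the block is a run of `name = value` lines ended by the first line that is not one; `non_default_params` is a hypothetical helper, and it keeps only the most recent startup in the log.

```python
import re

# One "name = value" line inside the non-default parameter block
PARAM_LINE = re.compile(r'^\s*([A-Za-z_][\w$#]*)\s*=\s*(.+?)\s*$')

def non_default_params(alert_text: str) -> dict[str, str]:
    """Return the non-default parameters reported at the last startup."""
    params: dict[str, str] = {}
    in_block = False
    for line in alert_text.splitlines():
        if line.strip().startswith("System parameters with non-default values"):
            in_block, params = True, {}   # a newer startup replaces older values
            continue
        if not in_block or not line.strip():
            continue
        m = PARAM_LINE.match(line)
        if m is None:
            in_block = False              # first non-parameter line ends the block
            continue
        params[m.group(1)] = m.group(2).strip('"')
    return params
```

Running it over the alert logs of `+ASM1` and `+ASM2` and comparing the two dictionaries highlights settings that differ between instances.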


    Yahoo! ASM Alert Log

    Sun May 4 00:19:05 2008

    kjbdomatt send to node 0 * One line for each node *

    kjbdomatt send to node 1

    kjbdomatt send to node 2

    NOTE: F1X0 found on disk 0 fcn 0.0

    NOTE: cache opening disk 1 of grp 2: DISK116 label:DISK116 * One line for each disk *

    NOTE: cache opening disk 2 of grp 2: DISK117 label:DISK117

    NOTE: attached to recovery domain 2

    Sun May 4 00:19:14 2008

    NOTE: recovering COD for group 1/0x8ccb7277 (DATA) * COD (Continuing Operations Directory): metadata for tracking long-running operations *

    SUCCESS: completed COD recovery for group 1/0x8ccb7277 (DATA)

    Sun May 4 00:19:14 2008

    NOTE: opening chunk 14 at fcn 0.0 ABA

    NOTE: seq=2 blk=0

    Sun May 4 00:19:14 2008

    NOTE: cache mounting group 2/0x8CDB7278 (TEMP) succeeded

    SUCCESS: diskgroup TEMP was mounted

    Sun May 4 00:19:17 2008

    NOTE: recovering COD for group 2/0x8cdb7278 (TEMP)

    SUCCESS: completed COD recovery for group 2/0x8cdb7278 (TEMP)

    NOTE: enlarging ACD for group 1/0x8ccb7277 (DATA)

    Sun May 4 00:21:10 2008

    SUCCESS: ACD enlarged for group 1/0x8ccb7277 (DATA) * ACD (Active Change Directory): metadata redo *

    NOTE: enlarging ACD for group 2/0x8cdb7278 (TEMP)

    SUCCESS: ACD enlarged for group 2/0x8cdb7278 (TEMP)
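Messages for the same disk group are easy to lose among the interleaved node and disk lines. A small sketch that groups them by the `group N/0xINCARNATION (NAME)` reference; `events_by_group` is a hypothetical helper, and it lower-cases the incarnation because the log prints it in both cases (`0x8CDB7278` and `0x8cdb7278`).

```python
import re
from collections import defaultdict

# Matches references such as "group 1/0x8ccb7277 (DATA)"
GROUP_REF = re.compile(r'group (\d+)/0x([0-9a-fA-F]+) \((\w+)\)')

def events_by_group(alert_text: str) -> dict[tuple[int, str, str], list[str]]:
    """Map (group number, incarnation, name) to the log lines that mention it."""
    events: defaultdict[tuple[int, str, str], list[str]] = defaultdict(list)
    for line in alert_text.splitlines():
        m = GROUP_REF.search(line)
        if m:
            key = (int(m.group(1)), m.group(2).lower(), m.group(3))
            events[key].append(line.strip())
    return dict(events)
```

Lines that name a group only by its name (for example `SUCCESS: diskgroup TEMP was mounted`) are not captured by this pattern.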


    ASM Data gathering

    Please gather all files from the ASM bdump and udump directories covering the specified time frame of the problem, and be sure to include the alert logs for ALL ASM instances. For hang/performance issues, please also gather system state dumps from the ASM instances.

    Please use the script below to query the ASM views, and provide the spooled output (from each instance):

    set newpage none
    set feedback off
    set heading off
    set termout off
    column grp format 99
    column disk format 99999
    column lxn format 999
    column flg format 999
    column chk format 999
    spool asm
    select group_number as grp, name, state, type, total_mb, free_mb from v$asm_diskgroup;
    select rpad('>', 10, '>'), to_char(sysdate, 'MON DD HH24:MI:SS') from dual;
    select group_kfdat, number_kfdat, aunum_kfdat, v_kfdat, fnum_kfdat, i_kfdat, xnum_kfdat, raw_kfdat from x$kfdat;
    select rpad('>', 10, '>'), to_char(sysdate, 'MON DD HH24:MI:SS') from dual;
    select grp, disk, NUMBER_KFDPARTNER, PARITY_KFDPARTNER, ACTIVE_KFDPARTNER from x$kfdpartner;
    select rpad('>', 10, '>'), to_char(sysdate, 'MON DD HH24:MI:SS') from dual;
    select group_kffxp as grp, number_kffxp as num, incarn_kffxp as incarn, PXN_KFFXP, XNUM_KFFXP, LXN_KFFXP as lxn, DISK_KFFXP as disk, AU_KFFXP, FLAGS_KFFXP as flg, CHK_KFFXP as chk from x$kffxp;
    select rpad('>', 10, '>'), to_char(sysdate, 'MON DD HH24:MI:SS') from dual;
    set linesize 1500
    select GROUP_NUMBER, DISK_NUMBER, INCARNATION, MOUNT_STATUS, HEADER_STATUS, MODE_STATUS, STATE, LIBRARY, TOTAL_MB, FREE_MB, NAME, FAILGROUP, LABEL, PATH, CREATE_DATE, MOUNT_DATE, READS, WRITES, READ_ERRS, WRITE_ERRS, READ_TIME, WRITE_TIME, BYTES_READ, BYTES_WRITTEN from v$asm_disk;
    spool off
    exit


    ASM Troubleshooting Scenarios

    ASM space issues

    1. ASM-level errors: ORA-15041, ORA-15047

    2. RDBMS level errors when storage is on ASM

    3. Inconsistencies in what is perceived as the available space

    4. Inconsistencies between V$ASM_DISKGROUP and X$ views

    Note #351117.1 - Information to gather when diagnosing ASM space issues (contains scripts for collecting specific ASM information)
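Much of the "perceived available space" confusion comes from mirroring: FREE_MB in V$ASM_DISKGROUP is raw space. From 10g Release 2 the view also exposes REQUIRED_MIRROR_FREE_MB and USABLE_FILE_MB, and Oracle documents USABLE_FILE_MB as (FREE_MB - REQUIRED_MIRROR_FREE_MB) divided by 2 for normal redundancy and by 3 for high redundancy. The sketch below reproduces that arithmetic so it can be checked against the view; `usable_file_mb` is a hypothetical name.

```python
# Copies kept per extent for each V$ASM_DISKGROUP.TYPE value
REDUNDANCY_COPIES = {"EXTERN": 1, "NORMAL": 2, "HIGH": 3}

def usable_file_mb(dg_type: str, free_mb: int, required_mirror_free_mb: int = 0) -> float:
    """Space (MB) that new files can safely use after reserving the room needed
    to restore redundancy following a disk failure. May be negative."""
    copies = REDUNDANCY_COPIES[dg_type.upper()]
    return (free_mb - required_mirror_free_mb) / copies
```

A negative result means the group could not restore full redundancy after losing a disk, even though FREE_MB may still look healthy.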


    ASM Troubleshooting Scenarios

    ASM Disk Missing

    1. Use OS utilities to determine which disk cannot be found

    Running truss or strace against the RBAL process while selecting * from v$asm_disk can often show errors in the path of the command (for example, a device that cannot be found)

    SESSION #1

    strace -f -o /tmp/rbal.trc -p <pid of asm_rbal_+ASM1>     (Linux)

    truss -ef -o /tmp/rbal.out -p <pid of asm_rbal_+ASM1>     (AIX/Solaris)

    SESSION #2

    select * from v$asm_disk;

    SESSION #3

    tailf /tmp/rbal.trc

    Examine the rbal.out for errors:

    1147090: 1871929: chdir("dev/") = 0

    1147090: 1871929: statx("rhdisk8", 0x0FFFFFFFFFFFAA80, 176, 010) Err#2 ENOENT

    This says that rhdisk8 cannot be found

    2. ORA-15063: ASM discovered an insufficient number of disks for diskgroup %s

    ORA-15040: diskgroup is incomplete

    ORA-15042: ASM disk "%" is missing

    Note #452770.1 - ASM disk not found/visible/discovered issues
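Scanning a long truss or strace file by eye is tedious. As a sketch (not an Oracle tool), the Python below pulls out every call that failed with ENOENT together with the path it tried. It assumes the AIX truss format shown above and the usual Linux strace format; `missing_paths` is a hypothetical helper, and the strace line in the check is a made-up sample.

```python
import re

# "<syscall>("<first string argument>"" anywhere on the line
CALL_WITH_PATH = re.compile(r'(\w+)\("([^"]*)"')

def missing_paths(trace_text: str) -> list[tuple[str, str]]:
    """Return (syscall, path) pairs for calls that failed with ENOENT."""
    hits = []
    for line in trace_text.splitlines():
        if "ENOENT" not in line:
            continue
        m = CALL_WITH_PATH.search(line)
        if m:
            hits.append((m.group(1), m.group(2)))
    return hits
```

Relative names such as `rhdisk8` are resolved against the preceding `chdir`, so read each hit together with the directory change just before it.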


    ASM Troubleshooting Scenarios

    ASM is Unable to Detect ASMLIB Disks/Devices

    1) First, scan the disks (on all the nodes if RAC):

    dbaasm.us.oracle.com:+ASM:oracle:11g>/etc/init.d/oracleasm scandisks
    Scanning system for ASM disks: [  OK  ]

    2) Second, make sure the disks can be listed:

    dbaasm.us.oracle.com:+ASM:oracle:11g>/etc/init.d/oracleasm listdisks
    VOL1_10G
    VOL2_10G

    3) Query each disk:

    dbaasm.us.oracle.com:+ASM:oracle:11g>/etc/init.d/oracleasm querydisk VOL1_10G
    Disk "VOL1_10G" is a valid ASM disk on device [3, 18]
    dbaasm.us.oracle.com:+ASM:oracle:11g>/etc/init.d/oracleasm querydisk VOL2_10G
    Disk "VOL2_10G" is a valid ASM disk on device [3, 22]

    4) Check that they exist at the OS level:

    dbaasm.us.oracle.com:+ASM:oracle:11g>ls -l /dev/oracleasm/disks/VOL1_10G
    brw-rw---- 1 oracle dba 3, 18 Aug 13 09:54 /dev/oracleasm/disks/VOL1_10G
    dbaasm.us.oracle.com:+ASM:oracle:11g>ls -l /dev/oracleasm/disks/VOL2_10G
    brw-rw---- 1 oracle dba 3, 22 Aug 13 09:55 /dev/oracleasm/disks/VOL2_10G

    5) Then, set the disk discovery string in the initialization parameter file as follows:

    asm_diskstring = 'ORCL:*'

    Note: You can also set it through DBCA (during disk group creation) by pressing the [Change Disk Discovery Path] button.
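Steps 2 to 4 can be scripted. A minimal sketch, assuming the labels come from `oracleasm listdisks` output and the default `/dev/oracleasm/disks` directory; `check_asmlib_disks` is a hypothetical helper. It reports labels whose device node is missing or is not a block device (the `b` in `brw-rw----` above), but it does not check ownership or permissions.

```python
import os
import stat

def check_asmlib_disks(labels: list[str], base: str = "/dev/oracleasm/disks") -> dict[str, str]:
    """Classify each ASMLIB label as 'ok', 'missing' or 'not a block device'."""
    result = {}
    for label in labels:
        try:
            mode = os.stat(os.path.join(base, label)).st_mode
        except FileNotFoundError:
            result[label] = "missing"
            continue
        result[label] = "ok" if stat.S_ISBLK(mode) else "not a block device"
    return result
```

On the node above, `check_asmlib_disks(["VOL1_10G", "VOL2_10G"])` would be expected to report both labels as `ok`.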


    ASM Troubleshooting Scenarios

    ASM is Unable to Detect ASMLIB Disks/Devices (LINUX Specific)

    6) If the problem persists, set the disk discovery string as follows:

    asm_diskstring = '/dev/oracleasm/disks/*'

    7) This workaround is possible in Oracle 10g Release 2 and later, because ASM can access block devices directly: Oracle opens them with the O_DIRECT flag, which bypasses the OS cache.

    8) If the problem persists, please open a new service request with Oracle Support and provide the following information (from all the nodes if RAC).

    Upload the following files:

    =)> /var/log/messages
    =)> /etc/sysconfig/oracleasm
    =)> alert_+ASM#.log for each instance

    And the output of the following commands:

    $> cat /etc/*release
    $> uname -a
    $> rpm -qa | grep oracleasm
    $> df -ha
    $> ls -l /dev/oracleasm/disks
    $> powermt display dev=emcpower#     (on all the partitions, if using EMC PowerPath)
    $> /etc/init.d/oracleasm status
    $> /usr/sbin/oracleasm-discover
    $> /usr/sbin/oracleasm-discover 'ORCL:*'
    SQL> show parameter asm

    Note #457369.1 - ASM is Unable to Detect ASMLIB Disks/Devices
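`asm_diskstring` is a comma-separated list of discovery patterns. A rough way to preview what a path-style string such as `/dev/oracleasm/disks/*` matches, when run as the OS user that owns ASM, is to expand it with shell-style globbing. This is a simplified sketch with a hypothetical helper name: it skips `ORCL:` entries (those are resolved by ASMLIB, not by path), and ASM's own discovery also depends on device permissions and disk headers, which it does not check.

```python
import glob

def preview_discovery(asm_diskstring: str) -> list[str]:
    """Expand the path patterns in an asm_diskstring value (simplified)."""
    paths: set[str] = set()
    for pattern in asm_diskstring.split(","):
        pattern = pattern.strip().strip("'\"")
        if not pattern or pattern.upper().startswith("ORCL:"):
            continue  # ASMLIB labels are resolved by the library, not by globbing
        paths.update(glob.glob(pattern))
    return sorted(paths)
```

An empty result for `/dev/oracleasm/disks/*` points back at steps 1 to 4 rather than at the ASM instance.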


    ASM Instance Events

    Applicable Event Levels (15xxx)

    Level 7 - DEBUG - Trace information for ASM/OSM debugging purposes only

    Level 6 - NLOOPS - Trace deeply nested loops within a function

    Level 5 - LOOPS - Trace loops within a function

    Level 4 - CALLS - Trace function call entry

    Level 3 - NORMAL - Trace normal paths within a function

    Level 2 - WARN - Trace warning paths within a function

    Level 1 - ERROR - Trace error paths within a function

    Kx 0x0000010 /* Array portion flags */
    Kxx 0x0000020 /* Alias-Directory operations */
    Kxx 0x0000040 /* Block validation interface */
    Kxx 0x0000080 /* metadata cache */
    Kxx 0x0000100 /* disk operations */
    Kxx 0x0000200 /* file operations */
    Kxx 0x0000400 /* disk group operations */
    Kxx 0x0000800 /* I/O layer (to ASMLIB or KSFD) */
    Kxx 0x0001000 /* node monitor (i.e. CSS interface) */
    Kxx 0x0002000 /* network layer (i.e. RDBMS-ASM connections) */
    Kxx 0x0004000 /* PLSQL package */
    Kxx 0x0008000 /* recovery */
    Kxx 0x0010000 /* templates */
    Kxx 0x0020000 /* SQL execution (processing ASM SQL commands) */
    Kxxx 0x0040000 /* ASM DBWR */
    Kxxx 0x0080000 /* ASM LGWR */
    Kxxx 0x0100000 /* I/O handles mirroring, striping, etc. */
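The slide does not spell out how a trace level and the component flags combine. Because the flags start at 0x10, leaving the low-order bits free for levels 1-7, one plausible reading, and it is only an assumption to confirm with Oracle Support before setting any 15xxx event, is that the event level is the bitwise OR of one trace level and the selected component flags. The sketch only does that arithmetic; the short component keys are hypothetical labels.

```python
# Trace levels from the table above
TRACE_LEVELS = {"ERROR": 1, "WARN": 2, "NORMAL": 3, "CALLS": 4,
                "LOOPS": 5, "NLOOPS": 6, "DEBUG": 7}

# A few of the component flags above; the short keys are hypothetical labels
COMPONENTS = {
    "disk": 0x0000100,       # disk operations
    "file": 0x0000200,       # file operations
    "diskgroup": 0x0000400,  # disk group operations
    "io": 0x0000800,         # I/O layer (to ASMLIB or KSFD)
    "recovery": 0x0008000,   # recovery
}

def event_level(components: list[str], trace_level: str) -> int:
    """ASSUMPTION: level = trace level OR'ed with the selected component flags."""
    level = TRACE_LEVELS[trace_level]
    for name in components:
        level |= COMPONENTS[name]
    return level
```

Under that assumption, NORMAL tracing of disk and disk group operations would be `hex(event_level(["disk", "diskgroup"], "NORMAL"))`, i.e. `0x503`.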


    ASM Instance Tracing

    Trace RBAL process

    [oracle@rac1 ~]$ ps -ef | grep rbal

    oracle 7745 1 0 09:24 ? 00:00:02 asm_rbal_+ASM1

    oracle 9255 1 0 09:27 ? 00:00:00 ora_rbal_whsed1

    oracle 9971 5367 0 11:31 pts/1 00:00:00 grep rbal

    [oracle@rac1 ~]$ strace -f -o /tmp/rbal.trc -p 7745

    Process 7745 attached - interrupt to quit

    Process 7745 detached

    more /tmp/rbal.trc

    7745 semtimedop(163842, 0xbfb973f4, 1, {2, 350000000}) = -1 EAGAIN (Resource temporarily unavailable)

    7745 gettimeofday({1251917133, 714243}, NULL) = 0

    7745 gettimeofday({1251917133, 714337}, NULL) = 0

    7745 gettimeofday({1251917133, 714395}, NULL) = 0

    7745 getrusage(RUSAGE_SELF, {ru_utime={2, 79683}, ru_stime={1, 81835}, ...}) =

    7745 sendmsg(13, {msg_name(16)={sa_family=AF_INET, sin_port=htons(32963), sin_addr=inet_addr("192.168.1.162")}, msg_iov(2)=[{"\4\3\2\1\327\263\200\0\0\0\0\0MRON\0\1\0\0\220\0\0\0\1"..., 68}, {"KSXP\2\0\0\0\1\0\2\0\20\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0r"..., 144}], msg_controllen=0, msg_flags=0}, 0) = 212

    Look for "buffer busy" or "rdbms ipc reply" wait events.


    ASM Rebalancing

    Rebalancing is the activity of spreading data evenly across the disks in an ASM disk group

    Happens automatically in the background, but can also be started manually

    Internally the balance happens on a file-by-file basis

    Only one RBAL process runs per node

    Rebalance requests on the same diskgroup are done serially

    ASM decides how best to balance load across the available disks

    Uses one of three allocation schemes for selecting disks:

    1. Placement by file/extent number

    2. Random-seeded ordering of all disks in the ASM disk directory

    3. Balanced placement over all disks
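To see why adding or dropping a disk triggers so much I/O, it helps to estimate how many extents must move. The sketch below is a simplified, hypothetical model: it assumes ASM aims for extent counts proportional to disk capacity, and it ignores mirroring, partnering and the per-file nature of the real rebalance. `rebalance_plan` is a hypothetical name.

```python
def rebalance_plan(extents_by_disk: dict[str, int], size_mb_by_disk: dict[str, int]):
    """Return (target extents per disk, number of extents that must move).

    Disks present in extents_by_disk but absent from size_mb_by_disk are
    treated as being dropped, so all of their extents must move.
    """
    total_extents = sum(extents_by_disk.values())
    total_size = sum(size_mb_by_disk.values())
    # Share of extents proportional to capacity, rounded down ...
    targets = {d: total_extents * size // total_size for d, size in size_mb_by_disk.items()}
    # ... with the rounding remainder handed to the largest disks first
    leftover = total_extents - sum(targets.values())
    for d in sorted(size_mb_by_disk, key=size_mb_by_disk.get, reverse=True)[:leftover]:
        targets[d] += 1
    disks = set(extents_by_disk) | set(targets)
    moves = sum(max(0, extents_by_disk.get(d, 0) - targets.get(d, 0)) for d in disks)
    return targets, moves
```

For example, adding a third equal-sized disk to two disks holding 100 extents each gives targets of 67, 67 and 66, so 66 extents have to be relocated.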


    ASM Rebalancing

    Parallel execution based on rebalance POWER

    POWER settings are 1-11 (default 1)

    Used to throttle overhead during normal operations

    Rebalance moves 1mb chunks at a time

    Setting POWER to 0 defers rebalancing to another time


    ASM Rebalancing

    Displaying & changing rebalance POWER setting

    SQL> show parameter limit

    NAME                                 TYPE        VALUE
    ------------------------------------ ----------- ------
    asm_power_limit                      integer     1

    Changing setting

    SQL> alter diskgroup dg1 rebalance power 8;

    Verifying Change

    SQL> select * from v$asm_operation;

    GROUP_NUMBER OPERA STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE
    ------------ ----- ---- ---------- ---------- ---------- ---------- ----------
               1 REBAL RUN           8          8          0        407          0
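V$ASM_OPERATION reports work in allocation units: SOFAR (moved so far), EST_WORK (estimated total) and EST_RATE (estimated units per minute); the view also has an EST_MINUTES column. The sketch below shows the arithmetic behind that estimate, and why the sample row above cannot be estimated yet (EST_RATE is still 0). `rebalance_minutes_left` is a hypothetical helper.

```python
from typing import Optional

def rebalance_minutes_left(sofar: int, est_work: int, est_rate: int) -> Optional[float]:
    """Rough minutes remaining for one V$ASM_OPERATION row.

    Returns None while EST_RATE is still 0 (the rate has not been measured yet).
    """
    if est_rate <= 0:
        return None
    return max(est_work - sofar, 0) / est_rate
```

Re-running the query after a minute or two, once EST_RATE is non-zero, gives a usable figure; raising POWER should raise EST_RATE and shorten the estimate.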


    ASM AU/Extent Management

    Allocation Units (AU) at the disk level and Extents at the file level

    Default AU size is 1mb

    Default extent size is 1mb

    Extents are allocated in 1, 4, 16, & 64mb chunks (11g)

    Extent placement is circular when disks are the same size

    AU size cannot be changed without recreating the diskgroup

    Templates can be created and added to diskgroups
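To make "circular" placement concrete: with equal-sized disks, consecutive extents of a file go to consecutive disks and wrap around. The sketch is a deliberately simplified model (one copy, no failure groups, a hypothetical per-file starting disk `start`), not ASM's actual allocator.

```python
from collections import Counter

def place_extents(num_extents: int, disks: list[str], start: int = 0) -> list[str]:
    """Assign extents round-robin across equal-sized disks, beginning at `start`."""
    return [disks[(start + i) % len(disks)] for i in range(num_extents)]

def extents_per_disk(placement: list[str]) -> Counter:
    """Count how many extents landed on each disk."""
    return Counter(placement)
```

With three disks, nine extents land three per disk, which is why same-sized disks stay evenly filled without further rebalancing.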


    ASM Performance Considerations

    Metadata ONLY is Cached In The ASM Instance

    ASM Diskgroup Configuration

    External Redundancy

    Normal Redundancy (default)

    High Redundancy

    ASM Instance Configuration (large_pool_size)

    Resolving ORA-4031

    ASM Allocation Unit Size (1mb default)

    ASM Fine Grained Stripe Size (8x128k Stripes)

    MAX I/O Size

    Oracle Block Size


    ASM Default Template

    Archivelog Files - Coarse

    Autobackup - Coarse

    Controlfile - Fine Grained

    Datafile - Coarse

    Flashback data - Fine Grained

    Online REDO - Fine Grained

    SPFILE - Coarse

    Tempfile - Coarse

    Coarse: 1mb stripe size

    Fine Grained: 8 x 128k stripes
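The two stripe sizes can be made concrete with a little offset arithmetic. The sketch below is a simplified model, assuming 1mb allocation units and extents, an 8-extent stripe set for fine-grained files, and no mirroring; `coarse_location` and `fine_location` are hypothetical helpers, not ASM internals. Each returns the file extent number and the byte offset inside that extent.

```python
KB = 1024
MB = 1024 * KB

def coarse_location(offset: int, extent_size: int = 1 * MB) -> tuple[int, int]:
    """(extent index, offset within extent) for a coarse-striped file."""
    return offset // extent_size, offset % extent_size

def fine_location(offset: int, stripe: int = 128 * KB, width: int = 8,
                  extent_size: int = 1 * MB) -> tuple[int, int]:
    """(extent index, offset within extent) for a fine-striped file:
    128k stripe units rotate across a set of `width` extents."""
    unit = offset // stripe
    per_extent = extent_size // stripe         # 8 stripe units per 1mb extent
    units_per_set = width * per_extent         # 64 units per set of 8 extents
    extent_set, pos = divmod(unit, units_per_set)
    extent = extent_set * width + pos % width
    within = (pos // width) * stripe + offset % stripe
    return extent, within
```

For example, the first 512k of a fine-grained file (such as an online redo log) is spread over extents 0 to 3, which normally sit on different disks, whereas the same range of a coarse-striped file stays inside extent 0.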


    ASM Templates

    Striping Attributes: Fine, Coarse

    Redundancy Attributes:

    Mirror - 2 way

    High - 3 way

    Unprotected - Not mirrored


    ASM Templates

    Viewing Templates

    select * from V$ASM_TEMPLATE;

    Altering Template

    Alter diskgroup DG modify template NAME attributes (coarse/fine);

    Adding Template

    Alter diskgroup DG add template NAME attributes (attributes);

    Dropping Templates

    Alter diskgroup DG drop template NAME;
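When templates are created from scripts, it helps to validate the attributes before the statement reaches the instance. A minimal sketch; `template_ddl` is a hypothetical helper that accepts only the redundancy (MIRROR, HIGH, UNPROTECTED) and striping (COARSE, FINE) attributes listed above and uses the MODIFY keyword shown on this slide. It does not check the attribute against the disk group's own redundancy type, so confirm the exact syntax in the SQL Reference for your release.

```python
REDUNDANCY = {"MIRROR", "HIGH", "UNPROTECTED"}
STRIPING = {"COARSE", "FINE"}

def template_ddl(action: str, diskgroup: str, name: str,
                 redundancy: str = "", striping: str = "") -> str:
    """Build an ALTER DISKGROUP ... TEMPLATE statement (ADD, MODIFY or DROP)."""
    action = action.upper()
    if action == "DROP":
        return f"ALTER DISKGROUP {diskgroup} DROP TEMPLATE {name};"
    if action not in ("ADD", "MODIFY"):
        raise ValueError(f"unsupported action: {action}")
    attrs = []
    if redundancy:
        if redundancy.upper() not in REDUNDANCY:
            raise ValueError(f"bad redundancy attribute: {redundancy}")
        attrs.append(redundancy.upper())
    if striping:
        if striping.upper() not in STRIPING:
            raise ValueError(f"bad striping attribute: {striping}")
        attrs.append(striping.upper())
    if not attrs:
        raise ValueError("at least one attribute is required")
    return f"ALTER DISKGROUP {diskgroup} {action} TEMPLATE {name} ATTRIBUTES ({' '.join(attrs)});"
```

For example, `template_ddl("add", "DG", "REDO_HI", "high", "fine")` produces `ALTER DISKGROUP DG ADD TEMPLATE REDO_HI ATTRIBUTES (HIGH FINE);`.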


    ASM Background Processes

    ora_asmb_whsed1 - ASMB, runs in the client database instance and connects to the ASM instance as a foreground to service the database's requests

    asm_pmon_+ASM1 - Process monitor, same as database

    asm_vktm_+ASM1 - Process to maintain a fast timer, same as database

    asm_diag_+ASM1 - Diag process, same as database

    asm_ping_+ASM1 - Process to measure network latency, same as database

    asm_psp0_+ASM1 - Process that Starts other Processes, used to startup other backgrounds

    asm_dia0_+ASM1 - Diag slave process, same as database

    asm_lmon_+ASM1 - Lock monitor, Same as database

    asm_lmd0_+ASM1 - Global enqueue service daemon (lock manager), same as database

    asm_lms0_+ASM1 - Global cache service process, same as database

    asm_mman_+ASM1 - Autotune SGA process, same as database

    asm_dbw0_+ASM1 - Database writer, same as the database DBWR, but deals with the ASM cache

    asm_lgwr_+ASM1 - Log writer, similar to database, but deals with diskgroups

    asm_ckpt_+ASM1 - Checkpoint process, Similar to database CKPT

    asm_smon_+ASM1 - Recovery process, same as database SMON, but deals with diskgroup recovery

    asm_rbal_+ASM1 - Background process that is used for diskgroup management

    asm_gmon_+ASM1 - Group monitor, used for partner and status table, and node membership

    asm_lck0_+ASM1 - Lock monitor slave, Same as database
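All of these follow the `<prefix>_<process>_<SID>` naming pattern (`asm_` for the ASM instance, `ora_` for databases), which makes it easy to find, for example, the RBAL PID to pass to strace. A minimal sketch, assuming standard `ps -ef` output with the command in the last column; `background_processes` is a hypothetical helper.

```python
import re

# Background process names look like <prefix>_<proc>_<sid>, e.g. asm_rbal_+ASM1
PROC_NAME = re.compile(r'^(asm|ora)_([a-z0-9]+)_(\S+)$')

def background_processes(ps_output: str) -> dict[tuple[str, str], int]:
    """Map (instance SID, process name) to PID from `ps -ef` output."""
    procs = {}
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        m = PROC_NAME.match(fields[-1])  # last column is the command
        if m:
            procs[(m.group(3), m.group(2))] = int(fields[1])
    return procs
```

Given the `ps -ef | grep rbal` output from the tracing slide, this maps `("+ASM1", "rbal")` to 7745 and `("whsed1", "rbal")` to 9255, and ignores the `grep` line itself.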


    ASM Views (10g & 11G)

    View               Contents

    V$ASM_ALIAS        Aliases for each disk group mounted by the ASM instance

    V$ASM_CLIENT       Databases using disk groups managed by the ASM instance

    V$ASM_DISK         Disks discovered by the ASM instance

    V$ASM_DISKGROUP    Disk groups known by the ASM instance

    V$ASM_FILE         Files in each disk group mounted by the ASM instance

    V$ASM_OPERATION    Long-running operations executing in the ASM instance

    V$ASM_TEMPLATE     Templates present in each disk group mounted by the ASM instance
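Rows from V$ASM_DISK (for example, from the data-gathering script earlier in this deck) can be screened for the usual warning signs. A minimal sketch, assuming each row is a dict keyed by the view's column names; `flag_disks` is a hypothetical helper and the checks are a starting point, not an exhaustive diagnosis.

```python
def flag_disks(rows: list[dict]) -> list[tuple[str, str]]:
    """Return (path, reason) for V$ASM_DISK rows that usually need attention."""
    flagged = []
    for r in rows:
        path = r.get("PATH") or "<no path>"
        if r["MOUNT_STATUS"] == "MISSING":
            # Group metadata knows the disk, but discovery did not find it
            flagged.append((path, "disk group expects this disk but it was not discovered"))
        elif r["GROUP_NUMBER"] == 0 and r["HEADER_STATUS"] == "MEMBER":
            # Header says it belongs to a group, yet no mounted group owns it
            flagged.append((path, "MEMBER header but not part of any mounted group"))
        if (r.get("READ_ERRS") or 0) or (r.get("WRITE_ERRS") or 0):
            flagged.append((path, "I/O errors recorded"))
    return flagged
```

A MISSING disk usually has no path, because ASM only knows it from the disk group metadata.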


    ASMCMD Command Reference

    cd - Changes the current directory to the specified directory.
    du - Displays the total disk space occupied by ASM files in the specified ASM directory.
    exit - Exits ASMCMD.
    find - Lists the paths of the specified name (with wildcards) under the specified directory.
    help - Displays the syntax and description of ASMCMD commands.
    ls - Lists the contents of an ASM directory, the attributes of the specified file, or the names and attributes of all disk groups.
    lsct - Lists information about current ASM clients.
    lsdg - Lists all disk groups and their attributes.
    mkalias - Creates an alias for a system-generated filename.
    mkdir - Creates ASM directories.
    pwd - Displays the path of the current ASM directory.
    rm - Deletes the specified ASM files or directories.
    rmalias - Deletes the specified alias, retaining the file that the alias points to.


    New 11g ASM Commands

    cp - Enables you to copy files between ASM disk groups on local instances and remote instances.

    lsdsk - Lists disk information with or without a running ASM instance. Also useful for system or storage administrators to obtain lists of the disks that an ASM instance uses.

    md_backup and md_restore - These commands enable you to re-create a pre-existing ASM disk group with the same disk path, disk name, failure groups, attributes, templates and alias directory structure. You can use md_backup to back up the disk group environment and use md_restore to re-create the disk group before loading from a database backup.

    remap - You can remap and recover bad blocks on an ASM disk in normal or high redundancy that have been reported by storage management tools such as disk scrubbers. ASM reads from the good copy of an ASM mirror and rewrites these blocks to an alternate location on disk.


    MySupport ASM References

    Note: 340417.1 - Data Gathering for Troubleshooting ASM Issues
    Note: 267982.1 - Automatic Storage Management (ASM) Knowledge Browser Product Page
    Note: 824354.1 - How To Trace ASMCMD on Unix
    Note: 351866.1 - How To Reclaim ASM Disk Space
    Note: 345180.1 - How to duplicate a controlfile when ASM is involved
    Note: 553319.1 - ORA-15036 When Starting An ASM Instance
