
Recovery Tales from Experience

Ken McDonald, R&D Solutions Architect
[email protected]

Belgium DB2 GSE - Antwerp, December 13, 2016

© Copyright 2016 BMC Software, Inc.


Acknowledgements / Disclaimers

IBM®, DB2®, and z/OS® are registered service marks and trademarks of International Business Machines Corporation in the United States and/or other countries.

The information contained in this presentation has not been submitted to any formal review and is distributed on an “As Is” basis. All opinions, mistakes, etc. are my own.


Agenda

Overview

Abstract Bullets
– Hear about real life events that lead to DB2 Recovery scenarios

– Learn about possible pitfalls during recovery related to insufficient resources or incomplete planning

– See how alternative resources can be used to work around seemingly insurmountable problems

– Understand that sometimes a Recover is not really a Recover

– Grasp the importance of planning for Recovery

Several Stories presented

Summary


Overview


We find a lot of mirrored environments
– Often, mirroring only replicates bad data

• Especially when related to an application or system software error

– There is an overreliance on mirrored data as the only DR approach

– A PLAN B of regular image copies needs to be taken

Corrupt ARCHIVE and ACTIVE LOG data is detrimental

There seems to be a general reluctance to declare a disaster

Application Recovery or Forging Ahead is often the choice
– Log Tools for UNDO SQL generation help mitigate this

Backups are considered for Production Systems, but often not for critical test systems
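For shops leaning entirely on mirroring, the "PLAN B" can be as simple as regularly scheduled full image copies. A minimal sketch of a COPY utility control statement follows; the database, table space, and DD names are hypothetical, and the frequency and SHRLEVEL should be driven by your own recovery SLA:

  COPY TABLESPACE APPDB.TS00001
       COPYDDN(LOCALP)
       FULL YES
       SHRLEVEL CHANGE

SHRLEVEL CHANGE lets applications keep updating while the copy runs; the copy is registered in SYSIBM.SYSCOPY and becomes the starting point for a log-apply recovery.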


Okay Kids,

It’s Story Time!


Story 1


A DASD Controller Catches Fire

The Event and Fallout

• Well, it smoked a lot if it didn’t actually flame up

• I/O corrupted and lost during the overheating period
– Including the mirrored data

• Over 300 Volumes behind the controller

• Table spaces, Index spaces, and Log Files impacted
– Data and index pages out of sync

– Both copies of archive log files lost in some cases


A DASD Controller Catches Fire

Original Approach to Recovery / Problems Encountered

Tried RECOVERY TO CURRENT
– One recovery utility failed initially due to missing critical maintenance

• Workaround available

– Same utility failed with the workaround and after maintenance, due to missing log files

– Second recovery utility “worked” but left many (uncounted) data pages and indexes out of sync, leading to 00C90101 abends

• Attempted to REPAIR each one, but the effort grew too large

• You only find them if the access path takes you there
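Where the inconsistency is only on the index side, rebuilding the index from the (good) table data is usually far less painful than REPAIRing individual pages. A minimal sketch with a hypothetical object name; it does not help when the data pages themselves are damaged, which was partly the case here:

  REBUILD INDEX (ALL) TABLESPACE APPDB.TS00001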


A DASD Controller Catches Fire

Getting around the problems

• Decided to do a Point In Time Recover prior to the controller issue for problem objects (to get the Application active)
– Initially chose the wrong point, not accounting for GMT versus Local Time

• Used INDEP OUTSPACE Recovery for individual objects to get closer to CURRENT
– Compared INDEP copy of TS to PIT to determine differences

– Did TABLE RENAMEs to use the copy if satisfied it was better

• Implies potential changes to SPACE-name-based Utilities

– Otherwise, extracted differences programmatically to apply… not sure how


A DASD Controller Catches Fire

Getting around the problems

DSNJU004 – Print Log Map snippet
– Start/End times shown with the RBA/LRSN values are GMT
– The DATE/LTIME column is Local Time

ARCHIVE LOG COPY 1 DATA SETS

START RBA/LRSN/TIME END RBA/LRSN/TIME DATE/LTIME DATA SET INFORMATION

---------------------- ---------------------- ---------- --------------------

0100000000004AD15000 01000000000076C34FFF 2014.027 DSN=DSNDKD.DKD1.ARCHLOG1.A0000060

010036AE8DFACFD06B83 0100E1DAF00F29564F83 15:20 VOL=160053 UNIT=CART

2013.256 17:51:35.3 2014.027 21:20:39.9

CATALOGUED

01000000000076C35000 01000000000085461FFF 2015.173 DSN=DSNDKD.DKD1.ARCHLOG1.A0000061

0100E1DAF00F29564F83 010364430991997CBD83 15:05 VOL=103942 UNIT=CART

2014.027 21:20:39.9 2015.173 20:04:48.0

CATALOGUED

01000000000085462000 010000000000B1381FFF 2015.218 DSN=DSNDKD.DKD1.ARCHLOG1.A0000062

010364430991997CBD83 01039D34B1A684494F83 22:04 VOL=717335 UNIT=CART

2015.173 20:04:48.0 2015.219 03:04:07.9
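Once the desired recovery time has been converted to GMT and translated into an RBA or LRSN from the log map, the point-in-time recovery itself is simple to state. A minimal sketch using base RECOVER syntax; the object name is hypothetical and the log point is illustrative only (it is copied from the snippet above and must come from your own DSNJU004 output or a log tool in practice):

  RECOVER TABLESPACE APPDB.TS00001
          TOLOGPOINT X'0100000000004AD15000'

Picking a value that corresponds to the wrong local/GMT conversion is exactly how this recovery initially landed at the wrong point.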


A DASD Controller Catches Fire

What planning could have made the recovery easier

• I/O Configuration Matters
– Always separate pairs of BSDS, ACTIVE LOG, and ARCHIVE LOG copies to different strings of DASD

– The log is sacrosanct in recovery

• HIPER and FLASH maintenance should be applied
– Recovery experience elongated due to missing PTFs identified as high priority maintenance

• Slow down and VERIFY before committing to a RECOVERY TO Point In Time
– A few extra minutes of think time could have saved hours of bad recovery time


Story 2


System Level Software corruption of I/O

The Event and Fallout

• Bug in System Level Software caused I/O errors… the I/O completed “correctly”, but incorrect pages were written to the physical pages
– Page x physically written with Page y

• Bad things happened of many varieties
– Many spaces went into GRECP or LPL; the START command was too slow to keep up

– RC00C90101 inconsistent data

– Other abends due to unexpected data

• Log was good… TS/IX pages were corrupted
– Recovery TO CURRENT is possible

– Worst scenario for daily full image copy paradigm – 23 hours of log


System Level Software corruption of I/O

Original Approach to Recovery / Problems Encountered

• Was unprepared for a wholesale recovery

• 400+ spaces involved, so generated JCL jobs, 1 per space, and submitted them
– Major contention occurred on the archive log files

– Tape contention and waiting for tape

– Also read the archive logs 400+ times, instead of submitting multiple recoveries in fewer parallel jobs to reduce the I/O of each individual job re-reading the archives

• This is when we got involved, after being asked:
– “Why are your recovers running so slowly?”


System Level Software corruption of I/O

Getting around the problems

• After the original attempt had been running for hours…

• Cancel all outstanding jobs

• Regenerate fewer jobs to do multiple recoveries
– Let jobs multi-task and read the log fewer times

– Stagger starts to avoid archive log tape contention

• The remaining, larger percentage of spaces recovered in a fraction of the time the successful jobs in the original scenario took for a small percentage of spaces
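The key change was letting a single recovery invocation process a list of objects, so the log is read once for the whole list rather than once per space. A minimal sketch in base RECOVER syntax, with hypothetical object names; PARALLEL limits how many image copies are restored concurrently within the job:

  RECOVER TABLESPACE APPDB.TS00001
          TABLESPACE APPDB.TS00002
          TABLESPACE APPDB.TS00003
          TABLESPACE APPDB.TS00004
          PARALLEL(4)

Several such jobs, each covering its own subset of the 400+ spaces with staggered start times, gives the multi-tasking and reduced archive tape contention described above.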


System Level Software corruption of I/O

What planning could have made the recovery easier

• Prepare for Recovery BEFORE recovery
– Have existing JCL if possible

– Otherwise, have a planned approach for recovery (e.g. this is how we generate optimized JCL during the event)

– Design your backup methodology to match your recovery strategy

• Take a breath before submitting the recoveries
– Verify the approach and analyze pitfalls BEFORE submitting. 15 minutes up front can save you hours in execution

• PRACTICE, PRACTICE, PRACTICE
– Learning how to recover should not occur during an emergency


Story 3


DASD Hardware / Data Movement corruption of I/O

The Event and Fallout

• Hardware Vendor promised ‘Transparent Movement’ of data from old hardware to new hardware while updates were occurring

• I/O was dropped during the process
– New inserts on pages not externalized / RIDs reused by subsequent inserts

– Deleted rows not externalized… index may not reflect table data

– Updates also dropped and rows re-updated missing previous data

• This was tablespace/index data. Log was not involved directly in the I/O errors

• But…


DASD Hardware / Data Movement corruption of I/O

LOG WAS CORRUPTED due to the dropped I/O
– Was not a physical corruption, but a logical one

– Duplicate inserts into the same RID

– Before and After images of concurrent updates not reflected correctly

– Who knows what else

• These were the roadblocks we encountered in our attempt at Recover and in our use of Log Tools

– Have I mentioned that the log is sacrosanct? But it wasn’t…


DASD Hardware / Data Movement corruption of I/O

Original Approach to Recovery / Problems Encountered

• Standard TO CURRENT Recovery attempted

• Some Recoveries failed due to lack of image copies as a starting point within the same time frame as available log

• Some Recoveries failed due to sanity checks
– Can’t have the same ROWID inserted twice in a page

– After Image of prior update does not match before image of subsequent update

• One Recovery utility did not error on this – led to RC00C90101


DASD Hardware / Data Movement corruption of I/O

Getting around the problems

• Mirrored environment had backups of hardware snaps available (due to data movement testing)

• Restored that environment to most recent physical dump

• Created Image Copies to use in Production

• Recovered to specified image copies
• This is supported in DB2 12 DSNUTILB

• Used a Log Tool to get transactions (generate SQL) from image copy point to point of corruption
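Recovering to a named image copy rather than to currency can be sketched with base RECOVER syntax as below; the object and image copy data set names are hypothetical, and the copy is assumed to be registered in SYSIBM.SYSCOPY. The object is then left at the copy point, which is why a log tool was needed to carry transactions forward to the point of corruption:

  RECOVER TABLESPACE APPDB.TS00001
          TOCOPY PROD.IMAGCOPY.APPDB.TS00001.D2016347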


DASD Hardware / Data Movement corruption of I/O

Getting around the problems

• Log Tool challenges
– Used available syntax to identify Image Copies created from Mirrors

• Image Copies used as source for completion of partially logged updates and compression dictionaries

– Duplicate Inserts caused no problems

– Missing Updates caused issues on 4 spaces out of 70+ implicated

• Used syntax to extract Inserts and Deletes

• Managed Updates separately

– Not quite a 100% recovery, but pretty close


DASD Hardware / Data Movement corruption of I/O

What planning could have made the recovery easier

• More Frequent Image Copies
– At a minimum, align image copy frequency to log availability

– Local copies would have prevented mirror shenanigans

• Delay in Recovery

• Log Tool implications… what if yours does not have the capability to specify alternative resources?

• Log Corruption
– Bad is Bad… I don’t see how planning could have avoided this in this scenario, short of taking downtime for the data movement


Dropped Object Recovery

The Event and Fallout

• Accidental DROP TABLESPACE

Original Approach

• Recreate TABLESPACE

• RECOVER TO INCOPY of last full image copy

• Used Log Tool to access transactions between image copy and DROP to generate SQL
– This is when we got involved…


Dropped Object Recovery

Problems Encountered

• Log Tools designed to process objects current in catalog

• Several have alternative mapping capability
• Also need to be aware of other SYSCOPY events like REORGs

• Image Copies needed for completion/compression no longer in catalog

Getting around the problems

• Had syntax available to get around the missing SYSCOPY issue

• Was tedious and slow compared to alternative approach


Dropped Object Recovery

What planning could have made the recovery easier

• Knowledge of the tools they had available
– Recovery product could have applied log to the point of drop

– Log Tool could have generated the complete drop recovery JCL

• Scan Log of time of drop

• Generate/Execute DDL to recreate object

• Generate/Execute Recovery to point of drop

• REBIND invalidated PACKAGES

• Why do things manually when you had automation?


Story 4


Oops Happens (SQL UNDO)

The Event and Fallout

• SQL Executed that corrupts data
– I have countless stories here…

– Bad Application updates

– Unqualified SPUFI or DSNTEP2 updates

– Malicious Intent

Original Approach to Recovery / Problems Encountered

• Used a Log Tool to generate UNDO SQL and execute

• Need to consider subsequent updates to rows being undone

• Works like a champ! (Well, most of the time)
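To make the idea concrete, here is a purely hypothetical illustration of what log-tool-generated UNDO SQL conceptually looks like; the table, column, and key names are invented, and real tools build the statements from the before-images captured on the log:

  -- The "oops": an unqualified update run from SPUFI (hypothetical)
  UPDATE APPSCHEMA.ACCOUNT SET STATUS = 'CLOSED';

  -- Conceptual UNDO SQL: one statement per affected row, in reverse log order,
  -- restoring the before-image value from the log
  UPDATE APPSCHEMA.ACCOUNT SET STATUS = 'ACTIVE'  WHERE ACCT_ID = 1001;
  UPDATE APPSCHEMA.ACCOUNT SET STATUS = 'PENDING' WHERE ACCT_ID = 1002;

The "subsequent updates" caveat above matters here: if another transaction changed the same rows after the oops, blindly applying the UNDO SQL would wipe out those good changes.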


Oops Happens (SQL UNDO)

What planning could have made the recovery easier

• If you don’t have a Log Tool… *really*, *really* consider one

• Transaction-level UNDO occurs while your objects remain online, preserving good changes

• I’m not selling. I don’t care which one you get.
– Truthfully, I don’t care if you DO get one…

• Alternatives
– Point in Time Recovery. Lose good updates as well as bad.

– Write your own fix-it program to reverse bad updates

– Intimate knowledge of DSN1LOGP and log formats to do the heavy lifting yourself (a sketch follows this list)

– There’s a new discussion on the DB2-L list about this 3-4 times a year
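For the DSN1LOGP route above, a minimal summary-report sketch follows. The load library and BSDS names are hypothetical (the BSDS prefix simply mirrors the earlier log map snippet), and the RBA range is illustrative; in practice you would bound it around the time of the bad SQL using DSNJU004 output:

  //LOGSUMM  EXEC PGM=DSN1LOGP
  //STEPLIB  DD DISP=SHR,DSN=DB2HLQ.SDSNLOAD        site-specific load library
  //SYSPRINT DD SYSOUT=*
  //SYSSUMRY DD SYSOUT=*
  //BSDS     DD DISP=SHR,DSN=DSNDKD.DKD1.BSDS01     hypothetical BSDS name
  //SYSIN    DD *
    RBASTART (4AD15000) RBAEND (76C34FFF)
    SUMMARY (ONLY)
  /*

SUMMARY(ONLY) lists the units of recovery in the range; turning that into UNDO SQL still means decoding the detail log records yourself, which is exactly the heavy lifting a Log Tool does for you.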


Story 5


Application Error Discovered after resources purged

The Event and Fallout

• Application change caused data to be incorrectly updated

• Fallout was letters sent to customers alerting to ‘badness’

• Discovered when an application team member received one of the errant letters more than a week after the problem

• Subsequent updates to many rows in the table


Application Error Discovered after resources purged

Original Approach to Recovery / Problems Encountered

• The Application team tried to address it independently for several days

• 2 weeks after the problem, the DBAs were contacted for a PIT Recovery

• Forward PIT Recovery failed
– Archive log was available for the time frame of the PIT

– But… Image Copies prior to the PIT point did not exist

• Backout PIT Recovery failed
– The REORG utility had run on most spaces between the PIT and CURRENT

– LOG NO utility = NO BACKOUT

– Even if LOG YES, only the new page formats are logged, not old data.


Application Error Discovered after resources purged

Getting around the problems

• Punt

• Company issued press release explaining error

• Company/Application Team had to manage the consequences


Application Error Discovered after resources purged

What planning could have made the recovery easier

• Get the DBAs involved earlier in the event.

• Tools available had ability to assess Recoverability

– MODIFY capability to keep all necessary resources for recovery for a specified number of days (a base-DB2 sketch follows this list)

– Separate Batch Program available to report on Recoverability

• Know and TAKE ADVANTAGE of the Tools you own

• They did not have a Log Tool

– Could have generated UNDO SQL for errant updates

– Some have ‘Integrity’ report to show subsequent updates to the rows implicated by the UNDO SQL
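In base DB2 terms, the two capabilities above correspond roughly to MODIFY RECOVERY retention and the REPORT RECOVERY utility. A minimal sketch, with a hypothetical object name and an assumed 35-day recovery window: the MODIFY statement removes SYSCOPY/SYSLGRNX entries older than the window (implicitly keeping what is needed to recover within it), and REPORT RECOVERY lists the copies, log ranges, and archive logs a recovery would need:

  MODIFY RECOVERY TABLESPACE APPDB.TS00001 DELETE AGE(35)

  REPORT RECOVERY TABLESPACE APPDB.TS00001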


Story 6


BSDS Corruption or Inconsistency
Data Sharing Enablement / Conditional Restart Problems

The Event and Fallout

• Have seen several scenarios

• Some caused by missing maintenance
– e.g. STCK TO LRSN DELTA not reset when Member 1 Cold Started or DESTROYED

– STCK TO LRSN DELTA not propagated to newly added members

• Some caused by incorrect DSNJU003 execution
– Bad Conditional Restart record creation

– Improper sequence of DEACTIVATE/DESTROY procedures

• Worst case example, DB2 was down for 2+ days
– EMEA IDUG 2009 two-part presentation on this event


BSDS Corruption or Inconsistency

Original Approach to Recovery / Problems Encountered

• Varied…

Getting around the problems

• Again, varied…

What planning could have made the recovery easier

• Avoidance!
– PRACTICE an out-of-the-ordinary event before doing it in production

– Slow down and confirm the process before executing in production

• Conditional Restart Point and Syntax for DSNJU003 correct?

• Take a copy of the BSDS BEFORE changing it as a fall back (a sketch of both steps follows)
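A minimal sketch of those last two points. The data set names borrow the DSNDKD.DKD1 prefix from the earlier log map snippet but are otherwise hypothetical, the backup data sets are assumed to be pre-allocated, and the ENDRBA value is purely illustrative; verify it against DSNJU004 output before any real run:

  //* Back up both BSDS copies BEFORE any change log inventory run
  //BSDSBKUP EXEC PGM=IDCAMS
  //SYSPRINT DD SYSOUT=*
  //SYSIN    DD *
    REPRO INDATASET(DSNDKD.DKD1.BSDS01) OUTDATASET(DSNDKD.DKD1.BSDS01.BACKUP)
    REPRO INDATASET(DSNDKD.DKD1.BSDS02) OUTDATASET(DSNDKD.DKD1.BSDS02.BACKUP)
  /*
  //* Create the conditional restart record with DSNJU003 (change log inventory)
  //CRCREATE EXEC PGM=DSNJU003
  //STEPLIB  DD DISP=SHR,DSN=DB2HLQ.SDSNLOAD        site-specific load library
  //SYSUT1   DD DISP=OLD,DSN=DSNDKD.DKD1.BSDS01
  //SYSUT2   DD DISP=OLD,DSN=DSNDKD.DKD1.BSDS02
  //SYSPRINT DD SYSOUT=*
  //SYSIN    DD *
    CRESTART CREATE,ENDRBA=000076C35000
  /*

Having the BSDS backups means a bad conditional restart record can be backed out by restoring the BSDS, rather than digging the subsystem out of a deeper hole.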


Story 7


Critical Test System Lost

The Event and Fallout

• DB2 System painstakingly maintained since Version 3.1

• Objects created in each version and migrated forward

• Used to test Old format DBDs / reproduce customer issues

• Backed up religiously by an individual in QA
– Job never scheduled, was a manual effort

– System restored more than once over the years

• Person left company, backups stop

• After several months, the system was broken after a weekend IPL
– Critical DSNDB06 and DSNDB01 spaces broken


Critical Test System Lost

Original Approach to Recovery / Problems Encountered

• Standard Recovery
– No image copy taken for several months

– Log only available for the last two weeks

• Looked for Volume Dumps by the Storage team
– Not available for ‘non-production’ systems

Getting around the problem

• Tried cold starting

• Tried initializing some spaces (e.g. SYSLGRNX) to empty

• Tried several harebrained attempts to fool mother nature

• Ultimately, could not revive the system


Critical Test System Lost

What planning could have made the recovery easier

• DON’T IPL OVER ACTIVE DB2 SYSTEMS

• Have backups of the ‘Critical’ Test System

• “Production” is not just the revenue-supporting systems
– Critical systems used by QA Regression

– Critical systems used by Application Development

• How much productivity is lost if this type system is down for an extended period?


Okay Kids,

Story Time is Over!


Summary

Bad Things Happen…
– And Mirroring duplicates Bad Things

– Always have a PLAN B in place for when hardware fails

• Maybe standard image copies should be Plan A…

There is no substitute for Planning and Preparation
– Your Recovery SLA should drive your backup strategy

– Your backup strategy should be designed to fit your Recovery SLA

– Consider Recovery SLAs for non-production systems

– I/O Configuration should be central to recovery planning

Know and understand the tools available in your shop
– Exploit their capabilities

– Involve your vendors early in a critical recovery event


Summary

BACKUP critical artifacts BEFORE starting a major recovery
– THINGS CAN AND OFTEN DO GET WORSE before they get better

– Having a path back to restart is prudent

Practice is the only path to Experience

–Practice

• Practice

–Practice

» Practice

• Practice

• …


Questions?

And thank you

[email protected]