
Recovery Tales from Experience

Ken McDonald, R&D Solutions Architect
[email protected]

Belgium DB2 GSE - Antwerp, December 13, 2016

© Copyright 2016 BMC Software, Inc.


Acknowledgements / Disclaimers

IBM®, DB2®, and z/OS® are registered service marks and trademarks of International Business Machines Corporation in the United States and/or other countries.

The information contained in this presentation has not been submitted to any formal review and is distributed on an “As Is” basis. All opinions, mistakes, etc. are my own.


Agenda

Overview

Abstract Bullets
– Hear about real life events that lead to DB2 Recovery scenarios

– Learn about possible pitfalls during recovery related to insufficient resources or incomplete planning

– See how alternative resources can be used to work around seemingly insurmountable problems

– Understand that sometimes a Recover is not really a Recover

– Grasp the importance of planning for Recovery

Several Stories presented

Summary


Overview


We find a lot of mirrored environments
– Often, mirroring only replicates bad data

• Especially when related to an application or system software error

– There is an overreliance on mirrored data as the only DR approach

– A PLAN B of regular image copies needs to be taken

Corrupt ARCHIVE and ACTIVE LOG data is detrimental

There seems to be a general reluctance to declare a disaster

Application Recovery or Forging Ahead is often the choice
– Log Tools for UNDO SQL generation help mitigate this

Backups are considered for Production Systems, but often not for critical test systems
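For shops leaning entirely on mirroring, the "PLAN B" can be as simple as regularly scheduled full image copies. A minimal sketch of a COPY utility control statement follows; the database, table space, and DD names are hypothetical, and the frequency and SHRLEVEL should be driven by your own recovery SLA:

  COPY TABLESPACE APPDB.TS00001
       COPYDDN(LOCALP)
       FULL YES
       SHRLEVEL CHANGE

SHRLEVEL CHANGE lets applications keep updating while the copy runs; the copy is registered in SYSIBM.SYSCOPY and becomes the starting point for a log-apply recovery.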


Okay Kids,

It’s Story Time!


Story 1


A DASD Controller Catches Fire

The Event and Fallout

• Well, it smoked a lot if it didn’t actually flame up

• I/O corrupted and lost during the overheating period
– Including the mirrored data

• Over 300 Volumes behind the controller

• Table spaces, Index spaces, and Log Files impacted
– Data and index pages out of sync

– Both copies of archive log files lost in some cases


A DASD Controller Catches Fire

Original Approach to Recovery / Problems Encountered

Tried RECOVERY TO CURRENT
– One recovery utility failed initially due to missing critical maintenance

• Workaround available

– Same utility failed with the workaround and after maintenance, due to missing log files

– Second recovery utility “worked” but left many (uncounted) data pages and indexes out of sync, leading to 00C90101 abends

• Attempted to REPAIR each one, but the effort grew too large

• You only find them if the access path takes you there
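Where the inconsistency is only on the index side, rebuilding the index from the (good) table data is usually far less painful than REPAIRing individual pages. A minimal sketch with a hypothetical object name; it does not help when the data pages themselves are damaged, which was partly the case here:

  REBUILD INDEX (ALL) TABLESPACE APPDB.TS00001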


A DASD Controller Catches Fire

Getting around the problems

• Decided to do a Point In Time Recover prior to the controller issue for problem objects (to get the Application active)
– Initially chose the wrong point, not accounting for GMT versus Local Time

• Used INDEP OUTSPACE Recovery for individual objects to get closer to CURRENT
– Compared INDEP copy of TS to PIT to determine differences

– Did TABLE RENAMEs to use the copy if satisfied it was better

• Implies potential changes to SPACE-name-based Utilities

– Otherwise, extracted differences programmatically to apply… not sure how


A DASD Controller Catches Fire

Getting around the problems

DSNJU004 – Print Log Map snippet
– Start/End times shown with the RBA/LRSN values are GMT
– The DATE/LTIME column is Local Time

ARCHIVE LOG COPY 1 DATA SETS

START RBA/LRSN/TIME END RBA/LRSN/TIME DATE/LTIME DATA SET INFORMATION

---------------------- ---------------------- ---------- --------------------

0100000000004AD15000 01000000000076C34FFF 2014.027 DSN=DSNDKD.DKD1.ARCHLOG1.A0000060

010036AE8DFACFD06B83 0100E1DAF00F29564F83 15:20 VOL=160053 UNIT=CART

2013.256 17:51:35.3 2014.027 21:20:39.9

CATALOGUED

01000000000076C35000 01000000000085461FFF 2015.173 DSN=DSNDKD.DKD1.ARCHLOG1.A0000061

0100E1DAF00F29564F83 010364430991997CBD83 15:05 VOL=103942 UNIT=CART

2014.027 21:20:39.9 2015.173 20:04:48.0

CATALOGUED

01000000000085462000 010000000000B1381FFF 2015.218 DSN=DSNDKD.DKD1.ARCHLOG1.A0000062

010364430991997CBD83 01039D34B1A684494F83 22:04 VOL=717335 UNIT=CART

2015.173 20:04:48.0 2015.219 03:04:07.9
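Once the desired recovery time has been converted to GMT and translated into an RBA or LRSN from the log map, the point-in-time recovery itself is simple to state. A minimal sketch using base RECOVER syntax; the object name is hypothetical and the log point is illustrative only (it is copied from the snippet above and must come from your own DSNJU004 output or a log tool in practice):

  RECOVER TABLESPACE APPDB.TS00001
          TOLOGPOINT X'0100000000004AD15000'

Picking a value that corresponds to the wrong local/GMT conversion is exactly how this recovery initially landed at the wrong point.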


A DASD Controller Catches Fire

What planning could have made the recovery easier

• I/O Configuration Matters
– Always separate pairs of BSDS, ACTIVE LOG, and ARCHIVE LOG copies to different strings of DASD

– The log is sacrosanct in recovery

• HIPER and FLASH maintenance should be applied
– Recovery experience elongated due to missing PTFs identified as high priority maintenance

• Slow down and VERIFY before committing to a RECOVERY TO Point In Time
– A few extra minutes of think time could have saved hours of bad recovery time


Story 2


System Level Software corruption of I/O

The Event and Fallout

• Bug in System Level Software caused I/O errors… the I/O completed “correctly”, but incorrect pages were written to the physical pages
– Page x physically written with Page y

• Bad things happened of many varieties
– Many spaces went into GRECP or LPL; the START command was too slow to keep up

– RC00C90101 inconsistent data

– Other abends due to unexpected data

• Log was good… TS/IX pages were corrupted
– Recovery TO CURRENT is possible

– Worst scenario for daily full image copy paradigm – 23 hours of log


System Level Software corruption of I/O

Original Approach to Recovery / Problems Encountered

• Was unprepared for a wholesale recovery

• 400+ spaces involved, so generated JCL jobs, 1 per space, and submitted them
– Major contention occurred on the archive log files

– Tape contention and waiting for tape

– Also read the archive logs 400+ times, instead of submitting multiple recoveries in fewer parallel jobs to reduce the I/O of each individual job re-reading the archives

• This is when we got involved, after being asked:
– “Why are your recovers running so slowly?”


System Level Software corruption of I/O

Getting around the problems

• After the original attempt had been running for hours…

• Cancel all outstanding jobs

• Regenerate fewer jobs to do multiple recoveries
– Let jobs multi-task and read the log fewer times

– Stagger starts to avoid archive log tape contention

• The remaining, larger percentage of spaces recovered in a fraction of the time the successful jobs in the original scenario took for a small percentage of spaces
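The key change was letting a single recovery invocation process a list of objects, so the log is read once for the whole list rather than once per space. A minimal sketch in base RECOVER syntax, with hypothetical object names; PARALLEL limits how many image copies are restored concurrently within the job:

  RECOVER TABLESPACE APPDB.TS00001
          TABLESPACE APPDB.TS00002
          TABLESPACE APPDB.TS00003
          TABLESPACE APPDB.TS00004
          PARALLEL(4)

Several such jobs, each covering its own subset of the 400+ spaces with staggered start times, gives the multi-tasking and reduced archive tape contention described above.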


System Level Software corruption of I/O

What planning could have made the recovery easier

• Prepare for Recovery BEFORE recovery
– Have existing JCL if possible

– Otherwise, have a planned approach for recovery (e.g. this is how we generate optimized JCL during the event)

– Design your backup methodology to match your recovery strategy

• Take a breath before submitting the recoveries
– Verify the approach and analyze pitfalls BEFORE submitting. 15 minutes up front can save you hours in execution

• PRACTICE, PRACTICE, PRACTICE
– Learning how to recover should not occur during an emergency


Story 3


DASD Hardware / Data Movement corruption of I/O

The Event and Fallout

• Hardware Vendor promised ‘Transparent Movement’ of data from old hardware to new hardware while updates were occurring

• I/O was dropped during the process
– New inserts on pages not externalized / RIDs reused by subsequent inserts

– Deleted rows not externalized… index may not reflect table data

– Updates also dropped and rows re-updated missing previous data

• This was tablespace/index data. Log was not involved directly in the I/O errors

• But…


DASD Hardware / Data Movement corruption of I/O

LOG WAS CORRUPTED due to the dropped I/O
– Was not a physical corruption, but a logical one

– Duplicate inserts into the same RID

– Before and After images of concurrent updates not reflected correctly

– Who knows what else

• These were the roadblocks we encountered in our attempt at Recover and in our use of Log Tools

– Have I mentioned that the log is sacrosanct? But it wasn’t…


DASD Hardware / Data Movement corruption of I/O

Original Approach to Recovery / Problems Encountered

• Standard TO CURRENT Recovery attempted

• Some Recoveries failed due to lack of image copies as a starting point within the same time frame as available log

• Some Recoveries failed due to sanity checks
– Can’t have the same ROWID inserted twice in a page

– After Image of prior update does not match before image of subsequent update

• One Recovery utility did not error on this – led to RC00C90101


DASD Hardware / Data Movement corruption of I/O

Getting around the problems

• Mirrored environment had backups of hardware snaps available (due to data movement testing)

• Restored that environment to most recent physical dump

• Created Image Copies to use in Production

• Recovered to specified image copies
• This is supported in DB2 12 DSNUTILB

• Used a Log Tool to get transactions (generate SQL) from image copy point to point of corruption
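Recovering to a named image copy rather than to currency can be sketched with base RECOVER syntax as below; the object and image copy data set names are hypothetical, and the copy is assumed to be registered in SYSIBM.SYSCOPY. The object is then left at the copy point, which is why a log tool was needed to carry transactions forward to the point of corruption:

  RECOVER TABLESPACE APPDB.TS00001
          TOCOPY PROD.IMAGCOPY.APPDB.TS00001.D2016347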


DASD Hardware / Data Movement corruption of I/O

Getting around the problems

• Log Tool challenges
– Used available syntax to identify Image Copies created from Mirrors

• Image Copies used as source for completion of partially logged updates and compression dictionaries

– Duplicate Inserts caused no problems

– Missing Updates caused issues on 4 spaces out of 70+ implicated

• Used syntax to extract Inserts and Deletes

• Managed Updates separately

– Not quite a 100% recovery, but pretty close


DASD Hardware / Data Movement corruption of I/O

What planning could have made the recovery easier

• More Frequent Image Copies
– At a minimum, align image copy frequency to log availability

– Local copies would have prevented mirror shenanigans

• Delay in Recovery

• Log Tool implications… what if yours does not have the capability to specify alternative resources?

• Log Corruption
– Bad is Bad… I don’t see how planning could have avoided this in this scenario, short of taking downtime for the data movement


Dropped Object Recovery

The Event and Fallout

• Accidental DROP TABLESPACE

Original Approach

• Recreate TABLESPACE

• RECOVER TO INCOPY of last full image copy

• Used Log Tool to access transactions between image copy and DROP to generate SQL
– This is when we got involved…


Dropped Object Recovery

Problems Encountered

• Log Tools designed to process objects current in catalog

• Several have alternative mapping capability
• Also need to be aware of other SYSCOPY events like REORGs

• Image Copies needed for completion/compression no longer in catalog

Getting around the problems

• Had syntax available to get around the missing SYSCOPY issue

• Was tedious and slow compared to alternative approach


Dropped Object Recovery

What planning could have made the recovery easier

• Knowledge of the tools they had available
– Recovery product could have applied log to the point of drop

– Log Tool could have generated the complete drop recovery JCL

• Scan Log of time of drop

• Generate/Execute DDL to recreate object

• Generate/Execute Recovery to point of drop

• REBIND invalidated PACKAGES

• Why do things manually when you had automation?


Story 4


Oops Happens (SQL UNDO)

The Event and Fallout

• SQL Executed that corrupts data
– I have countless stories here…

– Bad Application updates

– Unqualified SPUFI or DSNTEP2 updates

– Malicious Intent

Original Approach to Recovery / Problems Encountered

• Used a Log Tool to generate UNDO SQL and execute

• Need to consider subsequent updates to rows being undone

• Works like a champ! (Well, most of the time)
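To make the idea concrete, here is a purely hypothetical illustration of what log-tool-generated UNDO SQL conceptually looks like; the table, column, and key names are invented, and real tools build the statements from the before-images captured on the log:

  -- The "oops": an unqualified update run from SPUFI (hypothetical)
  UPDATE APPSCHEMA.ACCOUNT SET STATUS = 'CLOSED';

  -- Conceptual UNDO SQL: one statement per affected row, in reverse log order,
  -- restoring the before-image value from the log
  UPDATE APPSCHEMA.ACCOUNT SET STATUS = 'ACTIVE'  WHERE ACCT_ID = 1001;
  UPDATE APPSCHEMA.ACCOUNT SET STATUS = 'PENDING' WHERE ACCT_ID = 1002;

The "subsequent updates" caveat above matters here: if another transaction changed the same rows after the oops, blindly applying the UNDO SQL would wipe out those good changes.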


Oops Happens (SQL UNDO)

What planning could have made the recovery easier

• If you don’t have a Log Tool… *really*, *really* consider one

• Transaction-level UNDO occurs while your objects remain online, preserving good changes

• I’m not selling. I don’t care which one you get.
– Truthfully, I don’t care if you DO get one…

• Alternatives
– Point in Time Recovery. Lose good updates as well as bad.

– Write your own fix-it program to reverse bad updates

– Intimate knowledge of DSN1LOGP and log formats to do the heavy lifting yourself (a sketch follows this list)

– There’s a new discussion on the DB2-L list about this 3-4 times a year
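For the DSN1LOGP route above, a minimal summary-report sketch follows. The load library and BSDS names are hypothetical (the BSDS prefix simply mirrors the earlier log map snippet), and the RBA range is illustrative; in practice you would bound it around the time of the bad SQL using DSNJU004 output:

  //LOGSUMM  EXEC PGM=DSN1LOGP
  //STEPLIB  DD DISP=SHR,DSN=DB2HLQ.SDSNLOAD        site-specific load library
  //SYSPRINT DD SYSOUT=*
  //SYSSUMRY DD SYSOUT=*
  //BSDS     DD DISP=SHR,DSN=DSNDKD.DKD1.BSDS01     hypothetical BSDS name
  //SYSIN    DD *
    RBASTART (4AD15000) RBAEND (76C34FFF)
    SUMMARY (ONLY)
  /*

SUMMARY(ONLY) lists the units of recovery in the range; turning that into UNDO SQL still means decoding the detail log records yourself, which is exactly the heavy lifting a Log Tool does for you.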


Story 5


Application Error Discovered after resources purged

The Event and Fallout

• Application change caused data to be incorrectly updated

• Fallout was letters sent to customers alerting to ‘badness’

• Discovered when an application team member received one of the errant letters more than a week after the problem

• Subsequent updates to many rows in the table


Application Error Discovered after resources purged

Original Approach to Recovery / Problems Encountered

• The Application team tried to address it independently for several days

• 2 weeks after the problem, the DBAs were contacted for a PIT Recovery

• Forward PIT Recovery failed
– Archive log was available for the time frame of the PIT

– But… Image Copies prior to the PIT point did not exist

• Backout PIT Recovery failed
– The REORG utility had run on most spaces between the PIT and CURRENT

– LOG NO utility = NO BACKOUT

– Even if LOG YES, only the new page formats are logged, not old data.


Application Error Discovered after resources purged

Getting around the problems

• Punt

• Company issued press release explaining error

• Company/Application Team had to manage the consequences


Application Error Discovered after resources purged

What planning could have made the recovery easier

• Get the DBAs involved earlier in the event.

• Tools available had ability to assess Recoverability

– MODIFY capability to keep all necessary resources for recovery for a specified number of days (a base-DB2 sketch follows this list)

– Separate Batch Program available to report on Recoverability

• Know and TAKE ADVANTAGE of the Tools you own

• They did not have a Log Tool

– Could have generated UNDO SQL for errant updates

– Some have ‘Integrity’ report to show subsequent updates to the rows implicated by the UNDO SQL
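In base DB2 terms, the two capabilities above correspond roughly to MODIFY RECOVERY retention and the REPORT RECOVERY utility. A minimal sketch, with a hypothetical object name and an assumed 35-day recovery window: the MODIFY statement removes SYSCOPY/SYSLGRNX entries older than the window (implicitly keeping what is needed to recover within it), and REPORT RECOVERY lists the copies, log ranges, and archive logs a recovery would need:

  MODIFY RECOVERY TABLESPACE APPDB.TS00001 DELETE AGE(35)

  REPORT RECOVERY TABLESPACE APPDB.TS00001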


Story 6


BSDS Corruption or Inconsistency
Data Sharing Enablement / Conditional Restart Problems

The Event and Fallout

• Have seen several scenarios

• Some caused by missing maintenance
– e.g. STCK TO LRSN DELTA not reset when Member 1 Cold Started or DESTROYED

– STCK TO LRSN DELTA not propagated to newly added members

• Some caused by incorrect DSNJU003 execution
– Bad Conditional Restart record creation

– Improper sequence of DEACTIVATE/DESTROY procedures

• Worst case example, DB2 was down for 2+ days
– EMEA IDUG 2009 two-part presentation on this event


BSDS Corruption or Inconsistency

Original Approach to Recovery / Problems Encountered

• Varied…

Getting around the problems

• Again, varied…

What planning could have made the recovery easier

• Avoidance!
– PRACTICE an out-of-the-ordinary event before doing it in production

– Slow down and confirm the process before executing in production

• Conditional Restart Point and Syntax for DSNJU003 correct?

• Take a copy of the BSDS BEFORE changing it as a fall back (a sketch of both steps follows)
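A minimal sketch of those last two points. The data set names borrow the DSNDKD.DKD1 prefix from the earlier log map snippet but are otherwise hypothetical, the backup data sets are assumed to be pre-allocated, and the ENDRBA value is purely illustrative; verify it against DSNJU004 output before any real run:

  //* Back up both BSDS copies BEFORE any change log inventory run
  //BSDSBKUP EXEC PGM=IDCAMS
  //SYSPRINT DD SYSOUT=*
  //SYSIN    DD *
    REPRO INDATASET(DSNDKD.DKD1.BSDS01) OUTDATASET(DSNDKD.DKD1.BSDS01.BACKUP)
    REPRO INDATASET(DSNDKD.DKD1.BSDS02) OUTDATASET(DSNDKD.DKD1.BSDS02.BACKUP)
  /*
  //* Create the conditional restart record with DSNJU003 (change log inventory)
  //CRCREATE EXEC PGM=DSNJU003
  //STEPLIB  DD DISP=SHR,DSN=DB2HLQ.SDSNLOAD        site-specific load library
  //SYSUT1   DD DISP=OLD,DSN=DSNDKD.DKD1.BSDS01
  //SYSUT2   DD DISP=OLD,DSN=DSNDKD.DKD1.BSDS02
  //SYSPRINT DD SYSOUT=*
  //SYSIN    DD *
    CRESTART CREATE,ENDRBA=000076C35000
  /*

Having the BSDS backups means a bad conditional restart record can be backed out by restoring the BSDS, rather than digging the subsystem out of a deeper hole.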


Story 7


Critical Test System Lost

The Event and Fallout

• DB2 System painstakingly maintained since Version 3.1

• Objects created in each version and migrated forward

• Used to test Old format DBDs / reproduce customer issues

• Backed up religiously by an individual in QA
– Job never scheduled, was a manual effort

– System restored more than once over the years

• Person left company, backups stop

• After several months, the system was broken after a weekend IPL
– Critical DSNDB06 and DSNDB01 spaces broken


Critical Test System Lost

Original Approach to Recovery / Problems Encountered

• Standard Recovery
– No image copy taken for several months

– Log only available for the last two weeks

• Looked for Volume Dumps by the Storage team
– Not available for ‘non-production’ systems

Getting around the problem

• Tried cold starting

• Tried initializing some spaces (e.g. SYSLGRNX) to empty

• Tried several harebrained attempts to fool mother nature

• Ultimately, could not revive the system


Critical Test System Lost

What planning could have made the recovery easier

• DON’T IPL OVER ACTIVE DB2 SYSTEMS

• Have backups of the ‘Critical’ Test System

• “Production” is not just the revenue-supporting systems
– Critical systems used by QA Regression

– Critical systems used by Application Development

• How much productivity is lost if this type system is down for an extended period?


Okay Kids,

Story Time is Over!


Summary

Bad Things Happen…
– And Mirroring duplicates Bad Things

– Always have a PLAN B in place for when hardware fails

• Maybe standard image copies should be Plan A…

There is no substitute for Planning and Preparation
– Your Recovery SLA should drive your backup strategy

– Your backup strategy should be designed to fit your Recovery SLA

– Consider Recovery SLAs for non-production systems

– I/O Configuration should be central to recovery planning

Know and understand the tools available in your shop
– Exploit their capabilities

– Involve your vendors early in a critical recovery event


Summary

BACKUP critical artifacts BEFORE starting a major recovery
– THINGS CAN AND OFTEN DO GET WORSE before they get better

– Having a path back to restart is prudent

Practice is the only path to Experience

–Practice

• Practice

–Practice

» Practice

• Practice

• …


Questions?

And thank you

[email protected]