Don’t Fall with the Fallen !!

Richard Dolewski

34 Riviera Drive, Markham, ON, L3R 5M1
Web: www.midrange.ca
E-mail: [email protected]
Phone: 1-800-668-6470



2

Definition of a Disaster

A sudden, unplanned event that causes great damage and loss to an organization.

The time factor determines whether an interruption in service is an inconvenience or a disaster. The time factor varies from organization to organization.

3

What is Disaster Recovery?

Reaction to a sudden, unplanned event that enables an organization to continue critical business functions until normal business operations resume.

“…It is not enough to arrange for hardware replacement;… planning must address continuation of business operations, or business continuation.”

4

What is a Disaster?

ANYTHING that stops your business from functioning and that cannot be corrected within an acceptable amount of time.

5

The Value of Systems Availability

Competitive value

Increased End-user & Business Productivity

Ongoing improvement in customer support & service

Positive Business Image

Reduced Outages = [symbol lost in extraction]

RPO vs. RTO

Do you know your Recovery Objectives?

7

THE GOAL

NO Business tolerance for DOWNTIME

Critical systems & Networks are continuously available

Business interruption measured in hours & minutes rather than DAYS.

Disaster Recovery

8

Protect Important Assets

Four Primary Assets needed to operate Information Systems:

Facilities

Hardware

Network

Data

9

Protect Important Assets

Hardware and Networks can be replaced

Facilities can be rebuilt or relocated

Data is Priceless !!!

10

A shared responsibility

Entire organization’s senior management

IT alone cannot determine which processes are critical

Disaster Recovery

11

How Much Data Can You Afford to Lose ?

Most IT shops depend on backups to protect their data

Orphan data considerations

Minimum 24 hrs of lost data

Application access may be more critical

12

Questions for Management

Include: Manufacturing, Finance, Purchasing, Sales, Warehousing.

Ask them about their business: what services do you provide them?

13

RTO, RPO and ROI

RTO: Recovery Time Objectives
How long can your system be down?

RPO: Recovery Point Objectives
How much data can you lose?

ROI: Return on Investment Goals
Planned versus unplanned
HA versus conventional hot site

[Diagram labels: RPO, RTO, ROI; scale: Days, Hours, Minutes]
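As a rough illustration of how the two objectives can be checked against a recovery setup, the sketch below compares a backup interval with an RPO and an estimated restore time with an RTO. The function names and the sample figures (a 24-hour backup cycle, a 4-hour RPO, an 8-hour RTO) are assumptions made for illustration, not values from this presentation.

```python
def worst_case_data_loss_hours(backup_interval_hours: float) -> float:
    """Worst case: the failure happens just before the next backup would run."""
    return backup_interval_hours


def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """True when the worst-case data loss stays within the Recovery Point Objective."""
    return worst_case_data_loss_hours(backup_interval_hours) <= rpo_hours


def meets_rto(estimated_restore_hours: float, rto_hours: float) -> bool:
    """True when the estimated restore time stays within the Recovery Time Objective."""
    return estimated_restore_hours <= rto_hours


# Hypothetical shop: one backup per day, business accepts 4 hours of lost data
# and 8 hours of downtime, and a full restore is estimated at 6 hours.
daily_backup_ok = meets_rpo(backup_interval_hours=24, rpo_hours=4)
restore_ok = meets_rto(estimated_restore_hours=6, rto_hours=8)
print("RPO met:", daily_backup_ok)
print("RTO met:", restore_ok)
```

In this made-up case the restore time fits the RTO but a once-a-day backup cannot satisfy a 4-hour RPO, which is the kind of gap the questions above are meant to surface.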

14

Planned vs. Unplanned Downtime

• Backup Window – Incremental Daily & Full System

• IBM & 3rd party Software Upgrades

• IBM & 3rd Party PTF/Fixes

• Application Maintenance ( Reorgs )

• Hardware Upgrades

Planned

15

Planned Downtime Score Card

Activity                            Per Week    Per Year
------------------------------------------------------------
Daily Backups (6 times per week)    6 hrs       312 hrs
Full Backups (weekly)               3 hrs       156 hrs
Software Installs                               20 hrs
Housekeeping                                    24 hrs
------------------------------------------------------------
Planned Outages                                 512 hrs or 21.33 days/year
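The score card arithmetic can be reproduced with a few lines of Python. This is a minimal sketch that assumes the reading of the figures in which daily backups take 6 hrs per week, the weekly full backup takes 3 hrs, and software installs and housekeeping are yearly totals of 20 hrs and 24 hrs; the variable and function names are illustrative.

```python
WEEKS_PER_YEAR = 52

# Hours of planned downtime that recur every week.
weekly_hours = {
    "daily backups (6 times per week)": 6,
    "full backup (weekly)": 3,
}

# Hours of planned downtime counted once per year.
yearly_hours = {
    "software installs": 20,
    "housekeeping": 24,
}


def planned_outage_hours_per_year(weekly: dict, yearly: dict) -> int:
    """Scale the weekly items to a year and add the yearly items."""
    return sum(h * WEEKS_PER_YEAR for h in weekly.values()) + sum(yearly.values())


total_hours = planned_outage_hours_per_year(weekly_hours, yearly_hours)
total_days = total_hours / 24
print(f"{total_hours} hrs or {total_days:.2f} days/year")
```

Changing any single entry shows how quickly a longer backup window adds up over 52 weeks.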

Disaster Recovery Planning

Plan yourself or get someone to do it:

But Whatever You Do, Plan!

17

Disaster Recovery Planning

Only 70% of today’s businesses have fully documented Disaster Recovery Plans.

Of these companies with plans,

45% NEVER test their plan

18

Common Issues

We all tend to let our guard down when times improve

As Planners we must always be ready & be prepared.

19

No Disaster Recovery Plan

• Guarantees:

• Confusion

• Lack of direction

• Conflict

• Lost customers

20

Cannot be approached casually

The Plan must be ....

Well organized

Action oriented

Comprehensive

Objective: Total restoration of Services in a timely manner

Disaster Recovery Planning

21

The Products of the Plan

Who will execute recovery actions

What is needed to continue, resume, recover or restore business functions

When business functions and operations must resume

Where to go to resume corporate, business & operational functions

How: detailed procedures for continuity, resumption, recovery or restoration

CLASSIC: WHO-WHAT-WHERE-WHEN-HOW

22

Common Issues

Has your plan kept up to date with your IT integrations?

Expectations of the Plan are unrealistic

I no longer have the staff

Implement DR into your Change Control Process.

23

Recovery Script Issues

This procedure will pre-determine your company’s course of action:

When do I inform management?

When do I put the hotsite on Alert or Declare?

What time - What actions ???

Who will execute them ???

Recovery Teams

Does your Recovery team have what it takes?

25

Perspective

Too often companies populate their DR Teams with raw, inexperienced staffers and the wrong solution

Volunteers to satisfy an auditor or, worse, the sacrificial lamb

26

The Right People

The Best Candidates for DR Teams:

Characterized as leaders

People that everyone goes to

Folks that understand Enterprise systems – know quickly the how and the ramifications

Understand the business

27

Characteristics of a good Team

Ideal Characteristics:
Considered an Expert by his/her peers
A go-to Person for anything and/or Everything
Works well under Pressure
Controls Emotions

Characteristics to Avoid:
Hands-off Individual (Avoids Work)
New to the Organization
Totally unfamiliar with the systems
Folds under Pressure
Hot Head

28

Characteristics of a good Team

Ideal Characteristics:
Confident
Trusted by Peers
Dedicated – A company person
Willing to fix problems created by others

Characteristics to Avoid:
Lacks sense of Urgency
Tendency to blame others
Excuses, Excuses, Excuses
Totally unfamiliar with the systems
Pure 9 – 5 er.
First one out the Door
Nowhere to be found

Clean up your House

Too much exposure is bad for your health:

Do a Vulnerability Assessment

30

Vulnerability Assessment

Objective is to drive down the duration of outages

A systematic approach towards:

Reducing the frequency of outages by eliminating all single points of failure.

Reducing duration of outages by configuring hardware & software for the fastest possible recovery.

31

Computer room environment
    Temperature
    Humidity
    Air Flow (From the plant?)

Electrical power supply

Key-lock switch

Data security

Tape and backup device maintenance

Vulnerability Assessment

32

Power Redundancy

33

Power Redundancy

UPS/Diesel Generator

Extends system operation without Hydro supplied power.

RS232 cable Interface & System Values

Off-line maintenance

34

RAID5/6 Disk Redundancy

Parity Information saved across multiple disks.

Advantages:
Lower cost than DASD mirroring.
System available during a Disk failure.
Customer responsibility to configure RAID sets.

Disadvantages:
Only protects on the DASD level, no upstream protection.
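To make the parity idea concrete, here is a minimal sketch of single-parity protection as used conceptually by RAID 5: the parity block is the byte-wise XOR of the data blocks, so any one lost block can be rebuilt from the survivors. The block contents and names are hypothetical, and the sketch ignores details of real arrays such as parity rotation across disks and the second, different parity calculation used by RAID 6.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical data blocks, one per data disk (all the same length).
data_blocks = [b"DISK", b"FAIL", b"SAFE"]

# Parity block, written to a further disk in the set.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk, then rebuild it from the survivors plus parity.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

The same property explains the disadvantage noted above: parity protects against a failed disk unit, not against a failure further upstream such as a controller or the whole site.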

35

Security

Will You Pass The Audit?

37

Lack of Security because…

Time pressures !!

Administrators wear too many hats.

Result: Poorly administered security schemes

Too much authority

Security software or products never installed or utilized

38

Types of Security

PHYSICAL SECURITY

SIGN-ON SECURITY

RESOURCE SECURITY

39

Unsecured copies of production data

• Developers need copies to test against

• The test Data is “real”

• Copies are often left unsecured on test servers

40

Default passwords

• Analyze use of Default passwords on your systems

• These are the first passwords a hacker will try

• Check Consultant & Suppliers passwords:

JDEINTASLL, JDEPROD, QPGMR, your Boss !!!

Change IBM Default passwords

41

Old User profiles

• Rather than being cleaned up, profiles often accumulate, even though staff have left the company

• Old profiles owning production Objects

42

TCP/IP applications

Many systems have TCP/IP servers started even when they are not used.

• Check autostart attribute of servers

    • These will start when STRTCP is run

• Check authority to STRTCPSVR

    • This starts all TCP servers regardless of autostart value

43

Biggest security exposure

Behind the firewall

Disgruntled employees

Accidental errors due to users having too much authority

No auditing

No way to determine if there really is a problem

44

No auditing

• If you don’t audit, you have no knowledge of what happened.

• May need to audit to meet regulations - PIPEDA

• Minimum recommendation:

• *SECURITY, *SAVRST, *AUTFAIL, *DELETE, *CREATE, *SERVICE

• Caution – don’t audit too much!
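A simple way to review an audit configuration against the minimum recommendation is a set comparison. The sketch below is illustrative only: it assumes the configured audit values have already been copied out of the system by hand into a Python list, and the sample list and function name are hypothetical.

```python
# Minimum recommendation taken from the slide above.
RECOMMENDED_MINIMUM = {
    "*SECURITY", "*SAVRST", "*AUTFAIL", "*DELETE", "*CREATE", "*SERVICE",
}


def missing_audit_values(configured):
    """Return the recommended values absent from the configured list, sorted."""
    normalized = {value.strip().upper() for value in configured}
    return sorted(RECOMMENDED_MINIMUM - normalized)


# Hypothetical configuration, entered by hand for this example.
configured = ["*AUTFAIL", "*security", "*CREATE"]
missing = missing_audit_values(configured)
print(missing)
```

Extra values in the configured list are deliberately not flagged here; whether they amount to auditing too much is a judgment call for the site.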

Backup & Recovery

I Backup therefore I can Recover

46

Your System just lost all of its data!

Are you worried?

47

Backup = Recovery

How many people back up their system?

Of these companies that perform regular backups…

• 51 % are incomplete

• 23 % ( iSeries/400 ) are unrecoverable

• 42 % ( Intel ) are unrecoverable

48

Availability problems IT is facing today !!

1. Backup window reduction
2. Scheduling a Planned outage
3. Recovery from disaster related outage events
4. Best Practices for Server Compliance
5. Recovery Solution Verification


49

Save/Restore Strategy

System saves must be reviewed

Ensure complete recovery is possible from a mid-week, mid-day or weekend failure.

Electronic notification of exceptions

Review Restoration Procedures

Backup software: BRMS

50

When was the last Full Save?

51

Last Save Info

The following are other data areas that the system uses to maintain last-save information for various other general save/restore commands:

Data Area Name    Purpose
QSAVALLUSR        Info for SAVLIB/RSTLIB LIB(*ALLUSR)
QSAVCFG           Info for SAVCFG/RSTCFG
QSAVIBM           Info for SAVLIB/RSTLIB LIB(*IBM)
QSAVLIBALL        Info for SAVLIB/RSTLIB LIB(*NONSYS)
QSAVSYS           Info for SAVSYS
QSAVUSRPRF        Info for SAVSYS/SAVSECDTA/RSTUSRPRF

52

Reliable Backups

Backups are the backbone to any recovery situation

In most recovery situations, the backups are not adequate

Excessive time is spent recreating parts of operating system

QUSRSYS not complete

53

BRMS Log

System not in restricted state, SAVSYS Processing completed with errors
Starting SAVDLO of folder *ANY to devices TAP01.
2574 document library objects saved.
Starting save of list *LINK to devices TAP01.
43917 objects saved. 342 not saved.
Save of list *LINK completed with errors.
Starting save of media information at level *OBJ to device TAP01.
18 objects saved from library QUSRBRM.
Save of BRM media information at level *OBJ complete.
DAILY *BKU 0070 *EXIT CALL PGM(BBSYSTEM/ENDDAYBU).
Control group DAILY type *BKU completed with errors.
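A log like this one is easy to skim past, so a small script that flags the warning signs can help with the daily review. The sketch below is an assumption-laden illustration: it treats the log as plain text with one message per line (a layout chosen here, not a documented BRMS export format), and it looks only for the two phrases that appear in the sample, "completed with errors" and "not saved".

```python
import re

# Sample messages copied from the log above, one per line (layout assumed).
LOG_TEXT = """\
System not in restricted state, SAVSYS Processing completed with errors
Starting SAVDLO of folder *ANY to devices TAP01.
2574 document library objects saved.
Starting save of list *LINK to devices TAP01.
43917 objects saved. 342 not saved.
Save of list *LINK completed with errors.
Starting save of media information at level *OBJ to device TAP01.
18 objects saved from library QUSRBRM.
Save of BRM media information at level *OBJ complete.
Control group DAILY type *BKU completed with errors.
"""


def summarize(log_text):
    """Return (lines reporting errors, total count of objects not saved)."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    error_lines = [line for line in lines if "completed with errors" in line.lower()]
    not_saved = sum(int(n) for n in re.findall(r"(\d+) not saved", log_text))
    return error_lines, not_saved


error_lines, not_saved = summarize(LOG_TEXT)
print(f"{len(error_lines)} step(s) completed with errors; {not_saved} object(s) not saved")
for line in error_lines:
    print("  ", line)
```

For this sample the summary shows that the system save, the *LINK save and the control group as a whole all finished with errors, which is exactly the kind of backup that later proves incomplete at recovery time.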

54

BRMS Maintenance

Recovery Analysis report
Recovery Volume Summary report
ASP Information report
Produce the Location Analysis report
Recovery reports by system

Send Recovery report offsite…. Daily

55

Tape Management

Ensure tapes are labeled or cataloged with unique volume IDs (BRMS/400, Robot Save)

Prevent overwriting tapes with Active data

Have at least 2 full system saves ( yes 2 )

Audit tapes for data integrity

Do NOT IGNORE tape drive problems - PRTERRLOG *VOLSTAT

56

Save Strategy

Monthly, a full system save (Option 21) is performed: SAVSYS, SAVLIB *NONSYS, SAVDLO, SAV.

Daily SAVLIB of all production libraries using Save-While-Active

IFS save is performed daily

Configuration and Security information is saved daily.

Tapes sent offsite daily.

57

High Availability

Best Practices

58

Continuous Availability

[Diagram labels: Production Partition, Backup Partition, High Availability Software]

59

Why is Continuous Availability not H/A?

It does eliminate the need for planned shutdown of production system

Objective is to allow users 24 hour access to the production system

Normally only key Production application libraries are mirrored in this approach

60

[Diagram labels: Users, Primary Node, Backup Node, Send & Receive, Staging Store, Match Merge]

Building Address: 222 Cross my Fingers Drive

61

Single point of failure

Same power grid

Same CO for communications

No alternative in a Building Disaster

Issues with Side by Side

62

H/A Common Findings

Mirrored System data integrity in serious question

Data inconsistencies beyond application.

Numerous application support model requirements missing or out of sync.

Little or NO documentation exists besides run book

Solution questioned by Management

63

Bridging the H/A Gap

H/A needs to be monitored 7/24 to ensure integrity

Special considerations must be given to designing a messaging model that is responsive to any mirroring condition.

Operations education will be required.

64

TESTING

Test because your Business Depends on it !

65

It’s not all bad news when your plan fails during a test…

…frequent testing identifies gaps in your recovery process

66

Testing

Passive Testing

Active Testing

Test because your business depends on it!

67

Passive Testing

Hands on Plan Review

Paper Walkthrough

Assessment of current Recovery Plan

Involve every member of your recovery team(s)

68

Active Testing

Validates the Recovery Plan in terms of:

1. Recovery Capability
2. Source system configuration
3. Network Recovery
4. Data Integrity
5. Identify Weakness in the Plan
6. Provides Training

This all equals success during a Disaster

69

Vital Records

Best Practices

70

Offsite Strategy

[Timeline diagram across successive days of the week (W T F S S M T W T F S S M T W T F S S). Labels: Full Backup, Incremental backup, Planned Offsite Backup, Daily Exposure, Disaster Strikes 10 AM, Operations Recovery, Possible Unprocessed Data]

71

Offsite Storage Considerations

The last full system save (multiple copies)
Software Installation Keys & Proof of Entitlement documents
LAN Server CD-ROM
Cisco Router Scripts
LAN full Backup and build items ( Database )
Recovery CDs
Electronic version of DRP - PRTSYSINF
LVT for LPAR-ed servers
HMC – Critical Console Data DVD

72

It's Eleven O’Clock.

Do You Know Where Your Data Is?

73

Recipe for Disaster

Ingredients:
One Average, everyday filing cabinet
All your important business documents
One Average business fire

74

Recipe for Disaster

Directions:
Place media in filing cabinet.
Bake in fire at approximately 800 °F for 20 min.
Let cool.
Open filing cabinet.

75

Recipe for Disaster

Your tapes and, for the most part, your business are toast!

76

Good Medicine

The doctor says it Tastes Awful &

It Works.

77

Prevent a Disaster

Review your Backup strategy !!!!

Develop a Recovery strategy First.

Review custom CL programs for application changes and for completeness

*UNLOAD

Review the logs !!!!

78

Position Your Software

Keep current software installed (V5R3/V5R4)

OS/400 upgrade paths require specific generations of hardware

Replacement hardware from IBM will dictate the release you require.

79

Prevent a Disaster

Install latest Backup/Recovery group PTFs

Know where your LIC CD is

Have you performed a FULL save since the last upgrade?

80

Logical Partitioning

Each partition needs to be backed up

There is NO Option 21 for the entire system that includes all partitions.

Logical Partition configuration maintained on Primary - not Saved… Document it !!

81

LPAR Experiences

Client: Operator in DST deletes a partition

Situation: Told nobody about his actions because all looked to be in order

Result: Partition went offline 1 month Later

Educate staff and use appropriate security measures!

82

Misconceptions

Fact: BEST EFFORT

“I am a very large IBM customer, a machine will be made available for me, do you know how much I spend each year?”

83

Make a commitment

Fortune Favors The Prepared.

84

MID-RANGE Technical Services
Richard Dolewski, CDRP
Disaster Recovery [email protected]. 1-800-668-6470

www.midrange.ca