Technology offerings for mainframe batch environment
White Paper
Kumar Chunduru, Application Support
Confidentiality Statement
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or
distributed for profit or commercial advantage and that copies bear this notice. To
copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee.
Abstract
This white paper details the technology offerings for a mainframe batch
environment at one of the largest insurance companies in the UK, delivered as
part of problem management. The front shop is the group that handles
application recovery by working on incidents. The front shop faces certain
challenges that lengthen the recovery time for critical job failures and
recurrent failures. This white paper describes how these challenges are addressed.
About the Author
Kumar Chunduru holds a Master of Computer Applications degree. He has
worked in different technologies and domains in the IT industry, with
experience in Project Lifecycle Management, the Software Testing Life Cycle
and Application Support management.
About the Domain
Any event which is not part of the standard operation of a service and which
causes an interruption to, or a reduction in, the quality of that service is an
Incident. A problem is a condition often identified as a result of multiple Incidents
that exhibit common symptoms. Problems can also be identified from a single
significant Incident, indicative of a single error, for which the cause is unknown,
but for which the impact is significant. When incidents occur frequently, a
problem record is created to fix the underlying issue.
Application support (Problem management) is a continuous process. It
encompasses problem detection, documentation of the problem and its
resolution, identification and testing of the solution, problem closure, and
generation of statistical reports. The objective of problem management is to
resolve the root cause of incidents, minimizing the impact of problems caused by
errors in the IT infrastructure and preventing the recurrence of similar incidents.
Application support (problem management) helps the business in the following
ways: reports reach different business areas on time, money movement
transactions are processed on time, and accounting information is available on time.
CONTENTS
1. INTRODUCTION
2. AUTO-RECOVERY OF CRITICAL JOB FAILURES
3. PREVENTING CONTENTION FAILURES
4. PREVENTING SPACE FAILURES
5. PREVENTING ALERTS
6. CONCLUSION
7. ACKNOWLEDGEMENTS
8. APPENDIX A – REXX AND JCL USED FOR AUTO-RECOVERY OF FAILURES
1. Introduction
TCS provides application recovery for one of the largest insurance companies in
the UK. The recovery team works on low-, medium- and high-priority incidents.
The SLA (service level agreement) is 16 hours for low-priority incidents, 8
hours for medium-priority incidents and 4 hours for high-priority incidents.
Each type of incident has a business impact if it is not recovered on time.
New applications are being installed into the production environment, so the
number of incidents to recover is increasing, recurring failures are
increasing, and the time taken to resolve high-priority incidents is
increasing. A problem management team was newly introduced at TCS offshore to
address these issues.
This white paper details the technology offerings provided for the mainframe
batch environment: how a permanent solution is provided for various types of
recurrent failures (problem records), and how high-priority incidents are
recovered on time. The solutions for problem records are categorized as
auto-recovery, contention failures, space failures and alerts.
2. Auto-recovery of critical job failures
Automatic recovery is provided for certain critical job failures (high-priority
incidents), as explained below, by working on problem records.
A batch program in this context reads a dataset of records, performs some
processing for each input record (such as fetching data from DB2 tables or IMS
segments for the scheme number in the record) and writes the extracted data to
a new output dataset. Assume the job is critical. If an input scheme number is
not valid, the program is designed to fail, and the manual recovery action is
to remove the invalid scheme record from the input dataset and rerun the job.
This process can be automated as follows.
1. Take a backup of the exclude and input datasets into a GDG dataset. See
the next point for a description of the exclude dataset.
2. Use a dataset to store the invalid scheme numbers that need to be
excluded from processing; this is called the exclude dataset. It allows
processing to skip the invalid scheme numbers, and it ensures the failing
scheme numbers are notified to the required group only the first time they
fail. Once a failing scheme number is fixed by the concerned group, it can
be deleted from the exclude dataset. Appendix A provides the REXX utility
that excludes the failing scheme numbers from an input dataset.
3. Identify the new failing scheme number from the main program step as
follows: while processing the input dataset, copy each input record being
processed into a new dataset. When the program fails on a particular
record, the last record of the new dataset is the record that caused the
failure.
4. Using a REXX or COBOL program, add the failing scheme number to the
exclude dataset and the email dataset. Refer to Appendix A for the REXX
program that does this.
5. Rerun the job using auto-resubmit logic when an invalid scheme is found.
There are several methods of auto-resubmission; one is to use OPC cards in
the job where TWS (Tivoli Workload Scheduler) is installed to maintain the
jobs.
6. On successful completion of the job, notify the concerned group of the
failing cases if required. Appendix A provides the utility for this.
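The exclude-and-notify logic of steps 2 to 4 can be sketched as follows. This is a Python illustration only; the paper's actual utility is the REXX program in Appendix A, and the record layout, scheme-number positions and function names here are illustrative assumptions.

```python
def filter_excluded(input_records, exclude_schemes):
    """Return only the records whose scheme number is not on the exclude list."""
    kept = []
    for record in input_records:
        scheme = record[:8].strip()  # assumed: scheme number in columns 1-8
        if scheme not in exclude_schemes:
            kept.append(record)
    return kept

def record_new_failure(failed_scheme, exclude_schemes, email_records):
    """Add a newly failing scheme to the exclude list and queue a
    notification only the first time, matching the notify-once behaviour
    described in step 2."""
    if failed_scheme not in exclude_schemes:
        exclude_schemes.add(failed_scheme)
        email_records.append(failed_scheme)
```

On a rerun, the job would first pass the input through `filter_excluded`, so previously identified invalid schemes are skipped without manual intervention.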
3. Preventing contention failures
A permanent solution to prevent contention job failures was provided by working
on problem records, as described below.
Contention arises when two jobs attempt to access the same resource at the same
time, which may prevent one or all of the jobs from executing. This contention
usually results in deadlock failures of the jobs, and its overall effect is
poor response time in the online environment or slow execution time in the
batch environment.
The resource causing contention may be either a dataset or a table space. The
guidelines below help resolve these contention issues.
Dataset / DB2 resource contention
1. Identify all the jobs that ran at the time of contention. The failed
job's log may contain these details. For DB2-related jobs, the following
steps help identify the jobs and resources that caused the contention.
2. Check the spool (SDSF) for the D*MSTR address space content (example:
DA1MSTR) and identify the table space name that is in contention.
Sample Error Message:
DSNT376I PLAN=plan-name1 WITH CORRELATION-ID=correlation-id1
CONNECTION-ID=connection-id1 LUW-ID=luw-id1 THREAD-INFO=thread-information1
IS TIMED OUT. ONE HOLDER OF THE RESOURCE IS PLAN=plan-name2 WITH
CORRELATION-ID=correlation-id2 CONNECTION-ID=connection-id2 LUW-ID=luw-id2
THREAD-INFO=thread-information2 ON MEMBER member-name
The DSNT376I error message identifies the table space, program and job
involved in the contention.
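As an illustration, the two plan names in a DSNT376I message can be pulled out with a simple pattern match. This is only a sketch: the message text below is abbreviated with placeholder values, and the exact field layout varies by DB2 release.

```python
import re

# Abbreviated DSNT376I message with placeholder plan and correlation IDs.
MSG = ("DSNT376I PLAN=PLANA WITH CORRELATION-ID=JOBA IS TIMED OUT. "
       "ONE HOLDER OF THE RESOURCE IS PLAN=PLANB WITH CORRELATION-ID=JOBB")

def plans_in_contention(message):
    """Return (timed-out plan, holding plan) from a DSNT376I message."""
    plans = re.findall(r"PLAN=([^\s,]+)", message)
    return plans[0], plans[1]
```

Knowing both the waiting plan and the holding plan narrows the search for the jobs whose schedules need adjusting.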
3. Identify the jobs that use the resource that caused the contention.
4. If the failed job has no time dependency and no critical successors,
modify its schedule to avoid contention.
5. If the failed job is time dependent, check whether a dependency can be
created among the contending jobs. When creating such a dependency, ensure
there is no considerable delay in the failed job's execution.
6. If none of the above options resolves the contention, exclusive lock
access to the contended resource can be issued. However, granting such
exclusive access will have a large impact on batch performance if the
contended dataset is heavily used among batch jobs.
DB2 Resource Contention
Below are some guidelines specific to resolving table space (DB2 resource)
contention:
1. If a program uses a SELECT query to fetch content from the table space
using a cursor, verify whether the query has the FOR FETCH ONLY / FOR
READ ONLY option. If not, modify the query to add it to resolve the
lock contention. This is applicable only if the program contains
read-only SQL.
2. If both programs use SELECT queries with the FOR FETCH ONLY / FOR
READ ONLY option, modify the programs to include retry logic: on query
execution, check for SQLCODE -911 and, if found, retry the same SQL
statement. A reasonable retry count (say 5) is acceptable.
[Note: A query with the FOR FETCH ONLY / FOR READ ONLY option is
similar to an uncommitted read; the retrieved data may not be up to
date.]
3. Failed jobs can be modified to include resubmission logic on such
deadlock issues. This option also resolves table space contention.
4. If the contending programs use UPDATE queries, verify the failed job's
details via the job scheduler. If the failed job has no time dependency
and no critical successors, modify its schedule to avoid contention.
5. If one program uses an UPDATE query and the other a SELECT query,
modify the program that uses SELECT to include the WITH UR option in
its query. WITH UR helps resolve the lock issue, but SELECT with
uncommitted read is considered a "dirty read": it allows an application
to read while acquiring few locks, at the risk of reading uncommitted
data. UR isolation applies only to read-only operations: SELECT,
SELECT INTO, or FETCH from a read-only result table.
6. Contention may arise if a program performs frequent updates on the same
table space without committing. To resolve this, commit logic can be
added to the program.
7. A program with a very high commit frequency may also lead to
contention; in this case, the commit frequency can be reduced to
prevent it.
Things to consider: when commit logic is added to a program that has an
update or read cursor, use the WITH HOLD option to prevent the cursor from
being closed at commit. Also note that too high a commit frequency degrades
performance.
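The retry logic of guideline 2 can be sketched as follows. This is a Python illustration of the control flow only; in the actual COBOL programs the SQLCODE check follows each EXEC SQL statement, and `run_sql` here is a hypothetical callable standing in for the statement execution.

```python
import time

DEADLOCK_SQLCODE = -911  # SQLCODE DB2 reports when a deadlock or timeout rolls back
MAX_RETRIES = 5          # a bounded retry count, as suggested in guideline 2

def execute_with_retry(run_sql):
    """Run a SQL callable, retrying while it reports SQLCODE -911.

    run_sql returns (sqlcode, rows); any result other than -911 is
    returned to the caller immediately, success or not.
    """
    for attempt in range(1, MAX_RETRIES + 1):
        sqlcode, rows = run_sql()
        if sqlcode != DEADLOCK_SQLCODE:
            return sqlcode, rows        # success, or a different error to surface
        time.sleep(0.01 * attempt)      # brief backoff before retrying
    return DEADLOCK_SQLCODE, None       # still deadlocked after all retries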
4. Preventing space failures
A permanent solution to prevent space-related job failures was provided by
working on problem records, as explained below.
1. Calculate the number of records in the input dataset and increase the
primary and secondary quantities appropriately (when the AVGREC
parameter is used).
2. Calculate the dataset size in bytes as number of records * record
length, and increase the primary and secondary quantities appropriately
(when the TRK or CYL method is used). One track holds 56,664 bytes on
a 3390 disk and 47,476 bytes on a 3380 disk. One cylinder holds
849,960 bytes (15 tracks) on a 3390 disk and 712,140 bytes (15 tracks)
on a 3380 disk.
3. Sometimes the error message is 'DATA SET exists on maximum volumes'.
This can be resolved by increasing the space allocation so as to reduce
the number of extents per volume, or by increasing the number of
volumes allowed via the VOLUME/UNIT parameters.
4. The PDS directory must fit in the first extent of the data set. If the
primary quantity is too small for the directory, or if the system has
allocated the primary quantity over multiple extents and the first extent
is too small for the directory, then the allocation fails.
5. When the LIKE parameter is used, the following applies. Unless the
SPACE parameter is explicitly coded, the system determines the space to
allocate for the new data set by adding up the space allocated in the
first three extents of the model data set; the space allocated for the
new data set will therefore generally not match the space specified for
the model data set. A space failure for a dataset that has the LIKE
parameter but no explicit SPACE parameter can be resolved either by
increasing the space for the model data set or by specifying the SPACE
parameter (with the correct quantity) along with the LIKE parameter.
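The track and cylinder arithmetic in steps 1 and 2 can be sketched as follows. The device capacities are the figures quoted above; the result is a rough upper-bound estimate, since actual usage also depends on block size and inter-record gaps.

```python
import math

# Byte capacities per track quoted in step 2.
BYTES_PER_TRACK = {"3390": 56_664, "3380": 47_476}
TRACKS_PER_CYLINDER = 15

def tracks_needed(record_count, record_length, device="3390"):
    """Tracks required to hold record_count records of record_length bytes."""
    return math.ceil(record_count * record_length / BYTES_PER_TRACK[device])

def cylinders_needed(record_count, record_length, device="3390"):
    """Cylinders required, rounding the track count up to whole cylinders."""
    return math.ceil(tracks_needed(record_count, record_length, device)
                     / TRACKS_PER_CYLINDER)
```

For example, 1,000,000 records of 80 bytes (80,000,000 bytes) need 1,412 tracks, or 95 cylinders, on a 3390 device; the primary and secondary quantities can then be set with headroom above this estimate.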
5. Preventing alerts
An alert in this context is a type of incident that usually does not require
immediate action; alerts are also generated to provide warning messages. A
permanent solution to prevent alerts was provided by working on problem
records, as explained below.
1. Suppressed the alerts that require no action, using a filtering
mechanism. Filtering is a client-specific mechanism, implemented with a
tool, that allows alerts to be suppressed.
2. Adjusted job parameters where that suffices. For example, if a batch
job generates too many SYSOUT lines, an alert is raised; these alerts
are prevented using the OUTLIM parameter.
3. In certain cases, modified the programs to prevent alerts. For example,
if a batch job tries to insert the same row into a DB2 table more than
once, an alert is generated; these alerts are fixed by correcting the
related programs.
6. Conclusion
The graphs below show the benefits of problem management in graphical form.
The number of incidents reduced from month to month.
[Graph: count of incidents per month; y-axis scale 0 to 4,000.]
The SLA percentage increased from month to month.
[Graph: SLA % per month, months 1 to 24; y-axis scale 93 to 99.]
7. Acknowledgements
My sincere thanks to Emmanuel Vasanthakumar, who supported me in writing this
white paper.
8. Appendix A – REXX and JCL used for auto-recovery of failures