Technology offerings for mainframe batch environment
White Paper
Kumar Chunduru, Application Support
Confidentiality Statement
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or
distributed for profit or commercial advantage and that copies bear this notice. To
copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee.
Abstract
This white paper details the technology offerings for a mainframe batch
environment at one of the largest insurance companies in the UK, delivered as
part of problem management. The front shop is the group that handles
application recovery by working on incidents. The front shop faces certain
challenges that lengthen the recovery time for critical job failures and
recurrent failures. This white paper describes how these challenges are addressed.
About the Author
Kumar Chunduru holds a Master of Computer Applications degree. He has
worked in different technologies and domains in the IT industry, with
experience in Project Lifecycle Management, the Software Testing Life Cycle
and Application Support management.
About the Domain
Any event which is not part of the standard operation of a service and which
causes an interruption to, or a reduction in, the quality of that service is an
Incident. A problem is a condition often identified as a result of multiple Incidents
that exhibit common symptoms. Problems can also be identified from a single
significant Incident, indicative of a single error, for which the cause is unknown,
but for which the impact is significant. When incidents occur frequently, a
problem record is created to fix the underlying issue.
Application support (Problem management) is a continuous process. It
encompasses problem detection, documentation of the problem and its
resolution, identification and testing of the solution, problem closure, and
generation of statistical reports. The objective of problem management is to
resolve the root cause of incidents, minimizing the impact of problems caused by
errors in the IT infrastructure and preventing the recurrence of similar incidents.
Application support (problem management) helps the business in the following
ways: reports reach different business areas on time, money movement
transactions are processed on time, and accounting information is available on time.
CONTENTS
1. INTRODUCTION
2. AUTO-RECOVERY OF CRITICAL JOB FAILURES
3. PREVENTING CONTENTION FAILURES
4. PREVENTING SPACE FAILURES
5. PREVENTING ALERTS
6. CONCLUSION
7. ACKNOWLEDGEMENTS
8. APPENDIX A – REXX AND JCL USED FOR AUTO-RECOVERY OF FAILURES
1. Introduction
TCS provides application recovery for one of the largest insurance companies in
the UK. The recovery team works on low-, medium- and high-priority incidents.
The SLA (service level agreement) is 16 hours for low-priority incidents, 8
hours for medium-priority incidents and 4 hours for high-priority incidents.
Each type of incident has a business impact if it is not recovered on time.
New applications are being installed into the production environment, so the
number of incidents to recover is increasing, recurring failures are
increasing, and the time taken to resolve high-priority incidents is
increasing. A problem management team was newly introduced at TCS offshore to
address these issues.
This white paper details the technology offerings provided for the mainframe
batch environment: how a permanent solution is provided for various types of
recurrent failures (problem records), and how high-priority incidents are
recovered on time. The solutions for problem records are categorized as
auto-recovery, contention failures, space failures and alerts.
2. Auto-recovery of critical job failures
Automatic recovery is provided for certain critical job failures (high-priority
incidents), as explained below, by working on problem records.
A batch program in this context reads a dataset of records, performs some
processing for each input record (such as fetching data from DB2 tables or IMS
segments for the scheme number in the record) and writes the extracted data to
a new output dataset. Assume the job is critical. If an input scheme number is
not valid, the program is designed to fail, and the manual recovery action is
to remove the invalid scheme record from the input dataset and rerun the job.
This process can be automated as follows.
1. Take a backup of the exclude and input datasets into a GDG dataset. See
the next point for a description of the exclude dataset.
2. Use a dataset to store the invalid scheme numbers that need to be
excluded from processing; this is called the exclude dataset. It allows
processing to skip the invalid scheme numbers, and it ensures the failing
scheme numbers are notified to the required group only the first time they
fail. Once a failing scheme number is fixed by the concerned group, it can
be deleted from the exclude dataset. Appendix A provides the REXX utility
that excludes the failing scheme numbers from an input dataset.
3. Identify the new failing scheme number from the main program step as
follows: while processing the input dataset, copy each input record being
processed into a new dataset. When the program fails on a particular
record, the last record of the new dataset is the record that caused the
failure.
4. Using a REXX or COBOL program, add the failing scheme number to the
exclude dataset and the email dataset. Refer to Appendix A for the REXX
program that does this.
5. Rerun the job using auto-resubmit logic when an invalid scheme is found.
There are several methods of auto-resubmission; one is to use OPC cards in
the job where TWS (Tivoli Workload Scheduler) is installed to maintain the
jobs.
6. On successful completion of the job, notify the concerned group of the
failing cases if required. Appendix A provides the utility for this.
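The exclude-and-notify logic of steps 2 to 4 can be sketched as follows. This is a Python illustration only; the paper's actual utility is the REXX program in Appendix A, and the record layout, scheme-number positions and function names here are illustrative assumptions.

```python
def filter_excluded(input_records, exclude_schemes):
    """Return only the records whose scheme number is not on the exclude list."""
    kept = []
    for record in input_records:
        scheme = record[:8].strip()  # assumed: scheme number in columns 1-8
        if scheme not in exclude_schemes:
            kept.append(record)
    return kept

def record_new_failure(failed_scheme, exclude_schemes, email_records):
    """Add a newly failing scheme to the exclude list and queue a
    notification only the first time, matching the notify-once behaviour
    described in step 2."""
    if failed_scheme not in exclude_schemes:
        exclude_schemes.add(failed_scheme)
        email_records.append(failed_scheme)
```

On a rerun, the job would first pass the input through `filter_excluded`, so previously identified invalid schemes are skipped without manual intervention.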
3. Preventing contention failures
A permanent solution to prevent contention job failures was provided by working
on problem records, as described below.
Contention arises when two jobs attempt to access the same resource at the same
time, which may prevent one or all of the jobs from executing. This contention
usually results in deadlock failures of the jobs, and its overall effect is
poor response time in the online environment or slow execution time in the
batch environment.
The resource causing contention may be either a dataset or a table space. The
guidelines below help resolve these contention issues.
Dataset / DB2 resource contention
1. Identify all the jobs that ran at the time of contention. The failed
job's log may contain these details. For DB2-related jobs, the following
steps help identify the jobs and resources that caused the contention.
2. Check the spool (SDSF) for the D*MSTR address space content (example:
DA1MSTR) and identify the table space name that is in contention.
Sample Error Message:
DSNT376I PLAN=plan-name1 WITH CORRELATION-ID=correlation-id1
CONNECTION-ID=connection-id1 LUW-ID=luw-id1 THREAD-INFO=thread-information1
IS TIMED OUT. ONE HOLDER OF THE RESOURCE IS PLAN=plan-name2 WITH
CORRELATION-ID=correlation-id2 CONNECTION-ID=connection-id2 LUW-ID=luw-id2
THREAD-INFO=thread-information2 ON MEMBER member-name
The DSNT376I error message identifies the table space, program and job
involved in the contention.
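As an illustration, the two plan names in a DSNT376I message can be pulled out with a simple pattern match. This is only a sketch: the message text below is abbreviated with placeholder values, and the exact field layout varies by DB2 release.

```python
import re

# Abbreviated DSNT376I message with placeholder plan and correlation IDs.
MSG = ("DSNT376I PLAN=PLANA WITH CORRELATION-ID=JOBA IS TIMED OUT. "
       "ONE HOLDER OF THE RESOURCE IS PLAN=PLANB WITH CORRELATION-ID=JOBB")

def plans_in_contention(message):
    """Return (timed-out plan, holding plan) from a DSNT376I message."""
    plans = re.findall(r"PLAN=([^\s,]+)", message)
    return plans[0], plans[1]
```

Knowing both the waiting plan and the holding plan narrows the search for the jobs whose schedules need adjusting.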
3. Identify the jobs that use the resource that caused the contention.
4. If the failed job has no time dependency and no critical successors,
modify its schedule to avoid contention.
5. If the failed job is time dependent, check whether a dependency can be
created among the contending jobs. When creating such a dependency, ensure
there is no considerable delay in the failed job's execution.
6. If none of the above options resolves the contention, exclusive lock
access to the contended resource can be issued. However, granting such
exclusive access will have a large impact on batch performance if the
contended dataset is heavily used among batch jobs.
DB2 Resource Contention
Below are some guidelines specific to resolving table space (DB2 resource)
contention:
1. If a program uses a SELECT query to fetch content from the table space
using a cursor, verify whether the query has the FOR FETCH ONLY / FOR
READ ONLY option. If not, modify the query to add it to resolve the
lock contention. This is applicable only if the program contains
read-only SQL.
2. If both programs use SELECT queries with the FOR FETCH ONLY / FOR
READ ONLY option, modify the programs to include retry logic: on query
execution, check for SQLCODE -911 and, if found, retry the same SQL
statement. A reasonable retry count (say 5) is acceptable.
[Note: A query with the FOR FETCH ONLY / FOR READ ONLY option is
similar to an uncommitted read; the retrieved data may not be up to
date.]
3. Failed jobs can be modified to include resubmission logic on such
deadlock issues. This option also resolves table space contention.
4. If the contending programs use UPDATE queries, verify the failed job's
details via the job scheduler. If the failed job has no time dependency
and no critical successors, modify its schedule to avoid contention.
5. If one program uses an UPDATE query and the other a SELECT query,
modify the program that uses SELECT to include the WITH UR option in
its query. WITH UR helps resolve the lock issue, but SELECT with
uncommitted read is considered a "dirty read": it allows an application
to read while acquiring few locks, at the risk of reading uncommitted
data. UR isolation applies only to read-only operations: SELECT,
SELECT INTO, or FETCH from a read-only result table.
6. Contention may arise if a program performs frequent updates on the same
table space without committing. To resolve this, commit logic can be
added to the program.
7. A program with a very high commit frequency may also lead to
contention; in this case, the commit frequency can be reduced to
prevent it.
Things to consider: when commit logic is added to a program that has an
update or read cursor, use the WITH HOLD option to prevent the cursor from
being closed at commit. Also note that too high a commit frequency degrades
performance.
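The retry logic of guideline 2 can be sketched as follows. This is a Python illustration of the control flow only; in the actual COBOL programs the SQLCODE check follows each EXEC SQL statement, and `run_sql` here is a hypothetical callable standing in for the statement execution.

```python
import time

DEADLOCK_SQLCODE = -911  # SQLCODE DB2 reports when a deadlock or timeout rolls back
MAX_RETRIES = 5          # a bounded retry count, as suggested in guideline 2

def execute_with_retry(run_sql):
    """Run a SQL callable, retrying while it reports SQLCODE -911.

    run_sql returns (sqlcode, rows); any result other than -911 is
    returned to the caller immediately, success or not.
    """
    for attempt in range(1, MAX_RETRIES + 1):
        sqlcode, rows = run_sql()
        if sqlcode != DEADLOCK_SQLCODE:
            return sqlcode, rows        # success, or a different error to surface
        time.sleep(0.01 * attempt)      # brief backoff before retrying
    return DEADLOCK_SQLCODE, None       # still deadlocked after all retries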
4. Preventing space failures
A permanent solution to prevent space-related job failures was provided by
working on problem records, as explained below.
1. Calculate the number of records in the input dataset and increase the
primary and secondary quantities appropriately (when the AVGREC
parameter is used).
2. Calculate the dataset size in bytes as number of records * record
length, and increase the primary and secondary quantities appropriately
(when the TRK or CYL method is used). One track holds 56,664 bytes on
a 3390 disk and 47,476 bytes on a 3380 disk. One cylinder holds
849,960 bytes (15 tracks) on a 3390 disk and 712,140 bytes (15 tracks)
on a 3380 disk.
3. Sometimes the error message is 'DATA SET exists on maximum volumes'.
This can be resolved by increasing the space allocation so as to reduce
the number of extents per volume, or by increasing the number of
volumes allowed via the VOLUME/UNIT parameters.
4. The PDS directory must fit in the first extent of the data set. If the
primary quantity is too small for the directory, or if the system has
allocated the primary quantity over multiple extents and the first extent
is too small for the directory, then the allocation fails.
5. When the LIKE parameter is used, the following applies. Unless the
SPACE parameter is explicitly coded, the system determines the space to
allocate for the new data set by adding up the space allocated in the
first three extents of the model data set; the space allocated for the
new data set will therefore generally not match the space specified for
the model data set. A space failure for a dataset that has the LIKE
parameter but no explicit SPACE parameter can be resolved either by
increasing the space for the model data set or by specifying the SPACE
parameter (with the correct quantity) along with the LIKE parameter.
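The track and cylinder arithmetic in steps 1 and 2 can be sketched as follows. The device capacities are the figures quoted above; the result is a rough upper-bound estimate, since actual usage also depends on block size and inter-record gaps.

```python
import math

# Byte capacities per track quoted in step 2.
BYTES_PER_TRACK = {"3390": 56_664, "3380": 47_476}
TRACKS_PER_CYLINDER = 15

def tracks_needed(record_count, record_length, device="3390"):
    """Tracks required to hold record_count records of record_length bytes."""
    return math.ceil(record_count * record_length / BYTES_PER_TRACK[device])

def cylinders_needed(record_count, record_length, device="3390"):
    """Cylinders required, rounding the track count up to whole cylinders."""
    return math.ceil(tracks_needed(record_count, record_length, device)
                     / TRACKS_PER_CYLINDER)
```

For example, 1,000,000 records of 80 bytes (80,000,000 bytes) need 1,412 tracks, or 95 cylinders, on a 3390 device; the primary and secondary quantities can then be set with headroom above this estimate.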
5. Preventing alerts
An alert in this context is a type of incident that usually does not require
immediate action; alerts are also generated to provide warning messages. A
permanent solution to prevent alerts was provided by working on problem
records, as explained below.
1. Suppressed the alerts that require no action, using a filtering
mechanism. Filtering is a client-specific mechanism, implemented with a
tool, that allows alerts to be suppressed.
2. Adjusted job parameters where that suffices. For example, if a batch
job generates too many SYSOUT lines, an alert is raised; these alerts
are prevented using the OUTLIM parameter.
3. In certain cases, modified the programs to prevent alerts. For example,
if a batch job tries to insert the same row into a DB2 table more than
once, an alert is generated; these alerts are fixed by correcting the
related programs.
6. Conclusion
The graphs below show the benefits of problem management in graphical form.
The number of incidents reduced from month to month.
[Graph: count of incidents per month; y-axis scale 0 to 4,000.]
The SLA percentage increased from month to month.
[Graph: SLA % per month, months 1 to 24; y-axis scale 93 to 99.]
7. Acknowledgements
My sincere thanks to Emmanuel Vasanthakumar, who supported me in writing this
white paper.
8. Appendix A – REXX and JCL used for auto-recovery of failures