Session 1783 – Part II
Achieving High Availability with WebSphere on z/OS - user experience
Presented by Elena Nanos
IBM Certified Advanced System Administrator - WebSphere Application Server ND V6.1
IBM Certified Solution Expert - CICS Web Enablement
IBM Certified System Specialist - WebSphere MQSeries
Email - [email protected]
Health Care Service Corporation WebSphere Engineering and Support services
Why WebSphere on z/OS?
WebSphere on z/OS has been selected as a preferred platform to support development and deployment of new Java mission-critical Applications for the following reasons:
z/OS Hardware, Software, Storage, and Network are all designed for maximum application availability
WebSphere on z/OS is designed to support very high transactional volume
WebSphere on z/OS provides the highest Quality of Service:
- Performance
- Scalability
- Recovery/failover capability
- High Availability
- Stability
- Manageability
- Maintainability
- Security/Integrity
By using WebSphere on z/OS you can minimize the number of physical tiers to get to backend data
Use of a single tier removes the network layer and the additional overhead associated with it
Tight integration with DB2, MQ, and CICS
Features and Technology Unique to z/OS
Server Architecture
- Control/Servant Region split
- Multiple Servant Region Workload Management
- Leverages Workload Manager (WLM); WLM/RMF integration
- Work is classified according to importance and performance goals
- Work is selected from the WLM queue and managed to goal
- Provides failover to available servants
- Automatic servant restart after an outage
- Automatic startup of additional servants, as needed, based on policies
WebSphere on z/OS Network Deployment clustering across z/OS LPARs
- Horizontal scaling for increased throughput
- Continuous availability and failover
MQ Queue Sharing using Shared Queues across LPARs and cross-memory (XM) communication for optimum performance
DB2 Data Sharing across LPARs
Sysplex Distributor - workload management and distribution across multiple systems
Coupling Facility - high-speed inter-system communication, used with MQ Queue Sharing and DB2 Data Sharing
Resource Recovery Services - required for two-phase commits
zSeries Application Assist Processor (zAAP) - specialty assist processor dedicated exclusively to execution of Java workloads under z/OS
Mainframe security
Implemented a very solid, scalable, high-availability WebSphere on z/OS infrastructure that satisfies data integrity, system performance, and system availability objectives.
Architected and established a 'best practice' WebSphere on z/OS implementation using a Network Deployment cluster configuration spanning LPARs, with proven failover capabilities.
This scalable design allows us to quickly adapt to new business requirements and growth.
Established excellent standards, naming conventions and procedures for building and supporting WebSphere on z/OS infrastructure.
Developed and exercised WebSphere on z/OS infrastructure failover and error recovery plan.
Automated startup and shutdown at IPL time, notification of various issues related to system availability, infrastructure and application health checks, monitoring commands, and deployment in non-Prod environments.
Infrastructure Design with Focus on High Availability
WebSphere on z/OS Failover and Recovery
Our WebSphere on z/OS infrastructure can handle the following outages:
WebSphere on z/OS servant, server, or node down on one LPAR – using MQ Queue Sharing, requests will automatically go to the WebSphere on z/OS server that is available on the other LPAR.
MQ down on one LPAR – we make use of Shared Queues, where one physical copy of the queue exists in the CF or DB2. If one MQ queue manager is unavailable, the WebSphere on z/OS server (on either side of the cluster) can get data from the shared queue via an available queue manager and can send the reply back to CICS, where the request was initiated.
LPAR down – if one of the two LPARs in the cluster is down, WebSphere on z/OS can continue processing without any manual intervention. With our current application design, requests can come from WebSphere or CICS on the LPAR that is up.
TCP/IP down on one LPAR – using MQ Queue Sharing, requests will automatically go to the WebSphere on z/OS server that is available on the other LPAR.
DB2 down on one LPAR – we make use of the JDBC Type 4 driver, and if one DB2 is down, requests continue processing.
HCSC WebSphere on z/OS Environments
Our WebSphere on z/OS infrastructure has been architected to support development, testing, and Production deployment in the following environments, which form the development path to Production:
- Unit Test
- Unit and String Integration Test
- Integrated Test Build
- System Integration Test
- User Acceptance
- Integrated Acceptance
- Load and Performance
- Production
Sample WebSphere on z/OS Cells Configuration (naming convention has been changed to protect our environment)
[Diagram: cell configuration across LPAR-A and LPAR-B. System Integration cell Q1SI (Deployment Manager Q1DMGR/Q1DMNOD, Q1DMDaemon on each LPAR, node agents Q1NAA/Q1NAB, nodes Q1NDA/Q1NDB, each with its own HFS) hosts clusters Q1SI1 (Appl X), Q1SI2 and Q1SI3 (Appl Y) with servers Q1SI1A/B, Q1SI2A/B, Q1SI3A/B, plus managed servers Q1M01A/Q1M02A (Appl X), Q1M03A (Appl Z), Q1M04A (Appl Y) and Q1M05B (Appl X), Q1M06B (Appl Z). User Acceptance cell Q2UA (Q2DMGR/Q2DMNOD, Q2DMDaemon, node agents Q2NAA/Q2NAB, nodes Q2NDA/Q2NDB) hosts clusters Q2UA1, Q2UA5, Q2UA6 (Appl X), Q2UA2, Q2UA4 (Appl Y), Q2UA3 (Appl Z) with servers Q2UA1A/B through Q2UA6A/B. Integrated Acceptance / Load & Performance cell Q3LP (Q3DMGR/Q3DMNOD, Q3DMDaemon, node agents Q3NAA/Q3NAB, nodes Q3NDA/Q3NDB) hosts Integrated Acceptance clusters Q3LP1 and Q3LP2 (Appl X) and Load & Performance clusters Q3LP3 (Appl X), Q3LP4 (Appl Z), Q3LP5 (Appl Y) with servers Q3LP1A/B through Q3LP5A/B. Servers connect to CICS routing regions A and B and AORs CICSAORA1-A5 and CICSAORB1-B5 via EXCI/CTG, to MQ queue managers QMA1-QMA3 and QMB1-QMB3 via MQ/JMS, and to DB2 members DB2ASYS and DB2BSYS via JDBC. An admin console connects to the Deployment Managers.]
Failover - Servant Outage
[Diagram: Q3LP5 cluster spanning LPAR A and LPAR B, with undispatched work redirected during a servant outage.]
The WebSphere on z/OS architecture consists of a clustered server per LPAR; each server on each side of the cluster consists of a controller region and several servant regions.
In Production we have up to 10 servants per LPAR (min=5, max=10). The server stays up during a servant outage. Workload Manager works very closely with WebSphere on z/OS: it detects a thread going down within the JVM and creates a new servant automatically.
This architecture spans the LPARs within the cluster, so there is automatic failover from one LPAR to another.
Minimizing the Effects of Timeouts
WebSphere timeouts are sometimes unavoidable, for example when a long-running query is executing or a network problem occurs. To avoid punishing "innocent bystanders" along with guilty requests, WebSphere on z/OS allows you to defer terminating a servant until its other in-flight requests have completed. You do this by setting the variable control_region_timeout_delay to the number of seconds that the server should wait after a timeout before abending the servant.
If the server_use_wlm_to_queue_work property is set to 0, then during the period specified by the control_region_timeout_delay property, work requests that were not yet dispatched and were queued without affinity to the terminating servant are requeued to another available servant after the servant termination process completes.
To minimize the effects of timeouts we have added the following WebSphere variables:
- server_use_wlm_to_queue_work set to 0 (default is 1)
- control_region_timeout_delay set to 5 seconds (default is 0)
For more details, please reference:
- Techdoc WP101233, "Configuration Options for Handling Application Dispatch Timeouts"
- WebSphere on z/OS Information Center, under "Application server custom properties that are unique for the z/OS platform"
- PK60264: Documentation clarification on request processing during CONTROL_REGION_TIMEOUT_DELAY
Setting Read Timeout on Client Calls
MDB timeouts should be avoided whenever possible; detect misbehaved threads within the application. Using this approach increases system availability and prevents servant restarts when a timeout occurs.
General recommendations when setting timeouts at the application level:
- The value set should be lower than the timeout value set at the WebSphere on z/OS controller
- Timeout values should be set to a minimum of 75% and a maximum of 200% of the expected average application backend response time
The example below is from http://forum.springframework.org/showthread.php?t=25577 and shows how to set a timeout in Axis client code via JaxRpcPortProxyFactoryBean:
import org.springframework.remoting.jaxrpc.JaxRpcPortProxyFactoryBean;
import javax.xml.rpc.Stub;

public class MyJaxRpcPortProxyFactoryBean extends JaxRpcPortProxyFactoryBean {

    private static final String TIMEOUT_PROPERTY_KEY = "axis.connection.timeout";

    protected void preparePortStub(Stub stub) {
        super.preparePortStub(stub);
        // Set the Axis connection timeout property on the generated port stub
        stub._setProperty(TIMEOUT_PROPERTY_KEY, new Integer(60));
        System.out.println("In the preparePortStub method");
    }
}
Setting the queryTimeout on JDBC calls
MDB timeouts can be caused by an SQL call that takes too long to complete due to:
- The database being unavailable
- Locks on data not being released in a timely manner, causing DB2 deadlocks or timeouts
- A poorly written, long-running query
- TCP/IP connectivity issues, when using the JDBC Type 4 driver
Setting queryTimeout on JDBC calls within application code can prevent MDB timeouts.
Below is an example of how to set queryTimeout using Spring JDBC Template -
<bean id="errorDao" parent="baseDaoProxyParent">
<property name="target">
<bean class="com.appl.integration.daoimpl.jdbc.ErrorDaoJdbcImpl"
parent="applBaseDaoJdbcParent">
<property name="sqlMap" ref="sqlMap"/>
<property name="queryTimeout" value=“60"/>
<property name="ignoreWarnings" value="true" />
</bean>
</property>
</bean>
<bean id="baseDaoProxyParent" class="org.springframework.aop.framework.ProxyFactoryBean" abstract="true"> …
Health Check Procedures
Proactive approach to detecting issues early and preventing problems whenever possible, to ensure high availability.
Automated Infrastructure Health Check across all environments, which reports on the following:
- Cell infrastructure status: Deployment Manager, Nodes
- Application Servers status
- MQ Listener ports status
- Application status
- WebSphere on z/OS HFS files status
Automated Application check procedure to verify the environment after any application change and to test the impact of system tuning changes (a minimal JMS sketch follows below):
- MQ connectivity and MDB functionality is tested
- JDBC calls are exercised
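As an illustration of the MQ connectivity portion of that check, the sketch below sends a small test message using the standard JMS 1.1 API. It is a hedged example: the JNDI names jms/TestQCF and jms/HealthCheckQueue and the message text are placeholders, not our actual resource names.

import javax.jms.Queue;
import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.jms.QueueSender;
import javax.jms.QueueSession;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;

public class MqHealthCheck {

    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext();
        // JNDI names below are illustrative placeholders
        QueueConnectionFactory qcf = (QueueConnectionFactory) ctx.lookup("jms/TestQCF");
        Queue queue = (Queue) ctx.lookup("jms/HealthCheckQueue");

        QueueConnection con = qcf.createQueueConnection();
        try {
            QueueSession session = con.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
            QueueSender sender = session.createSender(queue);
            TextMessage msg = session.createTextMessage("health-check ping");
            sender.send(msg);
            System.out.println("Test message sent - MQ connectivity OK; the MDB should consume it");
        } finally {
            con.close();
        }
    }
}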
Alerts Auto Notification
The following alerts are sent automatically via email to the WebSphere on z/OS support team:
- WebSphere on z/OS started task went down
- Heartbeat check of all STCs (up/down status)
- SVC dump is taken for any WebSphere on z/OS started task
- High CPU usage of any WebSphere on z/OS started task
- WebSphere on z/OS started task is down for over 10 minutes
- No WebSphere on z/OS HFS is mounted at the expected mount point
- 95%+ WebSphere on z/OS HFS space allocation (threshold can be altered as needed)
- WebSphere on z/OS connection to MQ terminated, usually due to MDB timeout
Understanding Native Storage Usage
You can only specify a limited amount for the User Region size (1,600 MB in this example), because of z/OS system storage allocation in 31-bit mode and shared memory used by other applications.
Note: Drastically higher limits can be set when running WebSphere in 64-bit mode.
The JVM size is allocated out of the User Region size, leaving less than 1 GB available in the Extended Local System Queue Area (ELSQA) to load the following:
- MQ, DB2 & CICS connectors storage
- Cached classes
- JITed code
- JNI objects
- Application classes copied by LE into the Native Heap
Each time your applications are stopped and restarted without restarting the server, the classes get reloaded.
Storage usage is also related to the volume and number of threads allocated to MQ, DB2 & CICS.
Depleting ELSQA storage will result in an 878-10 abend for the WebSphere server.
You need to ensure that enough virtual storage is left in ELSQA.
[Diagram: 31-bit servant address space layout (2 GB address space limit, 1,600 MB User Region in this example). Above the User Region: system control blocks in ELSQA and SWA (size depends on what is currently loaded in the system), ELSQA shared memory for system-shared DLLs (defined by the BPXPRMxx SHRLIBRGNSIZE setting), and free ELSQA storage; depleting ELSQA storage results in 878-10 abends for the WASz server. The LE Native Heap holds MQ, DB2 & CICS connector storage (size depends on the number of active threads and pool sizes), cached classes (size depends on cache settings, for example the cache size set on data sources managed by WASz), JITed code, JNI objects, application classes loaded by LE via ThreadLocal outside of WASz management, and classes loaded using thread pools managed by WASz. The JVM heap (616 MB in this example of WASz servant settings) is the container for the Java application - JAR, WAR & JSP files, application Java properties files and classes - and is GC-eligible.]
Memory Leak issue using ThreadLocals
The ThreadLocal class doesn't work well with thread pools in a J2EE environment.
We observed memory leaks in native storage caused by using ThreadLocal threads, which do not get cleaned up automatically in the MVS Native Heap. ThreadLocal does not interact well with thread pooling in WebSphere Application Server. Since there is no garbage collection in the MVS Native Heap, classes loaded by ThreadLocal threads can remain in storage after the application is stopped. This problem is compounded by the class loader, because ThreadLocal classes are reloaded each time the application is restarted without restarting the server.
Best coding practice recommendations
To avoid native storage leaking, which depletes ELSQA storage, you have the following options:
- Use thread pool threads, which are managed by WebSphere on z/OS
- Avoid the use of ThreadLocal threads
- Clear all ThreadLocals before returning control from an EJB or Servlet invocation (see the sketch below)
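A minimal sketch of the last option, clearing the ThreadLocal in a finally block before the pooled WebSphere thread is returned; the servlet and the CONTEXT holder are hypothetical names used only for illustration.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class RequestContextServlet extends HttpServlet {

    // Hypothetical per-request context held in a ThreadLocal
    private static final ThreadLocal CONTEXT = new ThreadLocal();

    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        try {
            CONTEXT.set(req.getRemoteUser());
            // ... application logic that reads CONTEXT.get() ...
            resp.getWriter().println("done");
        } finally {
            // Always clear the ThreadLocal before the pooled WebSphere thread is reused,
            // so no reference to application classes is left behind in native storage.
            CONTEXT.remove();
        }
    }
}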
Finding Java Threads outside of WebSphere on z/OS management
The Dump Analyzer can be used to find threads that are allocated outside of the WebSphere thread pool.
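Where a quick live check is acceptable instead of analyzing a dump, a rough approximation can also be obtained at runtime with java.lang.management; this is a different technique from the Dump Analyzer, offered only as an assumption-laden illustration. Thread naming varies by WebSphere release, so the "WebSphere" prefix filter below is only a placeholder.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadPoolAudit {

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.getAllThreadIds();
        ThreadInfo[] threads = mx.getThreadInfo(ids);
        for (int i = 0; i < threads.length; i++) {
            if (threads[i] == null) {
                continue;   // thread ended between the two calls
            }
            String name = threads[i].getThreadName();
            // "WebSphere" prefix is a placeholder; adjust to the container's actual pool naming
            if (!name.startsWith("WebSphere")) {
                System.out.println("Thread outside known pools: " + name);
            }
        }
    }
}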
Reference material
I have published an article in the February/March issue of z/Journal titled:
“Hidden Gems: Free IBM Tools to Help You Manage WebSphere on z/OS”
This article covers the following:
- Support Issues: Lessons Learned
- Memory Leak issue using ThreadLocals
- Best coding practice recommendations
- Clearing storage when ThreadLocal is used
- Finding Java Threads outside of WebSphere on z/OS management
- Debugging timeouts
- svcdump.jar utility
- Minimizing the effects of timeouts
- Setting timeouts at Application level
- Garbage Collection Policies with Java 5.0
- FFDC Logs
- Summary of Tools Available in IBM Support Assistant
- Tivoli Performance Viewer (TPV)
- z/OS Console commands
- WebSphere on z/OS V7.0 enhancements
Web link - http://zjournal.com/index.cfm?section=article&aid=1142
Questions