Session 1783 – Part II
Achieving High Availability with WebSphere on z/OS - user experience
Presented by Elena Nanos
IBM Certified Advanced System Administrator - WebSphere Application Server ND V6.1
IBM Certified Solution Expert - CICS Web Enablement
IBM Certified System Specialist - WebSphere MQSeries
Email - [email protected]
Health Care Service Corporation WebSphere Engineering and Support services
Why WebSphere on z/OS?
WebSphere on z/OS has been selected as a preferred platform to support development and deployment of new Java mission-critical Applications for the following reasons:
z/OS Hardware, Software, Storage, and Network are all designed for maximum application availability
WebSphere on z/OS is designed to support very high transactional volume
WebSphere on z/OS provides the highest Quality of Service:
- Performance
- Scalability
- Recovery/failover capability
- High Availability
- Stability
- Manageability
- Maintainability
- Security/Integrity
By using WebSphere on z/OS you can minimize the number of physical tiers to get to backend data
Use of a single tier removes the network layer and the additional overhead associated with it
Tight integration with DB2, MQ, and CICS
Features and Technology Unique to z/OS
Server Architecture
- Control/Servant Region split
- Multiple Servant Region Workload Management
- Leverages Workload Manager (WLM); WLM/RMF integration
- Work is classified according to importance and performance goals
- Work is selected from the WLM queue and managed to goal
- Provides failover to available servants
- Automatic servant restart after an outage
- Automatic startup of additional servants, as needed, based on policies
WebSphere on z/OS Network Deployment clustering across z/OS LPARs
- Horizontal scaling for increased throughput
- Continuous availability and failover
MQ Queue Sharing using Shared Queues across LPARs and cross-memory (XM) communication for optimum performance
DB2 Data Sharing across LPARs
Sysplex Distributor - workload management and distribution across multiple systems
Coupling Facility - high-speed inter-system communication, used with MQ Queue Sharing and DB2 Data Sharing
Resource Recovery Services - required for two-phase commits
zSeries Application Assist Processor (zAAP) - specialty assist processor dedicated exclusively to execution of Java workloads under z/OS
Mainframe security
Implemented a very solid, scalable, high-availability WebSphere on z/OS infrastructure that satisfies data integrity, system performance, and system availability objectives.
Architected and established a 'best practice' WebSphere on z/OS implementation using a Network Deployment cluster configuration spanning LPARs, with proven failover capabilities.
This scalable design allows us to quickly adapt to new business requirements and growth.
Established excellent standards, naming conventions and procedures for building and supporting WebSphere on z/OS infrastructure.
Developed and exercised WebSphere on z/OS infrastructure failover and error recovery plan.
Automated startup and shutdown at IPL time, notification of various issues related to system availability, infrastructure and application health checks, monitoring commands, and deployment in non-Prod environments.
Infrastructure Design with Focus on High Availability
WebSphere on z/OS Failover and Recovery
Our WebSphere on z/OS infrastructure can handle the following outages:
WebSphere on z/OS servant, server, or node down on one LPAR – using MQ Queue Sharing, requests will automatically go to the WebSphere on z/OS server that is available on the other LPAR.
MQ down on one LPAR – we make use of Shared Queues, where one physical copy of the queue exists in the CF or DB2. If one MQ queue manager is unavailable, the WebSphere on z/OS server (on either side of the cluster) can get data from the shared queue via an available queue manager and can send the reply back to CICS, where the request was initiated.
LPAR down – if one of the two LPARs in the cluster is down, WebSphere on z/OS can continue processing without any manual intervention. With our current application design, requests can come from WebSphere or CICS on the LPAR that is up.
TCP/IP down on one LPAR – using MQ Queue Sharing, requests will automatically go to the WebSphere on z/OS server that is available on the other LPAR.
DB2 down on one LPAR – we make use of the JDBC Type 4 driver, and if one DB2 is down, requests continue processing.
HCSC WebSphere on z/OS Environments
Our WebSphere on z/OS infrastructure has been architected to support development, testing, and Production deployment in the following environments, which form the development path to Production:
- Unit Test
- Unit and String Integration Test
- Integrated Test Build
- System Integration Test
- User Acceptance
- Integrated Acceptance
- Load and Performance
- Production
Sample WebSphere on z/OS Cells Configuration (naming convention has been changed to protect our environment)
[Diagram: cell configuration across LPAR-A and LPAR-B. System Integration cell Q1SI (Deployment Manager Q1DMGR/Q1DMNOD, Q1DMDaemon on each LPAR, node agents Q1NAA/Q1NAB, nodes Q1NDA/Q1NDB, each with its own HFS) hosts clusters Q1SI1 (Appl X), Q1SI2 and Q1SI3 (Appl Y) with servers Q1SI1A/B, Q1SI2A/B, Q1SI3A/B, plus managed servers Q1M01A/Q1M02A (Appl X), Q1M03A (Appl Z), Q1M04A (Appl Y) and Q1M05B (Appl X), Q1M06B (Appl Z). User Acceptance cell Q2UA (Q2DMGR/Q2DMNOD, Q2DMDaemon, node agents Q2NAA/Q2NAB, nodes Q2NDA/Q2NDB) hosts clusters Q2UA1, Q2UA5, Q2UA6 (Appl X), Q2UA2, Q2UA4 (Appl Y), Q2UA3 (Appl Z) with servers Q2UA1A/B through Q2UA6A/B. Integrated Acceptance / Load & Performance cell Q3LP (Q3DMGR/Q3DMNOD, Q3DMDaemon, node agents Q3NAA/Q3NAB, nodes Q3NDA/Q3NDB) hosts Integrated Acceptance clusters Q3LP1 and Q3LP2 (Appl X) and Load & Performance clusters Q3LP3 (Appl X), Q3LP4 (Appl Z), Q3LP5 (Appl Y) with servers Q3LP1A/B through Q3LP5A/B. Servers connect to CICS routing regions A and B and AORs CICSAORA1-A5 and CICSAORB1-B5 via EXCI/CTG, to MQ queue managers QMA1-QMA3 and QMB1-QMB3 via MQ/JMS, and to DB2 members DB2ASYS and DB2BSYS via JDBC. An admin console connects to the Deployment Managers.]
Failover - Servant Outage
[Diagram: Q3LP5 cluster spanning LPAR A and LPAR B, with undispatched work redirected during a servant outage.]
The WebSphere on z/OS architecture consists of a clustered server per LPAR; each server on each side of the cluster consists of a controller region and several servant regions.
In Production we have up to 10 servants per LPAR (min=5, max=10). The server stays up during a servant outage. Workload Manager works very closely with WebSphere on z/OS: it detects a thread going down within the JVM and creates a new servant automatically.
This architecture spans the LPARs within the cluster, so there is automatic failover from one LPAR to another.
Minimizing the Effects of Timeouts
WebSphere timeouts are sometimes unavoidable, for example when a long-running query is executing or a network problem occurs. To avoid punishing "innocent bystanders" along with guilty requests, WebSphere on z/OS allows you to defer terminating a servant until its other in-flight requests have completed. You do this by setting the variable control_region_timeout_delay to the number of seconds that the server should wait after a timeout before abending the servant.
If the server_use_wlm_to_queue_work property is set to 0, then during the period specified by the control_region_timeout_delay property, work requests that were not yet dispatched and were queued without affinity to the terminating servant are requeued to another available servant after the servant termination process completes.
To minimize the effects of timeouts we have added the following WebSphere variables:
- server_use_wlm_to_queue_work set to 0 (default is 1)
- control_region_timeout_delay set to 5 seconds (default is 0)
For more details, please reference:
- Techdoc WP101233, "Configuration Options for Handling Application Dispatch Timeouts"
- WebSphere on z/OS Information Center, under "Application server custom properties that are unique for the z/OS platform"
- PK60264: Documentation clarification on request processing during CONTROL_REGION_TIMEOUT_DELAY
Setting Read Timeout on Client Calls
MDB timeouts should be avoided whenever possible; detect misbehaved threads within the application. Using this approach increases system availability and prevents servant restarts when a timeout occurs.
General recommendations when setting timeouts at the application level:
- The value set should be lower than the timeout value set at the WebSphere on z/OS controller
- Timeout values should be set to a minimum of 75% and a maximum of 200% of the expected average application backend response time
The example below is from http://forum.springframework.org/showthread.php?t=25577 and shows how to set a timeout in Axis client code via JaxRpcPortProxyFactoryBean:
import org.springframework.remoting.jaxrpc.JaxRpcPortProxyFactoryBean;
import javax.xml.rpc.Stub;

public class MyJaxRpcPortProxyFactoryBean extends JaxRpcPortProxyFactoryBean {

    private static final String TIMEOUT_PROPERTY_KEY = "axis.connection.timeout";

    protected void preparePortStub(Stub stub) {
        super.preparePortStub(stub);
        // Set the Axis connection timeout property on the generated port stub
        stub._setProperty(TIMEOUT_PROPERTY_KEY, new Integer(60));
        System.out.println("In the preparePortStub method");
    }
}
Setting the queryTimeout on JDBC calls
MDB timeouts can be caused by an SQL call that takes too long to complete due to:
- The database being unavailable
- Locks on data not being released in a timely manner, causing DB2 deadlocks or timeouts
- A poorly written, long-running query
- TCP/IP connectivity issues, when using the JDBC Type 4 driver
Setting queryTimeout on JDBC calls within application code can prevent MDB timeouts.
Below is an example of how to set queryTimeout using Spring JDBC Template -
<bean id="errorDao" parent="baseDaoProxyParent">
<property name="target">
<bean class="com.appl.integration.daoimpl.jdbc.ErrorDaoJdbcImpl"
parent="applBaseDaoJdbcParent">
<property name="sqlMap" ref="sqlMap"/>
<property name="queryTimeout" value=“60"/>
<property name="ignoreWarnings" value="true" />
</bean>
</property>
</bean>
<bean id="baseDaoProxyParent" class="org.springframework.aop.framework.ProxyFactoryBean" abstract="true"> …
Health Check Procedures
Proactive approach to detecting issues early and preventing problems whenever possible, to ensure high availability.
Automated Infrastructure Health Check across all environments, which reports on the following:
- Cell infrastructure status: Deployment Manager, Nodes
- Application Servers status
- MQ Listener ports status
- Application status
- WebSphere on z/OS HFS files status
Automated Application check procedure to verify the environment after any application change and to test the impact of system tuning changes (a minimal JMS sketch follows below):
- MQ connectivity and MDB functionality is tested
- JDBC calls are exercised
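As an illustration of the MQ connectivity portion of that check, the sketch below sends a small test message using the standard JMS 1.1 API. It is a hedged example: the JNDI names jms/TestQCF and jms/HealthCheckQueue and the message text are placeholders, not our actual resource names.

import javax.jms.Queue;
import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.jms.QueueSender;
import javax.jms.QueueSession;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;

public class MqHealthCheck {

    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext();
        // JNDI names below are illustrative placeholders
        QueueConnectionFactory qcf = (QueueConnectionFactory) ctx.lookup("jms/TestQCF");
        Queue queue = (Queue) ctx.lookup("jms/HealthCheckQueue");

        QueueConnection con = qcf.createQueueConnection();
        try {
            QueueSession session = con.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
            QueueSender sender = session.createSender(queue);
            TextMessage msg = session.createTextMessage("health-check ping");
            sender.send(msg);
            System.out.println("Test message sent - MQ connectivity OK; the MDB should consume it");
        } finally {
            con.close();
        }
    }
}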
Alerts Auto Notification
The following alerts are sent automatically via email to the WebSphere on z/OS support team:
- WebSphere on z/OS started task went down
- Heartbeat check of all STCs (up/down status)
- SVC dump is taken for any WebSphere on z/OS started task
- High CPU usage of any WebSphere on z/OS started task
- WebSphere on z/OS started task is down for over 10 minutes
- No WebSphere on z/OS HFS is mounted at the expected mount point
- 95%+ WebSphere on z/OS HFS space allocation (threshold can be altered as needed)
- WebSphere on z/OS connection to MQ terminated, usually due to MDB timeout
Understanding Native Storage Usage
You can only specify a limited amount for the User Region size (1,600 MB in this example), because of z/OS system storage allocation in 31-bit mode and shared memory used by other applications.
Note: Drastically higher limits can be set when running WebSphere in 64-bit mode.
The JVM size is allocated out of the User Region size, leaving less than 1 GB available in the Extended Local System Queue Area (ELSQA) to load the following:
- MQ, DB2 & CICS connectors storage
- Cached classes
- JITed code
- JNI objects
- Application classes copied by LE into the Native Heap
Each time your applications are stopped and restarted without restarting the server, the classes get reloaded.
Storage usage is also related to the volume and number of threads allocated to MQ, DB2 & CICS.
Depleting ELSQA storage will result in an 878-10 abend for the WebSphere server.
You need to ensure that enough virtual storage is left in ELSQA.
[Diagram: 31-bit servant address space layout (2 GB address space limit, 1,600 MB User Region in this example). Above the User Region: system control blocks in ELSQA and SWA (size depends on what is currently loaded in the system), ELSQA shared memory for system-shared DLLs (defined by the BPXPRMxx SHRLIBRGNSIZE setting), and free ELSQA storage; depleting ELSQA storage results in 878-10 abends for the WASz server. The LE Native Heap holds MQ, DB2 & CICS connector storage (size depends on the number of active threads and pool sizes), cached classes (size depends on cache settings, for example the cache size set on data sources managed by WASz), JITed code, JNI objects, application classes loaded by LE via ThreadLocal outside of WASz management, and classes loaded using thread pools managed by WASz. The JVM heap (616 MB in this example of WASz servant settings) is the container for the Java application - JAR, WAR & JSP files, application Java properties files and classes - and is GC-eligible.]
Memory Leak issue using ThreadLocals
The ThreadLocal class doesn't work well with thread pools in a J2EE environment.
We observed memory leaks in native storage caused by using ThreadLocal threads, which do not get cleaned up automatically in the MVS Native Heap. ThreadLocal does not interact well with thread pooling in WebSphere Application Server. Since there is no garbage collection in the MVS Native Heap, classes loaded by ThreadLocal threads can remain in storage after the application is stopped. This problem is compounded by the class loader, because ThreadLocal classes are reloaded each time the application is restarted without restarting the server.
Best coding practice recommendations
To avoid native storage leaking, which depletes ELSQA storage, you have the following options:
- Use thread pool threads, which are managed by WebSphere on z/OS
- Avoid the use of ThreadLocal threads
- Clear all ThreadLocals before returning control from an EJB or Servlet invocation (see the sketch below)
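A minimal sketch of the last option, clearing the ThreadLocal in a finally block before the pooled WebSphere thread is returned; the servlet and the CONTEXT holder are hypothetical names used only for illustration.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class RequestContextServlet extends HttpServlet {

    // Hypothetical per-request context held in a ThreadLocal
    private static final ThreadLocal CONTEXT = new ThreadLocal();

    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        try {
            CONTEXT.set(req.getRemoteUser());
            // ... application logic that reads CONTEXT.get() ...
            resp.getWriter().println("done");
        } finally {
            // Always clear the ThreadLocal before the pooled WebSphere thread is reused,
            // so no reference to application classes is left behind in native storage.
            CONTEXT.remove();
        }
    }
}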
Finding Java Threads outside of WebSphere on z/OS management
The Dump Analyzer can be used to find threads that are allocated outside of the WebSphere thread pool.
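Where a quick live check is acceptable instead of analyzing a dump, a rough approximation can also be obtained at runtime with java.lang.management; this is a different technique from the Dump Analyzer, offered only as an assumption-laden illustration. Thread naming varies by WebSphere release, so the "WebSphere" prefix filter below is only a placeholder.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadPoolAudit {

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.getAllThreadIds();
        ThreadInfo[] threads = mx.getThreadInfo(ids);
        for (int i = 0; i < threads.length; i++) {
            if (threads[i] == null) {
                continue;   // thread ended between the two calls
            }
            String name = threads[i].getThreadName();
            // "WebSphere" prefix is a placeholder; adjust to the container's actual pool naming
            if (!name.startsWith("WebSphere")) {
                System.out.println("Thread outside known pools: " + name);
            }
        }
    }
}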
Reference material
I have published an article in the February/March issue of z/Journal titled:
“Hidden Gems: Free IBM Tools to Help You Manage WebSphere on z/OS”
This article covers the following:
- Support Issues: Lessons Learned
- Memory Leak issue using ThreadLocals
- Best coding practice recommendations
- Clearing storage when ThreadLocal is used
- Finding Java Threads outside of WebSphere on z/OS management
- Debugging timeouts
- svcdump.jar utility
- Minimizing the effects of timeouts
- Setting timeouts at Application level
- Garbage Collection Policies with Java 5.0
- FFDC Logs
- Summary of Tools Available in IBM Support Assistant
- Tivoli Performance Viewer (TPV)
- z/OS Console commands
- WebSphere on z/OS V7.0 enhancements
Web link - http://zjournal.com/index.cfm?section=article&aid=1142
Questions