
EMS on Sun Cluster


TIBCO Software Inc.

http://www.tibco.com

3303 Hillview Avenue

Palo Alto, CA 94304

1-800-420-8450

©2004 TIBCO Software Inc. All Rights Reserved. TIBCO Confidential and Proprietary

A Technical Document

This document is part of the Enterprise Integration Framework at “COMPANY XYZ”. It describes the rationale for deploying TIBCO Enterprise Message Service (EMS) on Sun Cluster, along with the detailed steps to achieve a functioning implementation.

Document Revisions

Version   Date         Author   Comments
1.0       11/01/2004            Initial Draft. Portions taken from EMS Best Practices. Sun Cluster section taken from Sun Cluster Overview for Solaris.

Enterprise Integration Framework: TIBCO EMS on Sun Cluster


Document Approvals

Name   Signature   Date

Document Owners

Name

Proprietary and Confidential

This document contains information that is confidential to both “COMPANY XYZ” and TIBCO Software Inc.


Copyright Notice

COPYRIGHT © 2004 TIBCO Software Inc. This document is unpublished and the foregoing notice is affixed to protect TIBCO Software Inc. in the event of inadvertent publication. All rights reserved. No part of this document may be reproduced in any form, including photocopying or transmission electronically to any computer, without prior written consent of TIBCO Software Inc. The information contained in this document is confidential and proprietary to TIBCO Software Inc. and may not be used or disclosed except as expressly authorized in writing by TIBCO Software Inc. Copyright protection includes material generated from our software programs displayed on the screen, such as icons, screen displays, and the like.

Trademarks

Technologies described herein are either covered by existing patents or have patent applications in progress. All brand and product names are trademarks or registered trademarks of their respective holders and are hereby acknowledged.

Confidentiality

The information in this document is subject to change without notice. This document contains information that is confidential and proprietary to TIBCO Software Inc. and may not be copied, published, or disclosed to others, or used for any purposes other than review, without written authorization of an officer of TIBCO Software Inc. Submission of this document does not represent a commitment to implement any portion of this specification in the products of the submitters.

Content Warranty

The information in this document is subject to change without notice. THIS DOCUMENT IS PROVIDED "AS IS" AND TIBCO MAKES NO WARRANTY, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO ALL WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. TIBCO Software Inc. shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance or use of this material.

For more information, please contact:

TIBCO Software Inc.

3303 Hillview Avenue

Palo Alto, CA 94304

USA


Table of Contents

1 Introduction
  1.1 Audience
  1.2 Purpose
  1.3 Related Documentation
2 Enterprise Message Server (EMS)
  2.1 Datastore
  2.2 Client Fault Tolerance
  2.3 Connection Factory
  2.4 Log Files
3 Sun Cluster Software
  3.1 Introduction
  3.2 Cluster Nodes
  3.3 Cluster Interconnect
  3.4 Cluster Membership
  3.5 Cluster Configuration Repository
  3.6 Fault Monitors
  3.7 Quorum Devices
  3.8 Data Integrity
  3.9 Failure Fencing
  3.10 Data Services
4 EMS on Sun Cluster
  4.1 Conceptual Architecture
  4.2 EMS Configuration
  4.3 Cluster Configuration
  4.4 Domain Design
  4.5 Hawk
5 Installation Steps
  5.1 Pre-Requisites
  5.2 Message Server Installation
  5.3 Domain Membership
  5.4 Test Run Servers
  5.5 Placing Servers under Cluster Control
  5.6 Final Configuration Changes
  5.7 Restart the Server Instance
  5.8 Set the Server Password
  5.9 Restart the Server Instance
6 Testing Results
  6.1 Test Clients
  6.2 Failover Testing


7 Conclusions
  7.1 Failover time
  7.2 Use of hardware
  7.3 Monitoring and Management
  7.4 Local tibemsd.conf
8 Appendices
  8.1 tibco_ems.sh
  8.2 tibco_ems_cluster.sh
  8.3 Sun Cluster Configuration (scstat -p)
  8.4 tibjmsFactoryQueueSender.java
  8.5 tibjmsMsgProducer.java


1 Introduction

This document is part of the Enterprise Integration Framework (EIF) for “COMPANY XYZ”.

“COMPANY XYZ” has chosen TIBCO Enterprise Message Service (EMS) as the messaging backbone for all of its integration projects. To this end, it is imperative that EMS be implemented in such a way as to deliver the level of service required by the business.

In addition, “COMPANY XYZ” wants to make the best use of server hardware by not having hardware sit idle waiting to be brought into use in the event of a server failure. The company is also a current and reasonably experienced user of Sun Cluster software.

1.1 Audience

The audience for this document is:

Developers attempting to understand the rationale for a clustered deployment

Administration staff involved in implementing or supporting such a deployment

1.2 Purpose

This document addresses the following questions:

What do we mean by the terms Fault Tolerant and Highly Available in the context of EMS?

What benefits does Sun Cluster provide?

How do we install the EMS components into a Sun Cluster?

What role will TIBCO Administrator play?

What role will TIBCO Hawk play?

Although this document is not intended to be a study of Sun Cluster software, certain principles that Sun Cluster uses to achieve its objectives are common to other forms of cluster software and can therefore be treated as patterns or templates for re-use.

1.3 Related Documentation

TIBCO Product Documentation for the following products:

TIBCO Enterprise Message Service

TIBCO Administrator

TIBCO Runtime Agent

TIBCO Domain Utility

Sun Documentation for the following products:

Sun Cluster

EIF Documentation:

Server Installation Guide – Message Server Package


2 Enterprise Message Server (EMS)

EMS is the TIBCO implementation of the JMS (Java Message Service) specification (v1.1), an API specification from Sun co-written by many other companies, including TIBCO.

However, TIBCO EMS provides a more complete suite of messaging products and tools than simply the JMS API. It is also a critical infrastructure component that has implications for servers, storage, and network infrastructure.

This section does not cover JMS basics or the API, which can be found at a variety of other sources. It assumes that readers have basic knowledge of JMS and familiarity with the TIBCO EMS product. Instead, it focuses on the features and capabilities available within the product to provide an industrial-strength messaging foundation for the applications.

This provides a foundation on which to build an understanding of the reasons behind the decisions made in the Sun Cluster implementation described in Chapter 4.

2.1 Datastore

The TIBCO EMS Server requires storage for persistent messages, state metadata and configuration data, known as its datastore. This consists of three disk files as follows:

meta.db – stores information required by the server, but stores no messages

sync-msgs.db – stores data for queues or topics defined as failsafe

async-msgs.db - stores data for queues or topics NOT defined as failsafe

A large amount of business data will pass through the EMS Server and be stored on disk. This places some stringent requirements on the disk storage:

performance – it must be fast, robust and reliable

size – the storage allocated should be able to grow dynamically over time

recovery – in the event of a disaster, some if not all of the data should be recoverable

SAN-based storage is the best option due to its ability to deliver each of the above requirements.

2.2 Client Fault Tolerance

TIBCO EMS delivers a truly Fault Tolerant solution from the perspective of the client. In the event of brief interruptions in service from an individual EMS Server, another physical server process can take its place with no loss of data from the perspective of the client.

From an infrastructure perspective, the most robust solution is to have the second process hosted on a separate physical server. This allows for the complete hardware failure of the first server without any loss of data. However this introduces some complexities around storage of message data that is in transit.

The TIBCO EMS Server requires storage for persistent messages, state metadata and configuration data, known as its datastore (see above). However, only a single active server process can access the datastore at any one time without corruption occurring due to overlapping writes.

There are two basic ways to manage the use of the datastore:

Allow TIBCO EMS to control access to the datastore through the use of file locking protocols

Use an external means to control access to the datastore


The first option is achieved by running the Fault Tolerant pair of EMS servers simultaneously and configuring them to be aware of each other via a tcp connection. In the event of the primary server failing, the backup server will be aware and attempt to gain control of the datastore by locking it.

This option works well in situations where the datastore is local to the servers, but is complicated when the datastore resides on a network device or on a SAN. Network file locking over cheap protocols such as NFS is notoriously unreliable, whereas commercial products that provide this functionality reliably are prohibitively expensive.

The second option is achieved through the use of TIBCO Hawk or clustering software. TIBCO Hawk can be used to ensure that only a single instance of a process is running, but Clustering software has the added advantage that it can detect network malfunctions and can also guarantee, through the mounting and unmounting of disk partitions, that only a single server has access to the datastore.

NOTE: Even though the combination of the above features provides a reasonable level of Fault Tolerance, it cannot mitigate every possible failure mode. Failures involving both physical server nodes, or prolonged network outages, will eventually result in client disconnects. However, these situations can be handled by TIBCO Hawk rulebases running locally to the client, in conjunction with good process design, to shut down clients and restart them when the EMS Servers come back up, thus preventing many spurious error conditions.

2.3 Connection Factory

In order to make use of the Client Fault Tolerance capabilities of TIBCO EMS, clients must ensure that their connections are created correctly and have appropriate values for retry and timeout settings.

A Fault Tolerant connection URL takes the form:

tcp://<server 1>:<port>,tcp://<server 2>:<port>

The retry count and timeout settings are properties on the Tibjms object which control the attempts by the client to reconnect to one of the possible servers as follows:

reconnect_attempt_count – After losing its server connection, a client program iterates through its URL list until it re-establishes a connection with an EMS server. This property determines the maximum number of iterations. When absent, the default is 4.

reconnect_attempt_delay – When attempting to reconnect, the client sleeps for this interval (in milliseconds) between iterations through the URL list. When absent, the default is 500 milliseconds.

While this can be done on an individual basis through code, the recommended method is to use a JNDI call to a Connection Factory object. This allows the retrieval of the above parameters from the server, thus centralizing control and administration.

A Fault Tolerant JNDI Connection Factory URL takes the form:

tibjmsnaming://<server 1>:<port>, tibjmsnaming://<server 2>:<port>


2.4 Log Files

Depending upon the tracing and logging parameters, the server can generate different amounts of log file output. For performance reasons, it is recommended that these files reside on a fast disk.

The settings for logging should be set as follows for production systems:

Message Tracing should not be permanently enabled in production

At a minimum, the WARNING mode should be set

A maximum file size should be specified to prevent uncapped growth

Where possible these files should reside on the SAN. This will allow access to diagnose issues in the event that the server node cannot be immediately recovered.


3 Sun Cluster Software

A full description of the capabilities of Sun Cluster software is beyond the scope of this document. However, an understanding of its operation is essential to comprehending the decisions made in deploying TIBCO EMS on such a cluster.

3.1 Introduction

A cluster is two or more systems, or nodes, that work together as a single, continuously available system to provide applications, system resources, and data to users. Each node on a cluster is a fully functional standalone system. However, in a clustered environment, the nodes are connected by an interconnect and work together as a single entity to provide increased availability and performance.

Figure 1. Sun Cluster Hardware Components

Highly available clusters provide nearly continuous access to data and applications by keeping the cluster running through failures that would normally bring down a single server system. No single failure—hardware, software, or network—can cause a cluster to fail. By contrast, fault-tolerant hardware systems provide constant access to data and applications, but at a higher cost because of specialized hardware. Fault-tolerant systems usually have no provision for software failures.

An application is highly available if it survives any single software or hardware failure in the system. Failures that are caused by bugs or data corruption within the application itself are excluded. The following apply to highly available applications:

Recovery is transparent to the applications that use a resource.

Resource access is fully preserved across node failure.

Applications cannot detect that the hosting node has been moved to another node.

Failure of a single node is completely transparent to programs on remaining nodes that use the files, devices, and disk volumes attached to this node.

A failover service provides high availability through redundancy. When a failure occurs, an application can be configured to either restart on the same node or be moved to another node in the cluster, without user intervention.


The Sun Cluster system makes the path between users and data highly available by using multihost disks, multipathing, and a global file system. The Sun Cluster system monitors failures for the following:

Applications – Most of the Sun Cluster data services supply a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemon or daemons are running and that clients are being served. Based on the information that is returned by probes, a predefined action such as restarting daemons or causing a failover can be initiated.

Disk-Paths – Sun Cluster software supports disk-path monitoring (DPM). DPM improves the overall reliability of failover and switchover by reporting the failure of a secondary disk path.

Internet Protocol (IP) Multipath – Solaris IP network multipathing software on Sun Cluster systems provides the basic mechanism for monitoring public network adapters. IP multipathing also enables failover of IP addresses from one adapter to another when a fault is detected.

The following sections describe some of the key terms and definitions used when discussing clustering using Sun Cluster software.

3.2 Cluster Nodes

A cluster node is a server running both Solaris and the Sun Cluster software. This can be an entire physical server, or a Solaris domain within a larger server. Sun Cluster allows from two to eight nodes in a cluster.

Cluster nodes are generally attached to one or more disks. Nodes not attached to disks use the cluster file system to access the multihost disks.

Every node in the cluster is aware when another node joins or leaves the cluster. Also, every node in the cluster is aware of the resources that are running locally as well as the resources that are running on the other cluster nodes.

Nodes in the same cluster should have similar processing, memory, and I/O capability to enable failover to occur without significant degradation in performance. Because of the possibility of failover, each node should have sufficient capacity to meet service level agreements if a node fails.

3.3 Cluster Interconnect

The cluster interconnect is the physical configuration of devices that are used to transfer cluster-private communications and data service communications between cluster nodes.

Redundant interconnects enable operation to continue over the surviving interconnects while system administrators isolate failures and repair communication. The Sun Cluster software detects, repairs, and automatically reinitiates communication over a repaired interconnect.

All nodes must be connected by the cluster interconnect through at least two redundant physically independent networks, or paths, to avoid a single point of failure. While two interconnects are required for redundancy, up to six can be used to spread traffic to avoid bottlenecks and improve redundancy and scalability. The Sun Cluster interconnect uses Fast Ethernet, Gigabit-Ethernet, Sun Fire Link, or the Scalable Coherent Interface (SCI, IEEE 1596-1992), enabling high-performance cluster-private communications.

The reliable detection of interconnect issues is one area where cluster software is superior to the traditional use of Hawk to control processes on geographically dispersed systems. While Hawk can detect loss of network connectivity, it lacks the multipath facilities and quorum features of the cluster, where failing components are removed or ‘fenced’ to prevent them from attempting to regain control of system resources.


3.4 Cluster Membership

The Cluster Membership Monitor (CMM) is a distributed set of agents that exchange messages over the cluster interconnect to complete the following tasks:

Enforcing a consistent membership view on all nodes (quorum)

Driving synchronized reconfiguration in response to membership changes

Handling cluster partitioning

Ensuring full connectivity among all cluster members by leaving unhealthy nodes out of the cluster until they are repaired

The main function of the CMM is to establish cluster membership, which requires a cluster-wide agreement on the set of nodes that participate in the cluster at any time. The CMM detects major cluster status changes on each node, such as loss of communication between one or more nodes. The CMM relies on the transport kernel module to generate heartbeats across the transport medium to other nodes in the cluster. When the CMM does not detect a heartbeat from a node within a defined time-out period, the CMM considers the node to have failed and the CMM initiates a cluster reconfiguration to renegotiate cluster membership.

To determine cluster membership and to ensure data integrity, the CMM performs the following tasks:

Accounting for a change in cluster membership, such as a node joining or leaving the cluster

Ensuring that an unhealthy node leaves the cluster

Ensuring that an unhealthy node remains inactive until it is repaired

Preventing the cluster from partitioning itself into subsets of nodes.

3.5 Cluster Configuration Repository

The Cluster Configuration Repository (CCR) is a private, cluster-wide, distributed database for storing information that pertains to the configuration and state of the cluster. To avoid corrupting configuration data, each node must be aware of the current state of the cluster resources. The CCR ensures that all nodes have a consistent view of the cluster. The CCR is updated when error or recovery situations occur or when the general status of the cluster changes.

The CCR structures contain the following types of information:

Cluster and node names

Cluster transport configuration

The names of Solaris Volume Manager disk sets or VERITAS disk groups

A list of nodes that can master each disk group

Operational parameter values for data services

Paths to data service callback methods

DID device configuration

Current cluster status

3.6 Fault Monitors

Sun Cluster system makes all components on the ”path” between users and data highly available by monitoring the applications themselves, the file system, and network interfaces.

The Sun Cluster software detects a node failure quickly and creates an equivalent server for the resources on the failed node. The Sun Cluster software ensures that resources unaffected by the failed node are constantly available during the recovery and that resources of the failed node become available as soon as they are recovered.


3.6.1 Data Services Monitoring

Each Sun Cluster data service supplies a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemon or daemons are running and that clients are being served. Based on the information returned by probes, predefined actions such as restarting daemons or causing a failover can be initiated.

3.6.2 Disk-Path Monitoring

Sun Cluster software supports disk-path monitoring (DPM). DPM improves the overall reliability of failover and switchover by reporting the failure of a secondary disk-path.

3.6.3 IP Multipath Monitoring

Each cluster node has its own IP network multipathing configuration, which can differ from the configuration on other cluster nodes. IP network multipathing monitors the following network communication failures:

The transmit and receive path of the network adapter has stopped transmitting packets.

The attachment of the network adapter to the link is down.

The port on the switch does not transmit-receive packets.

The physical interface in a group is not present at system boot.

3.7 Quorum Devices

A quorum device is a disk shared by two or more nodes that contributes votes that are used to establish a quorum for the cluster to run. The cluster can operate only when a quorum of votes is available. The quorum device is used when a cluster becomes partitioned into separate sets of nodes to establish which set of nodes constitutes the new cluster.

Both cluster nodes and quorum devices vote to form quorum. By default, cluster nodes acquire a quorum vote count of one when they boot and become cluster members. Nodes can have a vote count of zero when the node is being installed, or when an administrator has placed a node into the maintenance state.

Quorum devices acquire quorum vote counts that are based on the number of node connections to the device. When you set up a quorum device, it acquires a maximum vote count of N-1 where N is the number of connected votes to the quorum device. For example, a quorum device that is connected to two nodes with nonzero vote counts has a quorum count of one (two minus one).

3.8 Data Integrity

The Sun Cluster system attempts to prevent data corruption and ensure data integrity. Because cluster nodes share data and resources, a cluster must never split into separate partitions that are active at the same time. The CMM guarantees that only one cluster is operational at any time.

Two types of problems can arise from cluster partitions:

Split Brain

Amnesia

Split brain occurs when the cluster interconnect between nodes is lost and the cluster becomes partitioned into subclusters, and each subcluster believes that it is the only partition. A subcluster that is not aware of the other subclusters could cause a conflict in shared resources such as duplicate network addresses and data corruption.

Amnesia occurs if all the nodes leave the cluster in staggered groups. An example is a two-node cluster with nodes A and B. If node A goes down, the configuration data in the CCR is updated on node B only, and not node A. If node B goes down at a later time, and if node A is rebooted, node A will be running with old contents of the CCR.


This state is called amnesia and might lead to running a cluster with stale configuration information.

Sun Cluster avoids split brain and amnesia by giving each node one vote and mandating a majority of votes for an operational cluster. A partition with the majority of votes has a quorum and is enabled to operate. This majority vote mechanism works well if more than two nodes are in the cluster. In a two-node cluster, a majority is two. If such a cluster becomes partitioned, an external vote enables a partition to gain quorum. This external vote is provided by a quorum device. A quorum device can be any disk that is shared between the two nodes.

3.9 Failure Fencing

A major issue for clusters is a failure that causes the cluster to become partitioned (called split brain). When this situation occurs, not all nodes can communicate, so individual nodes or subsets of nodes might try to form individual or subset clusters. Each subset or partition might “believe” it has sole access and ownership to the multihost disks. Attempts by multiple nodes to write to the disks can result in data corruption.

Failure fencing limits node access to multihost disks by preventing access to the disks. When a node leaves the cluster (it either fails or becomes partitioned), failure fencing ensures that the node can no longer access the disks. Only current member nodes have access to the disks, ensuring data integrity.

3.10 Data Services

A data service is the combination of software and configuration files that enables an application to run without modification in a Sun Cluster configuration. When running in a Sun Cluster configuration, an application runs as a resource under the control of the Resource Group Manager (RGM). A data service enables you to configure an application such as Sun Java System Web Server or Oracle database to run on a cluster instead of on a single server.

The software of a data service provides implementations of Sun Cluster management methods that perform the following operations on the application:

Starting the application

Stopping the application

Monitoring faults in the application and recovering from these faults

The configuration files of a data service define the properties of the resource that represents the application to the RGM. The RGM controls the disposition of the failover and scalable data services in the cluster. The RGM is responsible for starting and stopping the data services on selected nodes of the cluster in response to cluster membership changes. The RGM enables data service applications to utilize the cluster framework.

The RGM controls data services as resources. These implementations are either supplied by Sun or created by a developer who uses a generic data service template, the Data Service Development Library API (DSDL API), or the Resource Management API (RMAPI). The cluster administrator creates and manages resources in containers that are called resource groups. RGM and administrator actions cause resources and resource groups to move between online and offline states.

3.10.1 Resource Types

A resource type is a collection of properties that describe an application to the cluster. This collection includes information about how the application is to be started, stopped, and monitored on nodes of the cluster. A resource type also includes application-specific properties that need to be defined in order to use the application in the cluster. Sun Cluster data services have several predefined resource types. For example, Sun Cluster HA for Oracle is the resource type SUNW.oracle-server and Sun Cluster HA for Apache is the resource type SUNW.apache.


3.10.2 Resources

A resource is an instance of a resource type that is defined cluster wide. The resource type enables multiple instances of an application to be installed on the cluster. When you initialize a resource, the RGM assigns values to application-specific properties and the resource inherits any properties on the resource type level.

Data services utilize several types of resources. Applications such as Apache Web Server or Sun Java System Web Server utilize network addresses (logical hostnames and shared addresses) on which the applications depend. Application and network resources form a basic unit that is managed by the RGM.

3.10.3 Resource Groups

Resources that are managed by the RGM are placed into resource groups so that they can be managed as a unit. A resource group is a set of related or interdependent resources. For example, a resource derived from a SUNW.LogicalHostname resource type might be placed in the same resource group as a resource derived from an Oracle database resource type. A resource group migrates as a unit if a failover or switchover is initiated on the resource group.

3.10.4 Data Service Types

Data services enable applications to become highly available and scalable services, helping to prevent significant application interruption after any single failure within the cluster.

When a data service is configured, the data service must be configured as one of the following data service types:

Failover data service

Scalable data service

Parallel data service

3.10.4.1 Failover Data Services

Failover is the process by which the cluster automatically relocates an application from a failed primary node to a designated redundant secondary node. Failover applications have the following characteristics:

Capable of running on only one node of the cluster

Not cluster-aware

Dependent on the cluster framework for high availability

If the fault monitor detects an error, it either attempts to restart the instance on the same node, or to start the instance on another node (failover), depending on how the data service has been configured. Failover services use a failover resource group, which is a container for application instance resources and network resources (logical hostnames). Logical hostnames are IP addresses that can be configured up on one node, and later, automatically configured down on the original node and configured up on another node.

Clients might have a brief interruption in service and might need to reconnect after the failover has finished. However, clients are not aware of the change in the physical server that is providing the service.

3.10.4.2 Scalable Data Services

The scalable data service enables application instances to run on multiple nodes simultaneously. Scalable services use two resource groups. The scalable resource group contains the application resources and the failover resource group contains the network resources (shared addresses) on which the scalable service depends. The scalable resource group can be online on multiple nodes, so multiple instances of the service can be running simultaneously. The failover resource group that hosts the shared address is online on only one node at a time. All nodes that host a scalable service use the same shared address to host the service.


The cluster receives service requests through a single network interface (the global interface). These requests are distributed to the nodes, based on one of several predefined algorithms that are set by the load-balancing policy. The cluster can use the load-balancing policy to balance the service load between several nodes.

3.10.4.3 Parallel Applications

Sun Cluster systems provide an environment that shares parallel execution of applications across all the nodes of the cluster by using parallel databases. Sun Cluster Support for Oracle Parallel Server/Real Application Clusters is a set of packages that, when installed, enables Oracle Parallel Server/Real Application Clusters to run on Sun Cluster nodes. This data service also enables Sun Cluster Support for Oracle Parallel Server/Real Application Clusters to be managed by using Sun Cluster commands.

A parallel application has been instrumented to run in a cluster environment so that the application can be mastered by two or more nodes simultaneously. In an Oracle Parallel Server/Real Application Clusters environment, multiple Oracle instances cooperate to provide access to the same shared database. The Oracle clients can use any of the instances to access the database. Thus, if one or more instances have failed, clients can connect to a surviving instance and continue to access the database.


4 EMS on Sun Cluster

4.1 Conceptual Architecture

Each EMS Business Service is implemented as a separate Sun Cluster Data Service along with its associated Logical Host, Datastore and Application Resources.

Figure 2. Conceptual Architecture

Additionally, TIBCO Runtime Agent is installed on each server node in the cluster and bound to the physical name and IP address of each server. TIBCO Runtime Agent is NOT under cluster control and is started at system boot via the usual init.d mechanism.

4.2 EMS Configuration

EMS is installed normally on each server, and then a series of changes are made. Note that this section only describes the reason for and nature of each change; the overall sequence of events is detailed in the next chapter.


4.2.1 Control Scripts

When an EMS server is registered into the TIBCO Administration Domain, a control shell script is created as follows:

$TIBCO_HOME/ems/bin/domain/<domain name>/TIBCOServers-E4JMS_<port number>.sh

When a second server is added anywhere in the domain with the same port number, the control script is created with a different name as follows:

$TIBCO_HOME/ems/bin/domain/<domain name>/TIBCOServers-E4JMS-1_<port number>.sh

The “COMPANY XYZ” EMS installation package creates a Unix shell script, tibco_ems.sh, which is the main script used to start, stop or check an EMS service. It takes the EMS listening port number and an action as arguments and utilizes whichever of the above shell scripts is present to start or stop the given EMS server.

The use of the tibco_ems.sh script whenever interacting with EMS at the command line ensures that a consistent state will always be reported in TIBCO Administrator. It is also used by the Sun Cluster software to check whether EMS is running and to start/stop it as necessary.

The contents of the tibco_ems.sh script are listed in Appendix 8.1
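
Purely as an illustration of the dispatch logic described above, a minimal sketch of such a wrapper might look as follows. The domain name, the process check used by the check action and the exact invocation of the domain control script are assumptions for this sketch, not extracts from the production script in Appendix 8.1.

#!/bin/sh
# Sketch only - the production script is listed in Appendix 8.1.
# Usage: tibco_ems.sh <port number> {start|stop|check}
PORT=$1
ACTION=$2
DOMAIN=MYDOMAIN                                   # assumed domain name
BINDIR=$TIBCO_HOME/ems/bin/domain/$DOMAIN
SCRIPT=""

# Use whichever domain control script exists for this port; the second
# server registered on the same port gets the -1 suffix.
for CANDIDATE in TIBCOServers-E4JMS_${PORT}.sh TIBCOServers-E4JMS-1_${PORT}.sh
do
    if [ -x "$BINDIR/$CANDIDATE" ]; then
        SCRIPT="$BINDIR/$CANDIDATE"
    fi
done

if [ -z "$SCRIPT" ]; then
    echo "No domain control script found for port $PORT" >&2
    exit 1
fi

case "$ACTION" in
    start) "$SCRIPT" start ;;
    stop)  "$SCRIPT" stop ;;
    check) # Assumed check: look for a tibemsd process using this port's config file.
           PID=`ps -ef | grep tibemsd | grep "/${PORT}/tibemsd.conf" | grep -v grep | awk '{print $2}'`
           if [ -n "$PID" ]; then
               echo "TIBCO Enterprise Messaging Server ($PORT) running with pid $PID"
           else
               echo "TIBCO Enterprise Messaging Server ($PORT) stopped"
           fi ;;
    *)     echo "Usage: $0 <port number> {start|stop|check}" >&2; exit 1 ;;
esac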

4.2.2 Configuration Files

Configuration files are created in advance for each server and contain the Queue, Topic and ACL definitions modeled in lower environments and promoted through change management procedures.

These files are initially located under the $CONFIG_ROOT folder designated in the tibco.sh environment control file. Under the ems sub-folder there is a folder for each individual server containing the set of configuration files required by EMS. This is the location specified in the Domain Utility when adding the EMS server to the TIBCO Administration Domain.

$CONFIG_ROOT
    ems
        7020                (config files for the Merch Business Domain)
            tibemsd.conf
            factories.conf
            users.conf
            ...
        7030                (config files for the Supply Chain Business Domain)
        7040
    hawk
    ...

Figure 3. EMS Configuration Files Directory Structure

When first installed on the server the configuration files have the following important characteristics:


A copy physically resides on each server

They point to logfiles local to each server

They point to a datastore local to each server

They contain the same server name, which is the name of the Business Domain (e.g. EMS-MERCH)

They contain the listen parameter tcp://<port number> which binds EMS to the default interface for the given server

They do NOT contain any Fault-Tolerant setup parameters

This configuration allows each server to be registered into the domain and tested prior to placing them under Sun Cluster control. Once the Sun Cluster configuration has been created and tested, the following modifications are made:

A single copy of the configuration files is copied to the Sun Cluster partition

A logical link is created from the original config folder to the above folder

The central tibemsd.conf file is edited to place the datastore on the Sun Cluster partition

The central tibemsd.conf file is edited to place logfiles on the Sun Cluster partition

The central tibemsd.conf file is edited to configure the FT Connection Factories

Note that the servers are NOT configured to be aware of each other via the traditional Fault-Tolerant setup configuration. At no time will the two servers be allowed to run simultaneously; this is controlled by the configuration of the Sun Cluster software.

The centralized configuration file factories.conf that controls the Connection Factory parameters is modified to add the reconnect_attempt_count and reconnect_attempt_delay parameters as follows:

[FTTopicConnectionFactory]
type = topic
url = tcp://<server a>:<port number>,tcp://<server b>:<port number>
reconnect_attempt_count = 60
reconnect_attempt_delay = 5000

[FTQueueConnectionFactory]
type = queue
url = tcp://<server a>:<port number>,tcp://<server b>:<port number>
reconnect_attempt_count = 60
reconnect_attempt_delay = 5000

The settings above allow the client libraries to attempt to reconnect every 5 seconds for up to 5 minutes. These settings may be adjusted with further experience and testing.

4.3 Cluster Configuration

The installation and configuration of the Sun Cluster software on each node is beyond the scope of this document and will normally be carried out by the Unix Services team.

This section covers the salient points of the configuration of the Resource Groups to support the EMS servers and any configuration changes or scripts required.

The following Resources are required within an EMS Resource Group:

Logical Host Resource – although not used to connect to EMS, this is required

Datastore Resource – this is the disk partition that will be mounted on only the single active primary node for each EMS server


Application Resource – this defines the application in terms of how to start/stop it and how to check its status

For example, the following Resource Groups were created to support the Merchandising and Supply Chain Business Domains:

Resource Group       Resources                  Description
ctibco_merch_rg      lh_cert_tibcomerch         Logical Host resource for Merchandising
                     HA_ctibco_merch_store      Datastore resource for Merchandising
                     ctibco_merch_app           Application Resource for Merchandising
ctibco_suppch_rg     lh_cert_tibcosuppch        Logical Host resource for Supply Chain
                     HA_ctibco_suppch_store     Datastore resource for Supply Chain
                     ctibco_suppch_app          Application Resource for Supply Chain

Figure 4. Resource Groups created for testing

These resources are set up such that the ctibco_merch_rg items are active on node a and inactive on node b, and the ctibco_suppch_rg items are active on node b and inactive on node a.

When creating an Application Resource, Sun Cluster requires three parameters:

A script to start the Application

A script to stop the Application

A script to return 1 if the application is running correctly and 0 otherwise

These requirements are fulfilled by the tibco_ems_cluster.sh script which provides all three services via a single command line argument which is either ‘start’, ‘stop’ or ‘check’. It utilizes the tibco_ems.sh script and translates the output of that script into the format required by Sun Cluster.

The tibco_ems_cluster.sh script is listed in full in Appendix 8.2 and is installed by the Unix Services team as part of the cluster Resource Groups creation. During the installation they will rename the script as appropriate, e.g. tibco_ems_merch.sh and change the port number as required.
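
For illustration only, a minimal sketch of the wrapper's shape is shown below; the port number, the script location and the pattern matched in the check branch are assumptions, and the production version is the full listing in Appendix 8.2.

#!/bin/sh
# Sketch only - the production script is listed in Appendix 8.2.
# Renamed per service (e.g. tibco_ems_merch.sh) with PORT edited to match.
PORT=7020                                # assumed port for this Business Domain
SCRIPTS=/opt/tibco/scripts

case "$1" in
    start)
        $SCRIPTS/tibco_ems.sh $PORT start
        ;;
    stop)
        $SCRIPTS/tibco_ems.sh $PORT stop
        ;;
    check)
        # Translate the tibco_ems.sh output into the convention listed above:
        # return 1 if the application is running correctly, 0 otherwise.
        if $SCRIPTS/tibco_ems.sh $PORT check | grep "running" >/dev/null
        then
            exit 1
        else
            exit 0
        fi
        ;;
    *)
        echo "Usage: $0 {start|stop|check}" >&2
        exit 0
        ;;
esac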

The cluster software is configured to automatically start the Application Resource items on their primary cluster nodes at startup. It is also configured to check them at regular intervals and attempt to restart them if not running. If the application will not restart after a given number of attempts then it will be failed over to the other cluster node. The monitoring interval and number of restart attempts are configurable and were set to 15 seconds and 3 restart attempts during testing.

It is also important to note that the Application Resource is created as “Non-network aware”. This means that the cluster software will not attempt to assess the status of the EMS servers by connecting to a tcp port at regular intervals. Instead it will rely on the information returned from the check script as configured above.

A full listing of the Sun Cluster configuration as returned by the command ‘scstat -p’ can be found in Appendix 8.3.

Note that the EMS listen port is not bound to the logical or virtual IP address of the logical host. Due to the limitations of the current EMS Administrator Plugin, the server must be listening on the default interface of the host on which it is running in order to be administered through TIBCO Administrator.

This limitation may be removed in future releases.


4.4 Domain Design

TIBCO Runtime Agent is installed on each node in the cluster and is NOT placed under cluster control. Both machines are added to the TIBCO Administration Domain and this allows the monitoring of the resources and health of each server independently.

The EMS servers on each cluster node for a given Business Domain (Merchandising, Supply Chain etc) are added to the TIBCO Administration Domain.

Thus, within TIBCO Administrator, for each Business Domain, the EMS server instances on both cluster nodes will appear in the display. Under normal conditions, one of them will appear as “Running”, the other as “Stopped” as shown below:

Figure 5. Display of Clustered EMS Servers in TIBCO Administrator GUI

Figure 6. Display after failover of the 7020 server instance to its secondary cluster node

Note: It is recommended that, for consistency, the proposed Primary Server node be added first and then the secondary, due to the fact that Administrator will add the ‘-1’ suffix to the second server. This ensures that whenever an operator sees a service with a ‘-1’ suffix running, they know that it is a secondary server.

Should a failover occur then the TIBCO Administrator display will automatically update to show the correct status of the two servers. This ensures that operators always receive coherent information on the status of the servers regardless of whether they use TIBCO Administrator, the Sun Cluster Manager or the command line shell scripts.

4.5 Hawk

TIBCO Hawk is not used to control the lifecycle of the EMS servers as they are controlled by the cluster software.


However, Hawk can be and is used to monitor the health of the cluster nodes and to notify clients if for some reason the EMS Service becomes completely unavailable.

These rulebases are deployed into the TIBCO Runtime Agents running on each cluster node.


5 Installation Steps

The step-by-step instruction guide can be found in the Message Server Installation Guide.

The following sections describe the reason for each step and the order in which they must be completed.

5.1 Pre-Requisites

5.1.1 Configuration Files

Prior to installation, a set of configuration files will be created which will control the installation of all the “COMPANY XYZ” Packages based on the server's purpose.

These files are un-tarred into the $CONFIG_ROOT folder and will contain (amongst other things) a folder for each EMS Service to be created, denoted by its EMS port number. See section 4.2.2 for details.

5.1.2 Systems Management Package

Before the Message Server Package can be installed, the Systems Management Package (consisting of TIBCO Runtime Agent and Domain Membership) must be installed.

The Systems Management Package will create the TIBCO Runtime Agent for the server nodes based on the configuration files loaded on the machine in the previous step.

5.2 Message Server Installation

TIBCO EMS can now be installed on each cluster node using the “COMPANY XYZ” Message Server Package.

This will install the EMS software along with the control script tibco_ems.sh.

5.3 Domain Membership

Each server should now be added to the TIBCO Administration Domain using the TIBCO Domain Utility. Note that this utility must be run via an X-Windows session.

Ensure that for each Business Domain, the EMS server on the intended Primary Cluster Node is added to the domain first for the reasons described in sections 4.2.1 and 4.4.

In the case of a standard “COMPANY XYZ” install the Domain Utility parameters will look like the following:

EMS Version : 4.1.0

Port : See Tables below

Home Path : /opt/tibco/ems

Configuration File : /opt/tibco/config/ems/<port number>/tibemsd.conf

User Name : admin

Password : *****

The password will be set to the current administrator password for the environment. This will be set in the EMS Servers in a subsequent step.


It is imperative that this information, especially the password, be specified correctly as the only way of changing it is to remove the EMS Server from the domain and add it again.

5.4 Test Run Servers

It is prudent at this stage to confirm that each of the Server Instances can be started. Because they are currently using local configuration files, all of the servers can be started simultaneously or individually.

For each Server Instance, start the server from TIBCO Administrator and confirm that the Service displays “Started”. Then proceed to the command line and confirm that tibco_ems.sh reports the service as running as follows:

$ cd /opt/tibco/scripts

$ ./tibco_ems.sh <port number> check

TIBCO Enterprise Messaging Server (<port number>) running with pid <pid>

Now stop the Server Instance from the command line as follows:

$ ./tibco_ems.sh <port number> stop

TIBCO Enterprise Messaging Server (<port number>) stopping

Confirm that the Administrator eventually shows the Server Instance as “Stopped”.

Repeat this process for each Server Instance on each Cluster Node.

5.5 Placing Servers under Cluster Control

The next step is to have the Unix Services team create the cluster configuration and place the servers under cluster control.

Once this is finished, have the Unix Services team start the Server Resource Groups. The servers should display in TIBCO Administrator as “Running” on their allocated primary Cluster Node and “Stopped” on their allocated Secondary Cluster Node.

It is useful at this stage to verify that the Server Instances can be failed between their Primary and Secondary Cluster Nodes from the Cluster Management console. Also verify that the cluster Data Store Resource fails over between the nodes and is mounted correctly on only the currently active node.
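
Assuming the standard Sun Cluster 3.x command-line utilities and the hypothetical node names nodea and nodeb (verify the exact commands and flags against the Sun Cluster documentation for the installed release), a switchover of the Merchandising Resource Group can also be driven and observed from the shell:

$ scswitch -z -g ctibco_merch_rg -h nodeb     # move the Resource Group to the secondary node
$ scstat -g                                   # confirm which node is now hosting the group
$ df -k | grep ems_merch                      # run on each node: the datastore partition should be mounted only on the active node
$ scswitch -z -g ctibco_merch_rg -h nodea     # switch the group back to its primary node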

It is also prudent to check that, if the currently active Server Instance is stopped from within TIBCO Administrator, the Cluster Software restarts it.

Remember that at this stage, clicking on the link in Administrator for a given Server Instance will not work as the password has not yet been set on the EMS Server Instance.

5.6 Final Configuration Changes

Because the EMS Server reads its configuration files only at startup or when configuration changes are made, the following changes can be applied while the Server Instance is running.

5.6.1 Move Configuration Files to Cluster File System

For a given Server Instance, on the currently active Cluster Node, move the configuration folder to the mounted folder for that Server Instance and replace it with a link to the folder's new location. For example, on the testing servers:

$ cd /opt/tibco/config/ems


$ mkdir /var/ems_data/ems_merch/config

$ mv 7020/* /var/ems_data/ems_merch/config/

$ rmdir 7020

$ ln -s /var/ems_data/ems_merch/config 7020

On the currently inactive Cluster Node, simply delete the existing configuration files and create a link to where the folder will be mounted. Even though the folder is not currently mounted, the link will become valid when the Cluster Software fails the Server Instance over along with its Data Store Resource.

$ cd /opt/tibco/config/ems

$ rm -rf 7020

$ ln -s /var/ems_data/ems_merch/config 7020

Repeat this process for each pair of EMS Server Instances using the correct port numbers and mount point folders.

5.6.2 Point Datastore to Cluster File System

On the currently active Cluster Node for a given Server Instance, edit the tibemsd.conf file and set the “store” parameter to point to the desired folder under the mounted partition.

########################################################################

# Persistent Storage.

store = /var/tibco_ems/ems_merch/datastore

EMS will create the datastore folder at startup if it does not already exist.
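After the restart described in section 5.7, the new location can be confirmed from the currently active Cluster Node. The following is a sketch only; the file names shown are those the EMS Server typically creates and may vary by version:

$ ls /var/tibco_ems/ems_merch/datastore

async-msgs.db meta.db sync-msgs.db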

5.6.3 Point Log File to Cluster File System

On the currently active Cluster Node for a given Server Instance, edit the tibemsd.conf file and set the logfile parameter to point to the desired folder under the mounted partition.

#######################################################################

# Log file name and tracing parameters.

logfile = /var/tibco_ems/ems_merch/logs

Create the logs folder manually as follows:

$ mkdir /var/tibco_ems/ems_merch/logs

5.6.4 Reset the Admin User Password

Although the configuration files should be pre-built with a blank Admin password, it is worthwhile confirming this as follows.

On the currently active Cluster Node for a given Server Instance, edit the users.conf file and identify the following line:

admin:<misc text>:"Administrator"


Remove any text between the two colons to leave the line as follows:

admin::"Administrator"

5.6.5 Other Configuration File Entries

Although the configuration files are pre-built, it is worthwhile confirming that the following settings are correct for each Server Instance (a sample excerpt follows the list):

The listen parameter in tibemsd.conf is set to tcp://<port number>

The Fault-tolerant Setup parameters in tibemsd.conf are all empty

The “server” parameter in tibemsd.conf is correct
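As an illustration, the relevant tibemsd.conf entries for an instance on port 7020 would be expected to look similar to the following (the server name shown is a placeholder, not the actual “COMPANY XYZ” value):

#######################################################################

# Server identification and listen port.

server = EMS-MERCH

listen = tcp://7020

#######################################################################

# Fault Tolerant Setup - left empty because failover is handled by Sun Cluster.

ft_active =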

5.7 Restart the Server Instance

Go to TIBCO Administrator and stop the currently active Server Instance. The Cluster Software will restart it. At this point the configuration changes entered above will have taken effect.

5.8 Set the Server Password

The last remaining step is to set the EMS Server password to the same value as was entered in the TIBCO Domain Utility in section 5.3.

At the command line on either Cluster Node, start the TIBCO EMS Administration Tool as follows:

$ cd /opt/tibco/ems/bin

$ ./tibemsadmin

TIBCO Enterprise Message Service Administration Tool.

Copyright 2003-2004 by TIBCO Software Inc.

All rights reserved.

Version 4.1.0 V6 6/21/2004

Type 'help' for commands help, 'exit' to exit:

> connect tcp://<node a>:<port number>, tcp://<node b>:<port number>

Press return twice to login as the admin user with no password:

Login name (admin): (Press Return)

Password: (Press Return)

Connected to: tcp://<server>:7020

tcp://<server>:7020>
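The administrator password can then be changed from this prompt. The following is a minimal sketch, assuming the standard tibemsadmin “set password” command; substitute the actual environment password entered in section 5.3:

tcp://<server>:7020> set password admin <administrator password>

tcp://<server>:7020> exit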


5.9 Restart the Server Instance

Go to TIBCO Administrator and stop the currently active Server Instance. The Cluster Software will restart it. At this point the link in TIBCO Administrator will display the EMS Server Administration pages.

Confirm that the following settings show the correct values:

Server Name on the General Page

Log File Name on the General Page.

Server Store Directory on the Server Page

FTQueueConnectionFactory on the Resources Page

FTTopicConnectionFactory on the Resources Page


6 Testing Results

6.1 Test Clients

6.1.1 Java Client (Connection Factory)

This client connected to the EMS Server Instance via a Connection Factory URL of the form:

tibjmsnaming://<server a>:<port>,tibjmsnaming://<server b>:<port>

The test program tibjmsFactoryQueueSender.java was created from the existing sample program tibjmsMsgProducer.java and was modified to obtain its connection from the FTQueueConnectionFactory looked up via JNDI, as follows:

String providerContextFactory = "com.tibco.tibjms.naming.TibjmsInitialContextFactory";

String defaultTopicConnectionFactory = "FTTopicConnectionFactory";

String defaultQueueConnectionFactory = "FTQueueConnectionFactory";

String providerUrls ="tibjmsnaming://localhost:7222,tibjmsnaming://localhost:7222";

Hashtable env = new Hashtable ();

env.put ( Context.INITIAL_CONTEXT_FACTORY, providerContextFactory );

env.put ( Context.PROVIDER_URL, providerUrls );

InitialContext jndiContext = new InitialContext ( env );

QueueConnectionFactory factory = (QueueConnectionFactory)jndiContext.lookup ( defaultQueueConnectionFactory );

QueueConnection connection = factory.createQueueConnection ( userName, password );

In addition, the message sending code was modified to send the same test message 1000 times in a loop.

The full listing is contained in Appendix 8.4.

During testing, the program prints the following output:

Sent message(1):

Sent message(2):

Sent message(3):

During the failover from one Cluster Node to the other, the output pauses, then continues uninterrupted.

6.1.2 Java Client (FT URL)

This client connects to the EMS Server Instance via a Fault-Tolerant URL of the form:

tcp://<server a>:<port>,tcp://<server b>:<port>


The sample program tibjmsMsgProducer.java was modified only slightly to incorporate the message sending loop described in the previous section and to increase the default Fault-Tolerant connection retry and timeout settings as follows.

String reconnect = new String ( "60, 5000" );

Tibjms.setReconnectAttempts ( reconnect );

System.out.println ( "After change for reconnections: " + Tibjms.getReconnectAttempts () );

ConnectionFactory factory = new com.tibco.tibjms.TibjmsConnectionFactory ( serverUrl );

The full listing is contained in Appendix 8.5.

This test client behaved identically to the one using the Connection Factory URL.

6.1.3 BW Client

This test client consisted of a BW process using a timer to create and send a JMS message to a test queue every second. A second process subscribed to the same queue and pulled the messages off it.

The number of process instances created was monitored via TIBCO Administrator. No errors were seen during the failover testing.

6.2 Failover Testing

6.2.1 Manual

In this mode of testing, the Unix Services Team used the Sun Cluster console to force the migration of a Server Instance from one node to the other.

The observed migration time from the clients' perspective was approximately 15 seconds.

6.2.2 Process Failure

To simulate a real-world problem, the executable permissions were removed from the $TIBCO_HOME/ems/bin/tibemsd file and the running process was terminated with a kill signal.
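A sketch of the sequence used, based on the control script described in section 5.4 (the port number and process id are placeholders):

$ chmod a-x /opt/tibco/ems/bin/tibemsd

$ /opt/tibco/scripts/tibco_ems.sh <port number> check

TIBCO Enterprise Messaging Server (<port number>) running with pid <pid>

$ kill <pid>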

After going through its retry loop, the Cluster Software failed the process over to the other Cluster Node. The test clients were paused for a longer period, approximately 60 seconds: three retry periods of 15 seconds plus the failover time of 15 seconds.

6.2.3 Machine Failure

To simulate a catastrophic machine failure, the active Cluster Node for one of the EMS Server Instances was forcefully rebooted.

The Cluster Software detected the failure after the configured timeout period and migrated the EMS Server Instance to the other Node. The test clients were paused for approximately 30 seconds.


7 Conclusions

7.1 Failover time

The low failover time (circa 15 seconds) in conjunction with the uninterrupted operation of clients makes the use of more expensive distributed lock manager systems unnecessary at “COMPANY XYZ”.

It is felt that this solution meets the business needs for “COMPANY XYZ” at the present time.

Other parameters will affect the failover time, such as:

Size of datastore file system

Number of messages in datastore

Number of clients attempting to reconnect at failover.

However, these factors are common to both a clustered and a distributed-locking solution and are therefore excluded from the decision-making process.

7.2 Use of hardware

The solution detailed here uses all servers in the cluster to host primary EMS servers, thereby making best use of the available hardware.

7.3 Monitoring and Management

The configuration detailed here provides consistent visibility from TIBCO Administrator, the Unix command line, and the Sun Cluster console.

In addition, because the TIBCO Runtime Agent is installed and configured on each Cluster Node with no unusual changes, TIBCO Hawk rules can be used to monitor the health of each Cluster Node and of the EMS Server Instances.

7.4 Local tibemsd.conf

tibemsd.conf may have to reside locally on each server and point to centralized users.conf, factories.conf, etc. under the following conditions (an illustrative excerpt follows the list):

SSL parameters required are unique to each physical server

A specific interface must be entered into the listen parameter
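As an illustration only (the paths and parameter values are placeholders, not part of the tested configuration), such a local tibemsd.conf would keep the host-specific entries on each node and reference the shared files held on the cluster file system:

# Host-specific entries kept local to this Cluster Node.

listen = ssl://<this node's interface>:7020

ssl_server_identity = /opt/tibco/config/ems/ssl/<this node>.p12

# Shared configuration files held on the cluster file system.

users = /var/ems_data/ems_merch/config/users.conf

factories = /var/ems_data/ems_merch/config/factories.conf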


8 Appendices

8.1 tibco_ems.sh

#!/bin/sh
# For all Unix platforms
#
# #########################################################################
# Boot script for Unix platforms
# This script takes two arguments: an EMS port number and "start",
# "stop" or "check".
#
# #########################################################################

# Copyright 2004 TIBCO Software Inc. All rights reserved.

TIBCO_ROOT=/opt/tibco
export TIBCO_ROOT

# All environment variables are set in tibco.sh. Can't proceed further
# if file is missing
if [ -f $TIBCO_ROOT/tibco.sh ]; then
    . $TIBCO_ROOT/tibco.sh
else
    echo "File not found $TIBCO_ROOT/tibco.sh"
    exit 1
fi

# Check that the correct number of options have been passed
if [ $# -ne 2 ]; then
    echo "Usage: $0 [EMS Port] [start|stop|check]" 1>&2
    exit 1
fi

EMS_PORT=$1
EMS_BIN=$TIBCO_ROOT/ems/bin/domain/$TIBCO_DOMAIN_NAME

# Find the script that controls the server on the given Port Number
# Secondary servers have '-1', '-2' etc inserted in the script name

SCRIPT_FILE=`/usr/bin/find $EMS_BIN -name "TIBCOServers-E4JMS*_$EMS_PORT.sh" -print`

# Check that an EMS Server has been installed for this Port Number
if [ -z "$SCRIPT_FILE" ]; then
    echo "EMS Server for port $EMS_PORT not installed"
    exit 1
fi

NOHUP="nohup"

OS_TYPE=`uname -a | awk '{print $1}'`

case $OS_TYPE in
'SunOS')
    ulimit -n 256
    ;;
*)
    ;;
esac

# #########################################################################
# This function checks for a running process.
#


# Takes a single argument : A string to search for in the process table
#
# WARNING: PS COMMAND IN UNIX TRUNCATES OUTPUT IF IT EXCEEDS
# 80 COLUMNS IN WIDTH, THEREFORE, IF THE PATH POINTING
# TO THE JRE/JAVA IS TOO LONG THEN THE GREP FOR
# "tibcoadmin --" BELOW MAY FAIL.
#
# #########################################################################
findPid()
{
    case $OS_TYPE in
    'Linux')
        procpid=`/bin/ps awxf | /bin/grep "$1" | /bin/fgrep -v '\\_' | awk '{print $1}'`
        echo $procpid
        ;;
    *)
        procpid=`/usr/bin/ps -ef | grep "$1" | grep -v "grep" | awk '{print $2}'`
        echo $procpid
        ;;
    esac
}

case "$2" in
# #########################################################################
# Start TIBCO Enterprise Messaging Server
#
# #########################################################################
'start')
    procname="$EMS_PORT/tibemsd.conf"
    pid=`findPid "$procname"`

    if [ "$pid" != "" ]; then
        echo "TIBCO Enterprise Messaging Server ($EMS_PORT) already running"
    else
        cd $CONFIG_ROOT/ems
        if [ -x $SCRIPT_FILE ]; then
            echo "TIBCO Enterprise Messaging Server ($EMS_PORT) starting..."
            $NOHUP $SCRIPT_FILE >/dev/null 2>&1 &
            echo "Started TIBCO Enterprise Messaging Server ($EMS_PORT)"
        else
            echo "EMS Server for port $EMS_PORT not installed"
        fi
    fi
    ;;

# #########################################################################
# Stop TIBCO Enterprise Messaging Server
#
# #########################################################################
'stop')
    procname="$EMS_PORT/tibemsd.conf"
    pid=`findPid "$procname"`

    if [ "$pid" != "" ]; then
        echo "TIBCO Enterprise Messaging Server ($EMS_PORT) stopping."
        kill $pid
    else
        echo "TIBCO Enterprise Messaging Server ($EMS_PORT) not running"
    fi
    ;;

# #########################################################################
# Check if TIBCO Enterprise Messaging Server is running
#
# #########################################################################
'check')


    procname="$EMS_PORT/tibemsd.conf"
    pid=`findPid "$procname"`

    if [ "$pid" != "" ]; then
        echo "TIBCO Enterprise Messaging Server ($EMS_PORT) running with pid $pid"
    else
        echo "TIBCO Enterprise Messaging Server ($EMS_PORT) not running"
    fi
    ;;

# #########################################################################
# Unrecognized Option
#
# #########################################################################
*)
    echo "usage: $0 [EMS Port] [start|stop|check]" 1>&2
    ;;

esac

8.2 tibco_ems_cluster.sh

#!/bin/sh

# Cluster control script for TIBCO EMS
# Greg Mabrito - Oct 25, 2004

TIBCO_HOME="/opt/tibco"
TIBCO_SCRIPTS="$TIBCO_HOME/scripts"
TIBCO_EMS_PORT="7020"

# process command line parameters, if any
case "$1" in
start)
    su - tibco -c "$TIBCO_SCRIPTS/tibco_ems.sh $TIBCO_EMS_PORT start"
    ;;
stop)
    su - tibco -c "$TIBCO_SCRIPTS/tibco_ems.sh $TIBCO_EMS_PORT stop"
    ;;
check)
    RET_VAL=`su - tibco -c "$TIBCO_SCRIPTS/tibco_ems.sh $TIBCO_EMS_PORT check" | grep "running with pid"`
    if [ -n "$RET_VAL" ] ; then
        exit 0
    else
        exit 1
    fi
    ;;
*)
    echo "Usage: $0 {start|stop|check}"
    exit 1
    ;;

esac

exit 0


8.3 Sun Cluster Configuration (scstat -p)

This is the printout from the testing configuration on SYS99115 and SYS99116

------------------------------------------------------------------

-- Cluster Nodes --

                       Node name         Status
                       ---------         ------
  Cluster node:        sys99115          Online
  Cluster node:        sys99116          Online

------------------------------------------------------------------

-- Cluster Transport Paths --

                       Endpoint              Endpoint              Status
                       --------              --------              ------
  Transport path:      sys99115:qfe1         sys99116:qfe1         Path online
  Transport path:      sys99115:eri0         sys99116:eri0         Path online

------------------------------------------------------------------

-- Quorum Summary --

            Quorum votes possible:      3
            Quorum votes needed:        2
            Quorum votes present:       3

-- Quorum Votes by Node --

                       Node Name           Present   Possible   Status
                       ---------           -------   --------   ------
  Node votes:          sys99115            1         1          Online
  Node votes:          sys99116            1         1          Online

-- Quorum Votes by Device --

                       Device Name              Present   Possible   Status
                       -----------              -------   --------   ------
  Device votes:        /dev/did/rdsk/d8s2       1         1          Online

------------------------------------------------------------------

-- Device Group Servers --

                            Device Group              Primary     Secondary
                            ------------              -------     ---------
  Device group servers:     tibco_ems_data_merch      sys99115    sys99116
  Device group servers:     tibco_ems_data_suppch     sys99116    sys99115

-- Device Group Status --

                            Device Group              Status
                            ------------              ------
  Device group status:      tibco_ems_data_merch      Online
  Device group status:      tibco_ems_data_suppch     Online

------------------------------------------------------------------

-- Resource Groups and Resources --

                  Group Name          Resources
                  ----------          ---------
  Resources:      ctibco_merch_rg     lh_cert_tibcomerch HA_ctibco_merch_store ctibco_merch_app
  Resources:      ctibco_suppch_rg    lh_cert_tibcosuppch HA_ctibco_suppch_store ctibco_suppch_app

-- Resource Groups --

             Group Name                 Node Name      State
             ----------                 ---------      -----
  Group: ctibco_merch_rg                sys99115       Online
  Group: ctibco_merch_rg                sys99116       Offline

  Group: ctibco_suppch_rg               sys99116       Online
  Group: ctibco_suppch_rg               sys99115       Offline


-- Resources --

             Resource Name                 Node Name      State     Status Message
             -------------                 ---------      -----     --------------
  Resource: lh_cert_tibcomerch             sys99115       Online    Online - LogicalHostname online.
  Resource: lh_cert_tibcomerch             sys99116       Offline   Offline - LogicalHostname offline.

  Resource: HA_ctibco_merch_store          sys99115       Online    Online
  Resource: HA_ctibco_merch_store          sys99116       Offline   Offline

  Resource: ctibco_merch_app               sys99115       Online    Online
  Resource: ctibco_merch_app               sys99116       Offline   Offline

  Resource: lh_cert_tibcosuppch            sys99116       Online    Online - LogicalHostname online.
  Resource: lh_cert_tibcosuppch            sys99115       Offline   Offline

  Resource: HA_ctibco_suppch_store         sys99116       Online    Online
  Resource: HA_ctibco_suppch_store         sys99115       Offline   Offline

  Resource: ctibco_suppch_app              sys99116       Online    Online
  Resource: ctibco_suppch_app              sys99115       Offline   Offline

------------------------------------------------------------------

-- IPMP Groups --

               Node Name     Group        Status    Adapter   Status
               ---------     -----        ------    -------   ------
  IPMP Group:  sys99115      ipmp827      Online    qfe0      Online
  IPMP Group:  sys99116      ipmp827      Online    qfe0      Online

8.4 tibjmsFactoryQueueSender.java

import javax.jms.*;
import javax.naming.*;
import java.util.*;
import com.tibco.tibjms.Tibjms;

public class tibjmsFactoryQueueSender implements ExceptionListener
{
    String userName = null;
    String password = null;

    String queueName = "queue.sample";

    Vector data = new Vector ();

    static final String providerContextFactory =
        "com.tibco.tibjms.naming.TibjmsInitialContextFactory";

    static final String defaultProviderURLs =
        "tibjmsnaming://localhost:7222, tibjmsnaming://localhost:7222";

    static final String defaultTopicConnectionFactory = "FTTopicConnectionFactory";

    static final String defaultQueueConnectionFactory = "FTQueueConnectionFactory";

    String providerUrls = defaultProviderURLs;

    public tibjmsFactoryQueueSender ( String[] args )
    {
        parseArgs ( args );

        /* print parameters */
        System.out.println ( "\n------------------------------------------------------------------------" );
        System.out.println ( "tibjmsQueueSender SAMPLE" );
        System.out.println ( "------------------------------------------------------------------------" );
        System.out.println ( "Provider URL................. " + providerUrls );
        System.out.println ( "User......................... " + ( userName != null ? userName : "(null)" ) );
        System.out.println ( "Queue........................ " + queueName );
        System.out.println ( "------------------------------------------------------------------------\n" );

        if ( queueName == null )
        {
            System.err.println ( "Error: must specify queue name" );
            usage ();
        }

        if ( 0 == data.size () )
        {
            System.err.println ( "Error: must specify at least one message text" );


            usage ();
        }

        System.err.println ( "Publishing into queue: '" + queueName + "'\n" );

        try
        {
            /*
             * Init JNDI Context.
             */
            Hashtable env = new Hashtable ();

            env.put ( Context.INITIAL_CONTEXT_FACTORY, providerContextFactory );
            env.put ( Context.PROVIDER_URL, providerUrls );

            if ( null != userName )
            {
                env.put ( Context.SECURITY_PRINCIPAL, userName );

                if ( null != password )
                {
                    env.put ( Context.SECURITY_CREDENTIALS, password );
                }
            }

            InitialContext jndiContext = new InitialContext ( env );

            QueueConnectionFactory factory =
                (QueueConnectionFactory)jndiContext.lookup ( defaultQueueConnectionFactory );

            QueueConnection connection = factory.createQueueConnection ( userName, password );

            connection.setExceptionListener ( this );
            Tibjms.setExceptionOnFTSwitch ( true );

            QueueSession session = connection.createQueueSession ( false, javax.jms.Session.AUTO_ACKNOWLEDGE );

            /*
             * Use createQueue() to enable sending into dynamic queues.
             */
            javax.jms.Queue queue = session.createQueue ( queueName );

            QueueSender sender = session.createSender ( queue );

            javax.jms.TextMessage message = session.createTextMessage ();
            String text = (String)data.elementAt ( 0 );
            message.setText ( text );

            /* publish messages */
            for ( int i = 0; i < 1000; i++ )
            {
                sender.send ( message );
                System.err.println ( "Sent message(" + i + "): " + text );
                try { Thread.sleep ( 1000 ); } catch ( Exception e ) { }
            }

            connection.close ();
        }
        catch ( NamingException e )
        {
            e.printStackTrace ();
            System.exit ( 0 );
        }
        catch ( JMSException e )
        {
            e.printStackTrace ();
            System.exit ( 0 );
        }
    }

    public static void main ( String args[] )
    {
        tibjmsFactoryQueueSender t = new tibjmsFactoryQueueSender ( args );
    }

    void usage ()
    {
        System.err.println ( "\nUsage: java tibjmsQueueSender [options]" );
        System.err.println ( "       <message-text1 ... message-textN>" );
        System.err.println ( "" );
        System.err.println ( " where options are:" );


System.err.println ( "" ); System.err.println ( " -provider <provider URL> - EMS server URL, default is local server" ); System.err.println ( " -user <user name> - user name, default is null" ); System.err.println ( " -password <password> - password, default is null" ); System.err.println ( " -queue <queue-name> - queue name, default is \"queue.sample\"" ); System.err.println ( " -help-ssl - help on ssl parameters\n" ); System.exit ( 0 ); }

void parseArgs ( String[] args ) { int i = 0;

while ( i < args.length ) { if ( args[i].compareTo ( "-provider" ) == 0 ) { if ( (i+1) >= args.length ) usage (); providerUrls = args[i+1]; i += 2; } else if ( args[i].compareTo ( "-queue" ) == 0 ) { if ( (i+1) >= args.length ) usage (); queueName = args[i+1]; i += 2; } else if ( args[i].compareTo ( "-user" ) == 0 ) { if ( (i+1) >= args.length ) usage (); userName = args[i+1]; i += 2; } else if ( args[i].compareTo ( "-password" ) == 0 ) { if ( (i + 1) >= args.length ) usage (); password = args[i+1]; i += 2; } else if ( args[i].compareTo ( "-help" ) == 0 ) { usage (); } else if ( args[i].compareTo ( "-help-ssl" ) == 0 ) { tibjmsUtilities.sslUsage (); } else if ( args[i].startsWith ( "-ssl" ) ) { i += 2; } else { data.addElement ( args[i] ); i++; } } }

public void onException ( JMSException exception ) { String strErrCode = exception.getErrorCode (); String strFTSwitch = "FT-SWITCH";

if ( true == strErrCode.startsWith ( strFTSwitch ) ) { String strNewServer = strErrCode.substring ( strFTSwitch.length () + 2 ); System.out.println ( "FT Connection switched to: " + strNewServer ); } else { exception.printStackTrace (); } }}


8.5 tibjmsMsgProducer.java

import javax.jms.*;
import javax.naming.*;
import java.util.*;
import com.tibco.tibjms.Tibjms;

public class tibjmsMsgProducer implements ExceptionListener
{
    /*-----------------------------------------------------------------------
     * Parameters
     *----------------------------------------------------------------------*/
    String  serverUrl = null;
    String  userName  = null;
    String  password  = null;
    String  name      = "topic.sample";
    Vector  data      = new Vector();
    boolean useTopic  = true;

    /*-----------------------------------------------------------------------
     * Variables
     *----------------------------------------------------------------------*/
    Connection      connection  = null;
    Session         session     = null;
    MessageProducer msgProducer = null;
    Destination     destination = null;

    public tibjmsMsgProducer ( String[] args )
    {
        parseArgs ( args );

        try
        {
            tibjmsUtilities.initSSLParams ( serverUrl, args );
        }
        catch ( JMSSecurityException e )
        {
            System.err.println ( "JMSSecurityException: " + e.getMessage () + ", provider=" + e.getErrorCode () );
            e.printStackTrace ();
            System.exit ( 0 );
        }

        /* print parameters */
        System.err.println ( "\n------------------------------------------------------------------------" );
        System.err.println ( "tibjmsMsgProducer SAMPLE" );
        System.err.println ( "------------------------------------------------------------------------" );
        System.err.println ( "Server....................... " + ( ( serverUrl != null ) ? serverUrl : "localhost" ) );
        System.err.println ( "User......................... " + ( ( userName != null ) ? userName : "(null)" ) );
        System.err.println ( "Destination.................. " + name );
        System.err.println ( "Message Text................. " );
        for ( int i = 0; i < data.size (); i++ )
        {
            System.err.println ( data.elementAt ( i ) );
        }
        System.err.println ( "------------------------------------------------------------------------\n" );

        try
        {
            if ( data.size () == 0 )
            {
                System.err.println ( "***Error: must specify at least one message text\n" );
                usage ();
            }

            /* Increase FT Reconnection Settings */
            String reconnect = new String ( "60, 5000" );
            Tibjms.setReconnectAttempts ( reconnect );
            System.out.println ( "After change for reconnections: " + Tibjms.getReconnectAttempts () );

System.err.println ( "Publishing to destination '" + name + "'\n" );

ConnectionFactory factory = new com.tibco.tibjms.TibjmsConnectionFactory ( serverUrl );

connection = factory.createConnection ( userName, password );

connection.setExceptionListener ( this ); Tibjms.setExceptionOnFTSwitch ( true );

/* create the session */ session = connection.createSession ( false, javax.jms.Session.AUTO_ACKNOWLEDGE );

/* create the destination */


            if ( useTopic )
                destination = session.createTopic ( name );
            else
                destination = session.createQueue ( name );

            /* create the producer */
            msgProducer = session.createProducer ( null );

            TextMessage message = session.createTextMessage ();
            String text = (String)data.elementAt ( 0 );
            message.setText ( text );

            /* publish messages */
            for ( int i = 0; i < 1000; i++ )
            {
                /* publish message */
                msgProducer.send ( destination, message );

                System.err.println ( "Sent message(" + i + "): " + text );
                try { Thread.sleep ( 1000 ); } catch ( Exception e ) { }
            }

            /* close the connection */
            connection.close ();
        }
        catch ( JMSException e )
        {
            e.printStackTrace ();
            System.exit ( -1 );
        }
    }

    /*-----------------------------------------------------------------------
     * usage
     *----------------------------------------------------------------------*/
    private void usage ()
    {
        System.err.println ( "\nUsage: java tibjmsMsgProducer [options] [ssl options]" );
        System.err.println ( "       <message-text-1>" );
        System.err.println ( "       [<message-text-2>] ..." );
        System.err.println ( "\n" );
        System.err.println ( " where options are:" );
        System.err.println ( "" );
        System.err.println ( " -server   <server URL>  - EMS server URL, default is local server" );
        System.err.println ( " -user     <user name>   - user name, default is null" );
        System.err.println ( " -password <password>    - password, default is null" );
        System.err.println ( " -topic    <topic-name>  - topic name, default is \"topic.sample\"" );
        System.err.println ( " -queue    <queue-name>  - queue name, no default" );
        System.err.println ( " -help-ssl               - help on ssl parameters" );
        System.exit ( 0 );
    }

    /*-----------------------------------------------------------------------
     * parseArgs
     *----------------------------------------------------------------------*/
    void parseArgs(String[] args)
    {
        int i = 0;

        while (i < args.length)
        {
            if (args[i].compareTo("-server")==0)
            {
                if ((i+1) >= args.length) usage();
                serverUrl = args[i+1];
                i += 2;
            }
            else if (args[i].compareTo("-topic")==0)
            {
                if ((i+1) >= args.length) usage();
                name = args[i+1];
                i += 2;
            }
            else if (args[i].compareTo("-queue")==0)
            {
                if ((i+1) >= args.length) usage();
                name = args[i+1];


                i += 2;
                useTopic = false;
            }
            else if (args[i].compareTo("-user")==0)
            {
                if ((i+1) >= args.length) usage();
                userName = args[i+1];
                i += 2;
            }
            else if (args[i].compareTo("-password")==0)
            {
                if ((i+1) >= args.length) usage();
                password = args[i+1];
                i += 2;
            }
            else if (args[i].compareTo("-help")==0)
            {
                usage();
            }
            else if (args[i].compareTo("-help-ssl")==0)
            {
                tibjmsUtilities.sslUsage();
            }
            else if (args[i].startsWith("-ssl"))
            {
                i += 2;
            }
            else
            {
                data.addElement(args[i]);
                i++;
            }
        }
    }

    /*-----------------------------------------------------------------------
     * main
     *----------------------------------------------------------------------*/
    public static void main ( String[] args )
    {
        tibjmsMsgProducer t = new tibjmsMsgProducer ( args );
    }

    public void onException ( JMSException exception )
    {
        String strErrCode = exception.getErrorCode ();
        String strFTSwitch = "FT-SWITCH";

        if ( true == strErrCode.startsWith ( strFTSwitch ) )
        {
            String strNewServer = strErrCode.substring ( strFTSwitch.length () + 2 );
            System.out.println ( "FT Connection switched to: " + strNewServer );
        }
        else
        {
            exception.printStackTrace ();
        }
    }
}