

Monitoring Remedy with BMC Solutions

Overview
How does BMC Software monitor Remedy with our own solutions? The challenge is manifold with a solution like Remedy, and this applies not only to Remedy but also to competing solutions and to other web-based enterprise solutions.

Analysis
Performance (or the lack of it) can be attributed to a wide variety of factors: some within the software itself, some within the immediate infrastructure, and some within the greater environment, e.g. the internet.

BMC Solutions
As of June 2014, the following stack of solutions can be used for full-stack monitoring of Remedy. Note that these solutions help pinpoint where the issues are (slow performance is usually a combination of issues, a set of bottlenecks) so that they can be addressed; they may not in themselves resolve the issue. BMC has multiple modules that can monitor Remedy (7.6.04 is the oldest version we monitor). Complete monitoring of the Remedy stack should include the following:

From the end-user perspective
  o BMC APM - EUEM (web interface only, for HTTP/HTTPS traffic); additional watchpoints may be required
  o Borland Silk Performer Synthetic Transaction Monitoring for BMC Software (newer replacement of TM-ART)

Mid-tier/application tier
  o BMC APM - Application Diagnostics
  o BPPM for Internet Servers (monitors the web server, such as Apache, Microsoft IIS etc.)

Remedy applications (e.g. Incident Management, Change Management)
  o BMC PATROL Knowledge Module for Remedy AR Server

Back-end database (Oracle, Sybase, MS SQL Server etc.)
  o BPPM for Databases (monitors all databases that Remedy supports)

Any network equipment, such as F5 load balancers
  o Entuity or any other network monitoring tool

Operating system on which Remedy runs (e.g. Windows, Linux, VMs)
  o BPPM for Servers/Virtual Servers (monitors the health of the system, such as CPU, memory, disk, Remedy processes/services, logs)

Hardware platform, storage
  o BMC Performance Manager for Hardware by Sentry Software


Some Causes of Performance Issues
This is NOT an exhaustive list, but it illustrates the complexity and some of the factors that can cause performance issues. They are not mutually exclusive; quite probably the reverse, with a combination of them causing the solution to be ‘slow’.

Browser version / type / cache: some browsers, especially older versions, are significantly slower than others. Internet Explorer is slower than Chrome. Caching settings may also alter performance.

LAN / WAN / Internet connectivity is another recurrent source of performance issues

Server set up, clustering, number of users per JVM in the mid-tier configuration

Hardware specification and balancing (memory, CPU, storage)

Database hardware / configuration – indices, field settings, I/O, values and filters

Frequency of polling / cron tasks

Query efficiency

Default queries pulling too many records at one time

Reporting requirements and efficiency of queries underlying reports

All of the above could have an impact, and each can be examined in far greater depth in order to resolve performance issues.

The Reality
Before looking at an example of how Remedy is monitored, it is important to understand that there is no single solution, and one that is good today may not be so good tomorrow. Things change, such as (not an exhaustive list):

Other traffic on the network

Additional volume of records

Alterations to configuration
  o With or without change management
  o New / updated reports are written
  o New / updated queries are deployed
  o New functionality is deployed

Archiving may be performed periodically

Use of new / different browsers / caching settings

Furthermore, even with just two deployments of Remedy used in similar organizations, there are enough potential differences in environment and configuration that a monitoring setting that works in environment 1 will not necessarily be as useful in environment 2. Again, differences could be:

Geographic coverage

Other network traffic

Hardware differences (e.g. a different manufacturer’s database, or a different version of the database, is deployed)

Different settings (SLAs may be heavily deployed in one environment and not so in another)

Volumes of data transacted as well as stored (archiving may not be taking place)

Different reports and KPIs configured with more or less efficient queries

Different levels of automation deployed

Complexity of the deployment (e.g. Approval process)


Levels of notifications

As a consequence, the details below must be viewed as a guideline and no more when determining what to monitor, how to monitor it, and which solution to use. The details below apply to deploying BMC Software monitoring solutions, but could probably be adapted to other monitoring solutions. This document covers each type of monitoring at a high level for the Production environment.

AR SERVER MONITORING:
The following OS KM parameters are set to alert when the set thresholds are breached.

Windows OS Monitoring Parameters | Threshold: Major Event | Threshold: Critical Event | Occurrences | Incident Ticket | Polling Cycle

Logical Disks [Free space %]
D: | < 15% | < 10% | Immediate | Yes | 2 mins
C: | < 15% | < 10% | Immediate | Yes | 2 mins

Memory
Memory Used in % | > 85% | > 95% | 11 | Yes | 2 mins

CPU
Total Processor Utilization in % | > 85% | > 95% | 9 | Yes | 2 mins

AR Services Status (Up/down)
BMC Remedy Action Request System Server onbmc-s | - | Service down | Immediate | Yes | 5 mins
BMC Remedy Flashboards Server - onbmc-s | - | Service down | Immediate | Yes | 5 mins
McAfee Framework Service | - | Service down | Immediate | Yes | 5 mins
McAfee McShield | - | Service down | Immediate | Yes | 5 mins
McAfee Task Manager | - | Service down | Immediate | Yes | 5 mins
Remote Procedure Call (RPC) | - | Service down | Immediate | Yes | 5 mins
BMC Remedy Email Engine - onbmc-s 1 | - | Service down | Immediate | Yes | 5 mins

Email monitoring using Email Script for Servers which has Email engine Running
Number of emails that have been incorrectly flagged as delivered | - | > 0 | Immediate | Yes | 2 mins
Time since oldest Pending email | - | > 900 Seconds | Immediate | Yes | 2 mins
Emails that are pending delivery | - | > 0 | Immediate | Yes | 2 mins

Parameter | Alert condition | Occurrences | Incident Ticket | Polling Cycle
AR KM Monitoring | Service down | Immediate | Yes | 2 mins
LDAP Port Monitoring | Port Down | Immediate | Yes | 2 mins
TCP Established Connections | > 3000 | Immediate | Yes | 2 mins
Tomcat SSO | Down | Immediate | Yes | 5 mins
Assignment Engine | On Demand Monitoring
Approval Engine | On Demand Monitoring
Reconciliation Jobs | On Demand Monitoring
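The Memory and CPU rows above can be read as simple threshold rules. The following is a minimal sketch of that reading in Python, not BPPM or PATROL configuration. It assumes that the Occurrences column is the number of consecutive polling cycles a threshold must be breached before an event is raised, which the document does not state explicitly. The names `ThresholdRule` and `evaluate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThresholdRule:
    """Hypothetical representation of one row of the threshold table."""
    name: str
    major: float       # major event threshold (percent used)
    critical: float    # critical event threshold (percent used)
    occurrences: int   # ASSUMPTION: consecutive breaching polls required


def evaluate(rule: ThresholdRule, samples: List[float]) -> Optional[str]:
    """Return 'CRITICAL', 'MAJOR', or None based on the most recent polls.

    ASSUMPTION: an event is raised only when the last `occurrences`
    samples all breach the threshold.
    """
    if len(samples) < rule.occurrences:
        return None
    recent = samples[-rule.occurrences:]
    if all(s > rule.critical for s in recent):
        return "CRITICAL"
    if all(s > rule.major for s in recent):
        return "MAJOR"
    return None


# Threshold values taken from the Memory and CPU rows of the table.
MEMORY = ThresholdRule("Memory Used in %", major=85, critical=95, occurrences=11)
CPU = ThresholdRule("Total Processor Utilization in %", major=85, critical=95, occurrences=9)
```

Under this reading, a single CPU spike above 95% does not raise an event; nine consecutive polls at a 2-minute cycle would have to breach the threshold first.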


However, many other parameters are also monitored and used for analysis.


MID TIER MONITORING:
Individual mid-tier servers are monitored in the same way as the AR servers, with an additional set of processes and services specific to the mid-tier server.

Windows OS Monitoring Parameters | Threshold: Major Event | Threshold: Critical Event | Occurrences | Incident Ticket | Polling Cycle

Logical Disks [Free space %]
D: | < 15% | < 10% | Immediate | Yes | 2 mins
C: | < 15% | < 10% | Immediate | Yes | 2 mins

Memory
Memory Used in % | > 85% | > 95% | 11 | Yes | 2 mins

CPU
Total Processor Utilization in % | > 85% | > 95% | 9 | Yes | 2 mins

Mid-tier Services Status (Up/down)
Apache Tomcat Tomcat6 | - | Service down | Immediate | Yes | 5 mins
McAfee Framework Service | - | Service down | Immediate | Yes | 5 mins
McAfee McShield | - | Service down | Immediate | Yes | 5 mins
McAfee Task Manager | - | Service down | Immediate | Yes | 5 mins
Remote Procedure Call (RPC) | - | Service down | Immediate | Yes | 5 mins

Parameter | Alert condition | Occurrences | Incident Ticket | Polling Cycle
TCP Established Connections | > 3000 | Immediate | Yes | 10 mins


There are a lot of other parameters monitored using the OS KM, as shown below.


DASHBOARD AND ANALYTICS SERVER MONITORING:
Along with the standard OS monitoring, the following services and processes are monitored for the Dashboard servers:

Parameter | Major Event | Critical Event | Occurrences | Incident Ticket | Polling Cycle

Processes Status (Up/down)
CIA | NA | Process down | Immediate | Yes | 5 mins

Dashboard Services Status (Up/down)
Apache Tomcat | NA | Service down | Immediate | Yes | 5 mins
BOE120MySQL | NA | Service down | Immediate | Yes | 5 mins
BMC Atrium DIL Repository | NA | Service down | Immediate | Yes | 5 mins
McAfee Framework Service | NA | Service down | Immediate | Yes | 5 mins
McAfee McShield | NA | Service down | Immediate | Yes | 5 mins
McAfee Task Manager | NA | Service down | Immediate | Yes | 5 mins
Remote Procedure Call (RPC) | NA | Service down | Immediate | Yes | 5 mins
BMC Atrium DIL Server | NA | Service down | Immediate | Yes | 5 mins
Server Intelligence Agent (onbmc_ada) | NA | Service down | Immediate | Yes | 5 mins

Report Execution | On Demand
A report is configured to run at regular intervals to identify whether there is an issue with the BO and DB.

F5 LOAD BALANCER MONITORING USING CUSTOMIZED PATROL KM:
With the help of the F5 load balancer KM, we monitor the status of active pool members in F5. If any pool member goes down, an alert is sent to BPPM.

DATABASE MONITORING:
The SQL database is monitored for multiple parameters; the ones below are used for alerting, to keep an eye on the heart of the AR System.

Parameters | Alarm conditions | Alarm | Occurrences | Incident Ticket | Polling Cycle
Suspect Database | Any database | Yes | Immediate | Yes | 4 hours
SQL Server Agent Job Failures | Any job failure | Yes | Immediate | Yes | 15 mins
Blocker Procs | Any blocking process, if the blocking persists for more than 30 secs | Yes | Immediate | Yes | 5 mins
SQL Agent Status | When service is down | Yes | Immediate | Yes | 5 mins
SQL Server Status | When service is down | Yes | Immediate | Yes | 5 mins
Cache Hit Ratio | < 90 | Yes | Immediate | Yes | 5 mins
Long Running Trans | > 300 sec. The alert should contain the session id executing the transaction along with the user name for the session | Yes | Immediate | Yes | 5 mins
Deadlock | Warning alert for any deadlock. The alert needs to have the session ids of all the sessions that are involved in the deadlock | Yes | Immediate | Yes | 5 mins
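Three of the alarm conditions above are plain numeric comparisons, and a short sketch may make them concrete. The Python below is an illustration only, not how the BPPM database KM is implemented. The function name `database_alarms` and the metric keys are hypothetical, and how the metrics are collected from SQL Server is outside its scope.

```python
from typing import Dict, List


def database_alarms(metrics: Dict[str, float]) -> List[str]:
    """Return the names of the alarm conditions that the metrics breach.

    The metric keys are hypothetical; thresholds come from the table:
    Cache Hit Ratio < 90, blocking > 30 secs, transaction > 300 secs.
    Missing metrics are treated as healthy.
    """
    alarms: List[str] = []
    if metrics.get("cache_hit_ratio", 100.0) < 90:
        alarms.append("Cache Hit Ratio")
    if metrics.get("longest_blocking_secs", 0.0) > 30:
        alarms.append("Blocker Procs")
    if metrics.get("longest_transaction_secs", 0.0) > 300:
        alarms.append("Long Running Trans")
    return alarms
```

A real alert for Long Running Trans or Deadlock would also need to carry the session ids and user name, as the table requires; this sketch only decides whether an alarm applies.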

Disk Space Monitoring for Databases:

Maintain 50% free storage space for all production DB servers

Warning alert threshold set at 25% free space
  o Email, alerting and escalation

Critical alert threshold set at 20% free space
  o Email, alerting and escalation
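The disk space thresholds above map onto a small classification function. This is a minimal sketch, assuming an alert applies when free space is at or below the stated threshold (the document does not say whether the boundary is inclusive). The function name `classify_free_space` is hypothetical, and the 50% figure is treated as a capacity target rather than an alert level.

```python
def classify_free_space(free_pct: float) -> str:
    """Classify a database server's free disk space percentage.

    ASSUMPTION: thresholds are inclusive (<=). Values come from the
    list above: warning at 25% free, critical at 20% free.
    """
    if free_pct <= 20:
        return "CRITICAL"
    if free_pct <= 25:
        return "WARNING"
    return "OK"
```

For example, a server with 22% free space would be classified as a warning, and one with 18% free as critical.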

TMART MONITORING:
We run a synthetic transaction named HLAL, which goes to the Homepage, performs a Login, opens the Application Listing and then does a Logout, to check that the AR application is working and to measure any performance degradation. Availability is checked from within the data centre; on a case-by-case basis we also run the transactions from remote data centres.


ALERTING:
Production:
  HLAL: Availability or Accuracy < 100% for 2 consecutive cycles. Requiring two cycles avoids an increase in false alerts caused by a network glitch or by a browser or monitoring-application issue.
  HLAL_perf: Login Response Time > 10 seconds for 5 consecutive cycles. An analysis across multiple ITSM systems found a 10-second login time to be the benchmark for deciding whether performance is indeed getting worse. We run the script every 2 seconds.
Dev & QA:
  HLAL: Availability or Accuracy < 100% for 2 consecutive cycles.
  No performance alerting for Dev and QA URLs.
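The consecutive-cycle rules above can be sketched as follows. This is an illustrative Python reading of the rules, not the TMART or Silk Performer configuration; the function names `trailing_breaches`, `hlal_alert` and `hlal_perf_alert` are hypothetical.

```python
from typing import Callable, Sequence


def trailing_breaches(values: Sequence[float], breached: Callable[[float], bool]) -> int:
    """Count the most recent consecutive cycles that breached the condition."""
    count = 0
    for value in reversed(values):
        if not breached(value):
            break
        count += 1
    return count


def hlal_alert(availability_pct: Sequence[float]) -> bool:
    """HLAL: availability (or accuracy) below 100% for 2 consecutive cycles."""
    return trailing_breaches(availability_pct, lambda v: v < 100) >= 2


def hlal_perf_alert(login_seconds: Sequence[float]) -> bool:
    """HLAL_perf: login response time above 10 seconds for 5 consecutive cycles."""
    return trailing_breaches(login_seconds, lambda v: v > 10) >= 5
```

Counting only the trailing run means a single failed cycle between two successful ones never alerts, which is the false-alert suppression the rule is meant to provide.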

INDIVIDUAL AR SERVER MONITORING USING AR SERVER KM:
The PATROL Knowledge Module for AR Server is used to monitor the availability of individual AR Servers. It is configured when an AR Server group is implemented. The KM uses Java-based drivers to connect to each individual AR Server, and detects its basic performance and availability.

DEV & QA ENVIRONMENT MONITORING:
Development and Quality Assurance environments are monitored for availability using TMART; only disk utilization and McAfee services are monitored in BPPM. Availability of URLs is monitored using the TMART transaction HLAL, which goes to the Homepage, performs a Login, opens the Application Listing and finally does a Logout.


BPPM is used only to monitor disk space and McAfee-related services. The following parameters are monitored.

Windows OS Monitoring Parameters | Threshold: Major Event | Threshold: Critical Event | Occurrences | Incident Ticket | Polling Cycle

Logical Disks [Free space %]
D: | < 15% | < 10% | Immediate | Yes | 2 mins
C: | < 15% | < 10% | Immediate | Yes | 2 mins

Services Status (Up/down)
McAfee Framework Service | - | Service down | Immediate | Yes | 5 mins
McAfee McShield | - | Service down | Immediate | Yes | 5 mins
McAfee Task Manager | - | Service down | Immediate | Yes | 5 mins

Acknowledgements
With thanks to: Franco Ferrero, Bob Mosely, Nick Goff, Theodore Cory


1. For each of the monitoring layers outlined below we want to know the specific monitoring targets and their default thresholds

  a. Mid-tier/application tier
    i. BMC APM - Application Diagnostics
       Question: What app threshold do you monitor for?
    ii. BPPM for Internet Servers (monitors the web server, such as Apache, Microsoft IIS etc.)
       Question: Which Apache thresholds? JMX monitoring points? Etc.

  b. Remedy applications (e.g. Incident Management, Change Management)
    i. BMC PATROL Knowledge Module for Remedy AR Server
       Question: What specific things is the KM for Remedy AR Server monitoring? We want the details and default thresholds please.
       Answer: The KM monitors AR Application status and AR Server Statistics. As of now there are no thresholds set.
       Metrics of AR Server Statistics

  c. Back-end database (Oracle, Sybase, MS SQL Server etc.)
    i. BPPM for Databases (monitors all databases that Remedy supports)
       Question: We use Oracle Enterprise Manager (OEM) for DB monitoring. We want to ensure we have the default Oracle DB monitoring points and thresholds provided, so that we can sync them.
       Answer: Database monitoring includes availability of the database, tablespace usage, and utilization of database-related filesystems.

  d. Operating system on which Remedy runs (e.g. Windows, Linux, VMs)
    i. BPPM for Servers/Virtual Servers (monitors the health of the system, such as CPU, memory, disk, Remedy processes/services, logs)
       Question: Again, what Remedy processes, services, and logs are monitored, and what are the default thresholds?
       Answer: OS monitoring for Windows includes Total CPU, % of Memory used and Disk Freespace. Processes include arcmdbd, armonitor, arplugin, arrecond, arserver, arsvcdsp, slmbrsvc, slmcollsvc. The arerror log is monitored for plugin errors.
       Default OS threshold for Windows

       Apart from this, we use a custom Patrol KM to create blackouts for Change suppressions.