
Polygraph: System for Dynamic Reduction of False Alerts in Large-Scale IT Service Delivery Environments∗

Sangkyum Kim
UIUC

[email protected]

Winnie Cheng, Shang Guo, Laura Luan, Daniela Rosu

IBM Research
{wcheng,sguo,luan,drosu}@us.ibm.com

Abhijit Bose
Google

[email protected]

Abstract

In order to avoid critical SLA violations, service providers use monitoring technology to automate the identification of relevant events in the performance of managed components and forward them as incident tickets to be resolved by system administrators (SAs) before a critical failure occurs. For optimal cost and performance, monitoring policies must be finely tuned to the behavior of the managed components, such that SAs are not engaged in the investigation of false alerts. Existing approaches to tuning monitoring policies rely heavily on highly skilled SA work, with high costs and long completion times. Polygraph is a novel architecture for automated tuning of monitoring policies towards reducing false alerts. Polygraph integrates multiple types of service management data into an active-learning approach to the automated generation of new monitoring policies. SAs need to be involved only in the verification of policies with low projected scores. Experiments with a trace of 60K monitoring events from a large IT service delivery infrastructure compare methods for threshold adjustment in alert policy predicates with respect to their potential for false alert reduction. Selected methods reduce false alerts by up to 50% compared to baseline methods.

1 Introduction

Proactive prevention of and timely response to failures with minimal operational costs is a major target for service providers in large-scale IT infrastructures. In order to achieve this target, service providers use monitoring infrastructures such as IBM Tivoli 7 [5] and HP ServiceCenter 7 [4] to monitor the performance of the managed components and identify critical events. Such events are forwarded to incident management systems, and actions are taken before Service Level Agreement (SLA) violations occur.

∗ Partially supported by US National Science Foundation grant IIS-0905215 and the Blue Waters sustained-petascale computing project under NSF grant OCI 07-25070 and the state of Illinois.

The monitoring agents deployed on managed components use pre-defined policies to generate events based on Key Performance Indicators (KPIs) and component execution contexts. Other nodes in the monitoring infrastructure perform event aggregation and incident ticket generation based on policies that aggregate in time and space (i.e., across multiple systems). Eventually, SAs analyze the auto-generated incident tickets, called alerts, to prevent or solve failures.

The effectiveness of the monitoring policies in capturing critical events with limited false positives and negatives has an impact on the service management costs, including the cost of SA time spent on incident management and the cost of SLA penalties. Previous work [2] and our observation of large-scale delivery infrastructures confirm the difficulty of configuring the monitoring policies for an effective cost balance because of the complexity of the monitored systems and applications.

This paper addresses the problem of effective configuration of the monitoring policies with an automated approach to dynamically modeling system behavior, generating monitoring policies, and deploying them. The novelty draws from the integration of diverse service management content (e.g., incident tickets, system vitals) and operational domains (e.g., customers, clusters) and from the use of machine learning to model system behavior and to assess policy effectiveness. As a result, our system, called Polygraph, can automatically reduce the number of false-positive alerts, called false alerts, with limited or no impact on the identification of true alerts. Thus, Polygraph reduces the SA work involved in both alert handling and policy tuning.

Polygraph builds on two novel principles. The first principle is to learn from SAs, namely, to learn from the resolutions of historical incident tickets handled by SAs which alert instances can be safely ignored. This principle enables effective assessment of policies without elaborate, resource-consuming system analysis.


Figure 1: Polygraph system architecture and environment. (Diagram: the Polygraph components, False Alert Detector, Alert Policy Generator, Alert Policy Evaluator/Simulator, and Monitoring Rule Dispatcher, exchange events, false alerts, and proposed/new/modified alert policies with the monitoring system management, the monitoring agents and rules on monitored servers (CPU, disk, memory, applications), incident management and its problem tickets, system configuration data, resource utilization and performance data, and operation data such as SLA and maintenance schedules; new rules pass through expert review before policy deployment.)

The second principle is to leverage component similarity in large-scale environments in order to expand the input size for Polygraph learning tasks and thus improve accuracy. Similarity is also used for policy deployment such that false alerts are reduced even for servers that have not yet experienced similar events but are likely to experience them in the future. Towards this end, Polygraph uses temporal correlation of system vitals, configuration details, and change operation details.

Figure 1 illustrates the integration of Polygraph within a typical service management infrastructure and its main components. Polygraph interacts closely with the monitoring infrastructure and leverages data from configuration management databases, repositories of historical system vitals, incident and change management systems, and repositories of SLA and maintenance data.

The Polygraph prototype described and evaluated in this paper is focused on the analysis of alerts and the generation of new policies. The implementation uses the IBM Tivoli policy specification language. However, our proposal does not depend on a specific monitoring technology and can be extended to any IT service delivery environment. The evaluation compares several approaches for the generation of new monitoring policies, and it is based on over 60K monitoring events and the related incident tickets and system vitals from a large IT service provider.

Overall, this work makes several contributions: (1) we design a system architecture for continual refinement and assessment of policies based on integrated exploitation of diverse service management data, (2) we propose techniques for the identification of false alerts by mining historical incident resolutions, and (3) we propose techniques for the generation and assessment of new monitoring policies that can significantly reduce the volume of false alerts.

In this paper, true alert identifies an alert instance for which SA intervention is required to solve a critical situation, and false alert identifies an alert instance that is cleared without any fix performed by the SA, yet involves the SA in system status checks.

The next section discusses related work, Section 3 describes the Polygraph architecture and implementation details, and Section 4 presents the evaluation of the Polygraph prototype. Finally, Section 5 summarizes our results. Refer to [7] for more extensive evaluation results and related work.

2 Related Work

There are several well-known commercial and community-supported platforms for monitoring datacenter and IT infrastructure components, including IBM Tivoli Monitoring [6] and HP OpenView [4]. Most of these products support alert suppression based on statically specified policies. Our proposal uses policies dynamically generated by mining alert streams and other service management data.

Previous work in the area of incident prevention and resolution in large-scale IT infrastructures used incident classification techniques based on performance metrics [1]. Polygraph uses a similar approach of learning from historical service management data in order to discover alert patterns and generate new policies.

In the area of dynamic generation of alert policies, the closest to our work is the approach in [2] for automated and adaptive threshold setting based on application Service Level Objectives (SLOs). This approach is hard to apply in large IT infrastructures due to the complexity of the method and the difficulty of acquiring component dependency graphs.

A large body of work relates to false-alert reduction in intrusion detection systems (IDS). The use of data mining methods applied to historical data [9] and of multi-stage analysis architectures renders our work similar to some IDS solutions [8]. However, the binary IDS alert model (real vs. false attack) is different from the model of performance monitoring alerts, which includes quantified resource usage levels. Thus, this work cannot leverage the IDS methods.

Previous work [3] underlines the difficulty of classifying rare events, because traditional learning classifiers are often biased towards the most common events. The authors argue that if the events are rare and not too costly, the learning algorithms can do little to improve. When events have high costs (e.g., SLA penalties), a larger number of false alarms must be tolerated. This matches a best practice in IT Service Delivery for the management of high-risk SLOs.

3 Polygraph Framework

For a competitive IT service delivery infrastructure, the monitoring infrastructure must minimize the number of false alerts in order to keep operational costs low and must maximize the number of true alerts in order to prevent SLA failures. This goal can be achieved with detailed analysis and fine-tuning of alert policies. However, even with state-of-the-art tooling, this approach requires substantial SA effort and skills, prohibiting large-scale adoption.

Polygraph, the framework proposed in this paper, specifically addresses these limitations through the automation of monitoring policy evaluation and fine-tuning. The goal of Polygraph is to identify false alerts and design new monitoring policies that lower the occurrence of false alerts with negligible impact on true alerts.

The Polygraph method for false-alert reduction comprises: (1) learning the patterns of true and false alerts, (2) generating new policies based on the alert patterns, and (3) assessing the impact of the new policies. Figure 1 illustrates the component architecture that implements this method through integration with the overall service delivery infrastructure. Polygraph comprises four functional components (see Figure 1). The False Alert Detector performs the analysis of current alert specification effectiveness and false-alert detection. The module distinguishes false and true alerts by learning from the SAs' assessment of past incident resolutions. Alerts are associated with details about the related policy and KPI thresholds, which are used in the next stages.

The Monitoring Policy Generator generates monitoring policies based on observed false-alert patterns. Data from similar servers is integrated in order to improve result quality.

The Monitoring Policy Evaluator evaluates newly generated monitoring policies with a focus on SLA impact (minimizing missed true alerts) and work reduction (maximizing eliminated false alerts). Evaluation is based on simulation against historical system vitals and monitoring events. Policies with an acceptable balance of SLA impact and reduction move to deployment, while others can be passed to SAs for further evaluation.

Monitoring Policy Deployment performs the deployment of new monitoring policies in close interaction with the monitoring infrastructure. The urgency of deployment is assessed based on server profiles and used for scheduling policy deployment.

The remainder of this section focuses on selected elements of the Polygraph implementation.

3.1 Policy Model

In general, an alert policy is defined by a user-specified predicate over component KPIs or context parameters. The policy may also include threshold conditions on event reoccurrence in order to delay alert generation. The Polygraph prototype works with three types of policy: [BASIC] IF A; [AND] IF A AND B; and [OR] IF A OR B. Here, A and B are predicate units consisting of one KPI reference and a related threshold value. Our methods apply to more complex predicate expressions by conversion to conjunctive normal form.

In the Polygraph prototype, we consider only threshold-based alert policies, which have predicates that refer only to KPIs and threshold values. The analyzed alert traces show that most false alerts result from threshold-based policies. Sample policies for each type:

• IF (System.VirtualMemoryUtilization > 90)

• IF (NT Physical Disk.Disk Time > 80) AND (NT Physical Disk.Disk Time <= 90)

• IF (SMP CPU.CPU Status = 'off-line') OR (SMP CPU.Avg CPU Busy 15 > 95)
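To make the policy model concrete, the following Python sketch shows one possible in-memory representation of such threshold-based policies and of predicate evaluation over a snapshot of KPI values. It is an illustration only, not the Tivoli policy specification language used by the prototype; the class names, the KPI-dictionary input format, and the operator set are our assumptions.

    # Hedged sketch of the BASIC/AND/OR policy model (not the Tivoli language).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PredicateUnit:
        kpi: str            # e.g. "System.VirtualMemoryUtilization"
        op: str             # ">", ">=", "<", "<=", or "="
        threshold: float

        def holds(self, kpis: dict) -> bool:
            value = kpis.get(self.kpi)
            if value is None:
                return False
            return {">": value > self.threshold,
                    ">=": value >= self.threshold,
                    "<": value < self.threshold,
                    "<=": value <= self.threshold,
                    "=": value == self.threshold}[self.op]

    @dataclass
    class AlertPolicy:
        units: List[PredicateUnit]
        combine: str = "BASIC"      # "BASIC", "AND", or "OR"

        def fires(self, kpis: dict) -> bool:
            results = [u.holds(kpis) for u in self.units]
            if self.combine == "AND":
                return all(results)
            if self.combine == "OR":
                return any(results)
            return results[0]       # BASIC: a single predicate unit

    # Mirrors the first sample policy above:
    basic = AlertPolicy([PredicateUnit("System.VirtualMemoryUtilization", ">", 90)])
    assert basic.fires({"System.VirtualMemoryUtilization": 93})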

3.2 Learning Alert Patterns

Alerts are identified by analysis of incident ticket details. The alert-related tickets are automatically generated and are characterized by a text pattern that may include details about the policy and the monitoring event (KPIs and values). Polygraph includes methods for the detection of text patterns for alert identification. For each alert, it extracts KPI values, alert policy references, and managed component references. Further, text analysis of the incident resolution description helps classify an alert as a true alert or a false alert.

Given a scope defined by a target alert policy P and a server or group of servers H, Polygraph scans the related historical content to identify alerts and constructs the true set, comprising the identified true alerts, and the false set, comprising the false alerts. A sample approach to capturing alert patterns is based on KPI value patterns, i.e., the ranges of KPI values for each of the true and false sets. Polygraph also includes methods for the discovery of time patterns, discussed next.
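The following sketch illustrates how the true and false sets for a scope (policy P, host group H) might be assembled from historical tickets, assuming the text-pattern extraction described above has already populated structured ticket fields. The field names and the resolution phrases used to recognize false alerts are hypothetical placeholders for the patterns learned from the data.

    import re

    # Hypothetical resolution phrases indicating that no fix was performed.
    FALSE_RESOLUTION_HINTS = re.compile(
        r"no action (required|taken)|false alarm|cleared without|self[- ]resolved", re.I)

    def build_alert_sets(tickets, policy_id, hosts):
        """Split the alerts of the given scope into a true set and a false set."""
        true_set, false_set = [], []
        for t in tickets:                       # t: dict of extracted ticket fields
            if t["policy_id"] != policy_id or t["host"] not in hosts:
                continue
            alert = {"host": t["host"], "kpi_value": t["kpi_value"], "time": t["open_time"]}
            if FALSE_RESOLUTION_HINTS.search(t["resolution_text"]):
                false_set.append(alert)         # cleared without a fix: false alert
            else:
                true_set.append(alert)          # SA intervention was needed: true alert
        return true_set, false_set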

3.3 Policy Generation with Value Patterns

An alert value pattern is described by the true value range, the range of KPI values that corresponds to alerts in the true set. Without loss of generality, we assume monotonic behavior for KPI values: if a value corresponds to a true alert, then all alerts for higher KPI values are true alerts.

Proposition 1. For a given alert policy P consisting of one KPI parameter p and its corresponding threshold θ, let T be the true set of P and t = min(T). If t > θ, then P modified to use t as its threshold will not generate any false negatives for the given dataset.

Proposition 1 describes how to tune the threshold value for a BASIC alert policy. If the smallest KPI value t for true alerts is bigger than the original threshold, then the new threshold value is set to t and no true alerts are missed. False alerts related to KPI values between the old and new thresholds are eliminated, while those for KPI values above the new threshold remain. As an example, given a policy governing a CPU threshold, if all true alerts happen for CPU values greater than or equal to 95%, we can safely raise the original threshold of 90% to 95%. If a false alert occurs for a CPU load of 98%, it is not eliminated by the new threshold setting.
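A minimal sketch of the BASIC threshold adjustment implied by Proposition 1, assuming the true set is represented as in the earlier sketches:

    def adjust_basic_threshold(true_set, current_threshold):
        """Raise the threshold to min(T) when that value exceeds the current one."""
        if not true_set:
            return current_threshold            # no true alerts to learn from
        t = min(a["kpi_value"] for a in true_set)
        return t if t > current_threshold else current_threshold

    # CPU example from the text: true alerts only at values >= 95, threshold 90.
    assert adjust_basic_threshold([{"kpi_value": 95}, {"kpi_value": 98}], 90) == 95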

3.4 Policy Generation with Time Patterns

To further improve false alert reduction, Polygraph takes time patterns into account. Periodic patterns of jobs, such as daily, weekly, or even monthly jobs, can significantly affect resource consumption and trigger alerts, false or true. The method comprises finding periodic patterns in the learning set and extrapolating these patterns in the analysis of the remaining historical content. In order to limit the risk of missing true alerts by extrapolation, we focus on the periodicity of true alerts rather than on false alerts.

Given a scope of policy and server(s), periodicity analysis starts with the related true set of the policy and produces a set of periodic time intervals, called true time ranges, during which all of the occurring alerts are considered true alerts. Alerts that occur outside the true time ranges are considered false alerts. Periodicity analysis requires the specification of a threshold for the width of a true range in order to mine a set of these true ranges for each alert policy. For example, suppose host H has three true events at 3pm, 4pm, and 8pm. Given a true time range threshold of 1 hour, the analysis results in two true time ranges, (2pm-5pm) and (7pm-9pm). Smaller thresholds lead to more false alert detection, but increase the risk of missing true alerts.
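One plausible implementation of true time range mining that matches the example above is to place a window of plus/minus the range threshold around each true alert's time of day and merge overlapping windows. The sketch below makes that assumption and, for brevity, ignores wrap-around at midnight.

    def true_time_ranges(true_times_min, width_min):
        """true_times_min: minutes since midnight of each true alert in the scope."""
        windows = sorted((t - width_min, t + width_min) for t in true_times_min)
        merged = []
        for lo, hi in windows:
            if merged and lo <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], hi))  # extend last range
            else:
                merged.append((lo, hi))
        return merged

    def is_false_by_time(alert_time_min, ranges):
        # An alert outside every true time range is considered a false alert.
        return not any(lo <= alert_time_min <= hi for lo, hi in ranges)

    # True alerts at 3pm, 4pm, 8pm with a 1-hour threshold -> (2pm-5pm), (7pm-9pm).
    assert true_time_ranges([900, 960, 1200], 60) == [(840, 1020), (1140, 1260)]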

3.5 Server-Based Policy Tuning

The typical approach to defining alert policies in service delivery infrastructures is to use the same best-practice policies for all servers with similar installed software and workloads. While this approach yields acceptable alert accuracy early in a server's lifetime, it is no longer efficient later in the server's lifetime, across capacity changes (e.g., memory or hard disk upgrades), dynamic re-clustering for workload balancing, and other events. For example, for a server whose hard disk capacity increases from 100GB to 10TB, an alert threshold of 90% utilization will generate alerts when 1TB of disk space is still available.

To address these limitations, Polygraph takes the approach of tuning alert policy thresholds for each server, and our experiments show significant benefits. Polygraph does not exclude the tuning of alerts for groups of servers with similar configurations and workloads, which we plan to integrate in the future.

3.6 New Policy Evaluation

In the evaluation phase, Polygraph scans the set of alerts for the current scope (i.e., the union of the true and false sets) and compares the KPI values against the new policy threshold. The new counters of true and false alerts are compared with the old counters in order to assess the risk (true alert misses) and the benefit (false alert reduction). Installation-specific thresholds on the impact metrics are used to determine whether the new policy (1) is highly beneficial with very low impact and should be automatically deployed, (2) has borderline benefits and risks and should be forwarded for SA analysis, or (3) has limited benefits and higher risks and should be discarded.
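The evaluation step can be sketched as a replay of the scope's alerts against the new threshold followed by a three-way decision. The cutoff values below stand in for the installation-specific thresholds mentioned above and are purely illustrative.

    def evaluate_new_policy(true_set, false_set, new_threshold,
                            max_miss_rate=0.0, min_reduction=0.3, review_miss_rate=0.05):
        # Alerts whose KPI value falls below the new threshold would no longer fire.
        missed_true = sum(a["kpi_value"] < new_threshold for a in true_set)
        removed_false = sum(a["kpi_value"] < new_threshold for a in false_set)
        miss_rate = missed_true / len(true_set) if true_set else 0.0
        reduction = removed_false / len(false_set) if false_set else 0.0
        if miss_rate <= max_miss_rate and reduction >= min_reduction:
            return "deploy"       # highly beneficial with very low impact
        if miss_rate <= review_miss_rate and reduction > 0:
            return "review"       # borderline benefits and risks: forward to an SA
        return "discard"          # limited benefit and/or higher risk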

4 Evaluation

Our empirical results are based on large and detailed datasets collected from globally distributed production environments serving real clients. We collected 30-day datasets with around 60K events. We divide the datasets of system performance metrics, alerts, and events into six parts (5 days each) based on their occurrence time: older data is used for learning and recent data for testing. To show the effects of training data size on our alert threshold adjustment schemes, we use the first five parts to build five differently sized training sets (5, 10, 15, 20, and 25 days) and the last part as test data. By default, the true time range threshold is 1 hour.
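For concreteness, the train/test split described above can be sketched as follows; the event representation (a "time" field holding a datetime) is an assumption.

    def split_by_days(events, days_per_slice=5, n_slices=6):
        """Order events into six 5-day slices; the last slice is the test set and
        the first k slices (k = 1..5) form training sets of 5..25 days."""
        slices = [[] for _ in range(n_slices)]
        start = min(e["time"] for e in events)
        for e in events:
            idx = min((e["time"] - start).days // days_per_slice, n_slices - 1)
            slices[idx].append(e)
        test = slices[-1]
        train_sets = [sum(slices[:k], []) for k in range(1, n_slices)]
        return train_sets, test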

The experiments in this section compare various methods for new policy generation with respect to the ability of the automatically generated policies to eliminate false alerts. Namely, the experiments present the ratio of false alerts for which the generated policies do not trigger alerts. We refer to the not-triggered false alerts as 'detected false alerts'. The larger the share of detected false alerts, the larger the reduction enabled by the Polygraph framework.

4.1 Data Characteristics

In the target service delivery environment, an alert policy specification comprises several fields, including the target server and predicate descriptions. The threshold values have never been changed since initial deployment (which is typical in most environments).

System KPI values are collected at one-minute intervals and are aggregated over different time windows such as 15 minutes, 1 hour, 1 day, and longer.

We use two widely deployed alert policies, P1 and P2, in our experiments to show the effectiveness of the Polygraph framework. Table 1 shows their characteristics, including total alert count, share of the total alert trace, and distribution across true and false alerts; the actual policy expressions are not disclosed for privacy reasons.


Table 1: Characteristics of select alert policies

    Alert Policy   Count   Ratio (%)   True Alerts     False Alerts
    P1             23355   40.48       1026 (4.39%)    22329 (95.61%)
    P2             3344    5.80        1526 (45.63%)   1884 (56.34%)

Figure 2: Host-based false alert detection with varying learning set size. (Plot: detection rate (%) vs. training data size of 5-25 days, with four series: P1 total detected false events, P1 detected false events from hosts with true sets, P2 total detected false events, and P2 detected false events from hosts with true sets.)

P1 has a large volume of alerts, of which 95% are false alerts. P1 is a good example of how Polygraph can automatically and effectively tune an alert policy threshold. P2 illustrates the case where Polygraph needs expert SA review to prevent abuse of automation.

4.2 Basic Threshold Adjustment

The basic threshold adjustment method comprises policy generation using value patterns and including all servers, i.e., the entire data set is used for assessment of the true set. Neither the P1 nor the P2 alert policy gains anything from this method, even though the current policy thresholds are not optimal. This is because servers with the same alert policy are not similar with respect to the related datasets, as shown by the experiments below.

4.3 Host-Based Adjustment

This experiment compares the basic threshold adjustment (i.e., analysis across all servers) with host-based adjustment (i.e., server-based analysis) including only the servers whose training data includes true alerts. Figure 2 illustrates the false alert detection rates of P1 and P2 as the training set increases. Note that the difference between the plots for each policy indicates the false alert detection rate for servers whose training data do not have any true events. In general, we observe that more false alerts are detected as the training set gets smaller. P1 shows a high rate of false alert detection, whereas P2 shows a low detection rate. For P2, we see that at least 10 days of training data is needed for reliable performance. The spike for P2 is due to the large number of servers with no true events in the training set.

Figure 3: Host and time-based false alert detection with varying learning set size. (Plot: detection rate (%) vs. training data size of 5-25 days, for the same four P1/P2 series as in Figure 2.)

Figure 4: Host and time-based false alert detection with varying true time threshold. (Plot: detection rate (%) vs. true range threshold of 30-180 minutes, for the same four P1/P2 series.)

4.4 Host and Time-Based Adjustment

This experiment compares the basic adjustment method with a method that uses the time-based pattern for false alert detection and selects the servers that experience true alerts. Figure 3 shows the false alert detection rates of P1 and P2 for the two methods as the training set increases. In general, the host-and-time-based scheme shows higher false alert detection than the time-based scheme (see Figure 2). Similarly, the host-and-time-based scheme shows higher false alert detection rates than the host-based scheme. Based on the experiments described above, P1 can be safely tuned by Polygraph with no human interaction, but P2 needs to be shown to the system administrator before deployment.

4.5 Impact of True Time Range Threshold

This experiment evaluates the impact of the true time range threshold on the rate of false-alert detection when applied to the host-and-time-based scheme. The experiment uses the 10-day training dataset and varies the threshold value from 30 to 180 minutes.

The results are illustrated in Figure 4. As the threshold value increases and the true time ranges of P1 and P2 get larger, the rate of false-alert detection decreases.

4.6 Discussion

Based on our experiments, we identify extensions that can improve the effectiveness of Polygraph in reducing false alerts through the automated tuning of alert policies.

Leverage operational data for tuning alert policies. The analysis of monitoring event data demonstrates the need to leverage operational data in the tuning of alert policies. For instance, Polygraph integrates operational data that describes when scheduled maintenance activities, such as anti-virus scans and backups, are expected to generate significant workload on the managed servers. Our analysis shows that 20.3% of one customer's alerts are due to virus scans that caused higher CPU usage than the normal state. In order to eliminate these false alerts, one has to exploit operational data and include in the new policies predicates related to the execution context. Another relevant type of operational data includes SLA specifications and attainment statistics. One can exploit SLA information to delay the generation of alerts when workload varies significantly but the SLAs are unlikely to be missed.
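As an illustration of context-aware predicates, the sketch below suppresses alerts that fall inside a scheduled maintenance window; the schedule format is an assumption, and policy_fires stands for any policy predicate such as the one sketched in Section 3.1.

    def in_maintenance_window(alert_time, windows):
        # windows: list of (start, end) datetimes taken from the maintenance schedule
        return any(start <= alert_time <= end for start, end in windows)

    def should_alert(policy_fires, kpis, alert_time, maintenance_windows):
        if in_maintenance_window(alert_time, maintenance_windows):
            return False          # expected maintenance workload (e.g. virus scan): suppress
        return policy_fires(kpis)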

Emphasize more recent history. When a long event history is available, as is typical in production environments, false-alert detection is likely to exhibit poor quality if all data samples receive equal weight in the analysis. This is because the detector fails to recognize the most recent trends in the occurrence of false alerts. In order to improve decision quality, a weighting scheme can be employed to emphasize recent input. Weights should be chosen carefully such that discarding old content does not cause true alerts to be missed.
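One simple weighting scheme of the kind suggested here is exponential decay by alert age; the half-life below is an assumed tuning parameter, not a value derived from our experiments.

    def recency_weight(age_days, half_life_days=14.0):
        # Weight halves every half_life_days; recent alerts dominate the counts.
        return 0.5 ** (age_days / half_life_days)

    def weighted_counts(alerts, now):
        w_true = w_false = 0.0
        for a in alerts:                      # alerts carry "time" and "is_true" fields
            w = recency_weight((now - a["time"]).days)
            if a["is_true"]:
                w_true += w
            else:
                w_false += w
        return w_true, w_false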

Scalable policy deployment. Polygraph can generate new policies for a server profile that matches a very large number of servers. In order to avoid disrupting the monitoring infrastructure, the policy deployment should be staged. The staging order and timeline are based on the assessment of server-specific risks and costs of delaying the policy deployment.

Impact of change operations. The behavior of a managed system changes over time due to infrastructure changes to the server itself (e.g., hardware, software) or to the environment in which the server is running (e.g., workload). As a result, the monitoring policy becomes outdated. Integration of service management data that describes change operations, such as new patch/software installations, memory expansions, and subnet changes, should be used to trigger analysis for policy tuning and to determine the relevant historical content to consider.

Leverage server similarity. Our experiments reveal the potential benefit of grouping similar servers for alert policy tuning. This is helpful in cases when the training dataset collected on an individual server does not have sufficient data points for rare events; grouping similar servers provides a better training dataset and hence better policy tuning. A sample similarity criterion is membership in the same web-application cluster, where all servers have the same server configuration and the same workload characteristics. For more general cases, Polygraph must use a server similarity measure that integrates server resource profiles and workload characteristics.

5 Conclusion

This paper introduces Polygraph, a framework for the automated reduction of false alerts in large-scale IT infrastructures based on an active-learning approach. Polygraph mines historical incident ticket content to learn from SAs which alerts are false and correlates this information with other service management content, such as system vitals and server similarities, to modify threshold-based alert policies and eliminate false alerts. In experiments with real-life traces from a large and diverse service delivery environment, Polygraph performs very well in reducing false alerts while not missing true alerts. Polygraph achieves this by combining host- and time-based tuning of alert policies.

In future work, we plan to extend the Polygraph model for the automated generation of new policies by including conditions on execution context and methods for the automatic identification of periods when maintenance processes generate temporary spikes in resource consumption. Furthermore, for improved quality of false-alert reduction, we plan to develop methods that emphasize recent history.

References

[1] Bodik, P., Goldszmidt, M., Fox, A., Woodard, D. B., and Andersen, H. Fingerprinting the datacenter: automated classification of performance crises. In EuroSys (2010).

[2] D. Breigand, E. H., and Shehory, O. Automated and Adaptive Threshold Setting: Enabling Technology for Autonomy and Self-Management. In ICAC (2005).

[3] Drummond, C., and Holte, R. Learning to Live with False Alarms. In Proceedings of the KDD Data Mining Methods for Anomaly Detection Workshop (2005).

[4] Hewlett-Packard Company. HP ServiceCenter software. http://www.openview.hp.com/products/ovsc/index.html (2011).

[5] IBM Corporation. Tivoli Monitoring. http://www-01.ibm.com/software/tivoli/products/monitor/ (2010).

[6] IBM Corporation. Tivoli Software. http://www-01.ibm.com/software/tivoli/ (2010).

[7] Kim, S., Cheng, W., Guo, S., Luan, L., Rosu, D., and Bose, A. Polygraph: System for dynamic reduction of false alerts in large-scale IT service delivery environments. IBM Research Technical Report (Apr. 2011).

[8] N. Jan, S. Lin, S. T., and Lin, N. A decision support system for constructing an alert classification model. Expert Systems with Applications, Vol. 36, No. 8 (Oct. 2009).

[9] Soleimani, M., and Ghorbani, A. Critical Episode Mining in Intrusion Detection Alerts. In CNSR (2008).