Why dynamic & adaptive thresholds matters

Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

DESCRIPTION

Anders Haal's presentation on using adaptive thresholds with Nagios. The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

Page 1: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Why dynamic & adaptive thresholds matters

Page 2: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

anders håål, ingenjörsbyn ab
[email protected]
@thenodon

Page 3: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Bischeck - dynamic & adaptive thresholds for Nagios

www.bischeck.org

Page 4: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Threshold

Page 5: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

What are the limitations of static thresholds?

✗ Not static
✗ Load varies throughout the day and week
✗ Too many or too few alarms
✗ Collecting and thresholding in the same context
✗ Based only on the current measurement
✗ Do not consider dependencies on other services

Page 6: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

How to make thresholds dynamic & adaptive?

Page 7: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

“Database table size should not be more than 5% bigger than yesterday's max size”

{example 1}

Page 8: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

“Database table size should not be more than 5% bigger than yesterday's max size”

[Chart: table size by hour of day (7 to 23), yesterday vs. today; y-axis from 0 to 200000]

table size < max(yesterday)*1.05

{example 1}
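
A minimal sketch of example 1 in Python, to make the rule concrete. It is not bischeck code; the function name `table_size_ok`, its parameters and the sample figures are assumptions made for illustration.

```python
def table_size_ok(current_size, yesterday_sizes, margin=0.05):
    """Return True while the table is less than `margin` above yesterday's peak.

    `yesterday_sizes` is assumed to hold every size sample collected yesterday.
    """
    threshold = max(yesterday_sizes) * (1 + margin)
    return current_size < threshold


# Hypothetical samples: yesterday peaked at 180000, so the limit is about 189000.
yesterday = [120_000, 150_000, 180_000]
print(table_size_ok(185_000, yesterday))
print(table_size_ok(190_000, yesterday))
```

The threshold moves with the data: if the table legitimately grows today, tomorrow's limit grows with it.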

Page 9: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

“Number of on-line users should not be more than 10% higher than the average number of on-line users for the last 10 data points”

{example 2}

Page 10: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

users < avg(X0, X1, ....., X9)*1.1

Where X0 to X9 are the last 10 historical on-line user data points

{example 2}

“Number of on-line users should not be more than 10% higher than the average number of on-line users for the last 10 data points”
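
Example 2 can be sketched in Python as a sliding window over the last 10 data points. The class name `MovingAverageThreshold`, the choice to accept the very first sample, and the choice to record every sample (including violating ones) are assumptions of this sketch, not something the slides specify.

```python
from collections import deque
from statistics import mean


class MovingAverageThreshold:
    """Adaptive upper limit: `factor` times the average of the last `window` samples."""

    def __init__(self, window=10, factor=1.1):
        self.history = deque(maxlen=window)  # oldest samples drop out automatically
        self.factor = factor

    def check(self, value):
        """Return True if `value` is below the adaptive limit, then record it."""
        ok = True  # assumption: with no history yet, the sample is accepted
        if self.history:
            ok = value < mean(self.history) * self.factor
        self.history.append(value)
        return ok


users = MovingAverageThreshold()
for _ in range(10):
    users.check(100)       # build up ten historical data points
print(users.check(105))    # within 10% of the average
print(users.check(120))    # more than 10% above the average
```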

Page 11: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

{example 3}

“The number of orders with errors should be lower than 5% of the total number of registered orders”

Page 12: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

“The number of orders with errors should be lower than 5% of the total number of registered orders”

{example 3}

Total #orders

#orders with errors < 0.05 * (Total #orders)
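
Example 3 makes the threshold depend on a second measurement rather than on history. A minimal Python sketch, assuming the hypothetical function name `error_ratio_ok` and assuming that zero registered orders should only pass when there are also zero errors:

```python
def error_ratio_ok(orders_with_errors, total_orders, max_ratio=0.05):
    """Return True while failed orders stay below `max_ratio` of all orders."""
    if total_orders == 0:
        # Assumption: with nothing registered, any error at all is a violation.
        return orders_with_errors == 0
    return orders_with_errors < max_ratio * total_orders


print(error_ratio_ok(40, 1000))   # 40 is below 5% of 1000
print(error_ratio_ok(60, 1000))   # 60 is above 5% of 1000
```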

Page 13: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

“Message queue size should be above the defined Friday threshold profile”

 

{example 4}

[Chart: count by time of the day]

Page 14: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

“Message queue size should be above the defined Friday threshold profile”

 

{example 4}

[Charts: two panels, each plotting count by time of the day]

Page 15: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

How to make thresholds dynamic & adaptive?

✗ Time profiles
✗ Historical data points
✗ Math and statistical operations

Page 16: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

We did not want a check_XYZ hack 

We wanted a tool

Page 17: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Collecting

Page 18: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Thresholding

Collecting

Separation

Page 19: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Historical data

Collecting

Page 20: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Historical data

Logic

Collecting

Page 21: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Historical data

Logic

Collecting

Day profile

Page 22: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Historical data

Calendar

Logic

Collecting

Day profile

Page 23: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Historical data

Calendar

Logic

Collecting

Day profile

Nagios

Page 24: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Historical data

Calendar

Logic

Collecting

Day profile

Scheduling

Server Interface

Nagios, OpenTSDB, XYZ

Page 25: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters
Page 26: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters
Page 27: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

bischeck basics

● Configuration like Nagios – host, service but also service item
● Host is just a container for the rest
● Service specifies the connection and scheduling
● Service item specifies the “query” and the threshold class to use
● Host and service name must be the same as in the Nagios configuration

Page 28: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Threshold – 24 hour day profile 

● Divide the day into 24 hour points, where every point can be:
    ● Static value
    ● Dynamic value
        – Math expression on a single value or a range of data from the cache
        – Based on cached data points retrieved by
            ● Index – single value or index range
            ● Time – single value (closest) or time range (between)
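
The two retrieval styles listed above, by index and by time, can be sketched in Python as follows. This is not bischeck's cache implementation; the class name `Cache`, its method names, the newest-first ordering and the inclusive ranges are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


class Cache:
    """Newest-first list of (timestamp, value) samples for one service item."""

    def __init__(self, samples):
        self.samples = sorted(samples, key=lambda s: s[0], reverse=True)

    def by_index(self, start, end=None):
        """Index 0 is the latest sample; a range is inclusive at both ends."""
        if end is None:
            return self.samples[start][1]
        return [value for _, value in self.samples[start:end + 1]]

    def closest(self, when):
        """Single value: the sample whose timestamp is nearest to `when`."""
        return min(self.samples, key=lambda s: abs(s[0] - when))[1]

    def between(self, newest, oldest):
        """Time range: all values with oldest <= timestamp <= newest."""
        return [value for t, value in self.samples if oldest <= t <= newest]


# Hypothetical data: one sample every 10 minutes, the latest at 14:00.
now = datetime(2012, 9, 25, 14, 0)
cache = Cache([(now - timedelta(minutes=10 * i), 100 + i) for i in range(8)])
latest = cache.by_index(0)
last_half_hour_to_hour = cache.between(now - timedelta(minutes=30),
                                       now - timedelta(minutes=60))
print(latest, sum(last_half_hour_to_hour) / len(last_half_hour_to_hour))
```

A math expression in a threshold profile then only needs to combine such lookups with functions like an average.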

Page 29: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

....

<!-- 12:00 Static -->
<hour>7000</hour>

....

Page 30: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

....

<!-- 12:00 Static -->
<hour>7000</hour>

<!-- 13:00 Adaptive -->
<hour>erpserver-orders-ediOrders[0] / 3</hour>

....

Page 31: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

....

<!-- 12:00 Static -->
<hour>7000</hour>

<!-- 13:00 Adaptive -->
<hour>erpserver-orders-ediOrders[0] / 3</hour>

<!-- 14:00 Adaptive with math function -->
<hour>avg(erpserver-orders-ediOrders[-30M:-60M]) / 2</hour>

....

Page 32: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Threshold – 24 hour day profile 

Between every two “full” hours a linear equation is calculated, so the threshold at any minute lies on the straight line between the two hour values

[Chart: day profile threshold by hour (0 to 23); y-axis from 0 to 60000]
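
The linear equation between two full hours amounts to linear interpolation. A minimal Python sketch, assuming a hypothetical function `threshold_at` and a profile given as a list of 24 hour values; wrapping from hour 23 back to hour 0 is an assumption of this sketch.

```python
def threshold_at(profile, hour, minute):
    """Interpolate linearly between the values of two consecutive full hours.

    `profile` is a list of 24 numbers, one threshold per full hour.
    """
    start = profile[hour]
    end = profile[(hour + 1) % 24]  # assumption: 23:xx interpolates towards 00:00
    return start + (end - start) * minute / 60


# Hypothetical profile: 7000 at 12:00 rising to 9000 at 13:00.
profile = [0] * 24
profile[12] = 7000
profile[13] = 9000
print(threshold_at(profile, 12, 30))  # halfway between the two hour values
```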

Page 33: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Threshold – 24 hour day profile 

● Connect calendar to the day profile and evaluate according to the following order:

1. Month and day of month

2. Week and day of week

3. Day in month

4. Day in the week

5. Month

6. Week

● Holiday – exception days  
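
The evaluation order above is a first-match lookup from the most specific rule to the least specific. The Python sketch below illustrates that idea only; the function `select_profile`, the rule dictionary layout, the use of ISO week numbers and the way holidays short-circuit the lookup are assumptions, not bischeck's actual configuration or behaviour.

```python
from datetime import date


def select_profile(day, rules, holidays=(), default=None):
    """Return the name of the first matching day profile, most specific rule first."""
    if day in holidays:
        return "holiday"  # assumption: exception days bypass the normal rules
    week = day.isocalendar()[1]
    weekday = day.isoweekday()  # 1 = Monday ... 7 = Sunday
    keys = [
        ("month_day", (day.month, day.day)),   # 1. month and day of month
        ("week_weekday", (week, weekday)),     # 2. week and day of week
        ("day_of_month", day.day),             # 3. day in month
        ("day_of_week", weekday),              # 4. day in the week
        ("month", day.month),                  # 5. month
        ("week", week),                        # 6. week
    ]
    for kind, key in keys:
        profile = rules.get(kind, {}).get(key)
        if profile is not None:
            return profile
    return default


# Hypothetical rules: a Friday profile wins over a whole-month profile.
rules = {
    "day_of_week": {5: "friday"},
    "month": {9: "september"},
    "month_day": {(12, 24): "xmas-eve"},
}
print(select_profile(date(2012, 9, 28), rules))  # a Friday in September
print(select_profile(date(2012, 9, 26), rules))  # a Wednesday in September
```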

Page 34: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

And more....

● Multi-threaded and multi-scheduling schema per service
    ● interval
    ● cron
● Data collection – jdbc, livestatus, internal cache
● Virtual services
● Date macros in execution statements
● Customize
    ● connection (service classes)
    ● execution (service item classes)
    ● thresholds (threshold classes)
    ● server integration (server classes)
● XML configuration supported with WEBui (beta)
● GPL 2 license

Page 35: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Future

● Improved time series database
● Patterns/baselines
● More statistic functions
● “Sensors” - alarms on multiple/aggregated data points
● Any ideas?

Page 36: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Infrastructure monitoring

Application performance monitoring [APM]

Business activity monitoring [BAM]

Operational Business intelligence [OBI]

Page 37: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Questions & Feedback 

Pictures – Creative Commons
www.flickr.com/photos/loneprimate/4017405677
www.flickr.com/photos/catatronic/2397319483
www.flickr.com/photos/dtrimarchi/6815004766
www.flickr.com/photos/bikeracer/6740232