The Dark Art of Building a Production Incident System
@Alois Reitbauer
Tech. Evangelist & Product Mgr., Compuware
No broken cables
No data center fires
Other things can happen as well
Continuous deployments
Infrastructure changes
other “everyday” stuff
Scaling an incident system
How it feels to do what we do
Do you alert?
Typical error rate of 3 percent at 10,000 transactions/min
During the night we now have 5 errors in 100 requests.
Do you alert?
Typical response time has been around 300 ms.
Now we see response times up to 600 ms.
We are good at fixing problems, but not really good at detecting them.
How can we get better?
It's all about statistics
Statistics is about objectively lying to yourself in a meaningful way.
How to design an incident
How to calculate this value?
It looks really simple
Which metric to pick?
How to get this baseline?
How to define that this happened?
Which metrics to pick?
Three types of metrics
Capacity Metrics: Define how much of a resource is used.
Discrete Metrics: Simple countable things, like errors or users.
Continuous Metrics: Metrics represented by a range of values at any given time.
Capacity Metrics
Good for capacity planning, not so good for production alerting
Connection Pools
Better use: connection acquisition time. It tells you whether anyone needed a connection and did not get it.
CPU Usage
Better use: a combination of load average and CPU usage. Even better, correlate them with the response times of applications.
Discrete Metrics
Pretty easy to track and analyze.
Continuous Metrics
Require some extra work as they are not that easy to track.
Continuous Metrics – The hope
42
Continuous Metrics – The reality
What the average tells us
What the median tells us
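The difference between the two can be illustrated with a small sketch. The response times below are hypothetical values chosen for illustration, not data from the talk: most requests are fast and two are very slow.

```python
from statistics import mean, median

# Hypothetical response times in ms: eight fast requests and two slow outliers.
response_times_ms = [180, 200, 210, 220, 250, 260, 280, 300, 2500, 4000]

avg = mean(response_times_ms)    # pulled upwards by the two outliers
med = median(response_times_ms)  # stays close to what a typical request sees

print(f"average: {avg:.0f} ms")
print(f"median:  {med:.0f} ms")
```

With this data the average lands far above anything a typical request experienced, while the median stays in the range of the fast requests.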
How to get a baseline?
A baseline is not a number. Baselines define the range of a value combined with a probability.
Normal distribution as baseline
Mean: 500 ms, Std. Dev.: 100 ms
68 %: 400 ms – 600 ms
95 %: 300 ms – 700 ms
99.7 %: 200 ms – 800 ms
[Figure: normal distribution curve over response times from 100 ms to 900 ms]
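The ranges above follow directly from the mean and standard deviation. A minimal sketch, assuming the response times really are normally distributed (the function name `band` is made up for this example):

```python
from statistics import NormalDist

# Baseline from the slide: mean 500 ms, standard deviation 100 ms.
baseline = NormalDist(mu=500, sigma=100)

def band(k):
    """Return (low, high, coverage) for the range mean +/- k standard deviations."""
    low = baseline.mean - k * baseline.stdev
    high = baseline.mean + k * baseline.stdev
    coverage = baseline.cdf(high) - baseline.cdf(low)
    return low, high, coverage

for k in (1, 2, 3):
    low, high, coverage = band(k)
    print(f"{coverage:6.1%} of values fall between {low:.0f} ms and {high:.0f} ms")
```

The baseline is therefore a set of ranges, each with a probability attached, rather than a single threshold.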
This can go really wrong
“Why alerts suck and monitoring solutions need to become better”
How this leads to false alerts
Many false alerts
Aggressive Baseline
No alerts at all
Moderate Baseline
Find the right distribution model
However, this can be really hard to impossible
Your distribution might look like this
… or like this
or completely different, you never know …
How can we solve this problem?
Normal distribution - again
50 Percent slower than μ
97.7 Percent at or below μ + 2σ (only about 2.3 percent are slower)
Median
97th Percentile
The 50th and 90th percentile define normal behavior without needing to know anything about the distribution model
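Percentiles can be read straight from the measured samples, which is why no distribution model is needed. A minimal sketch using the nearest-rank method; the sample values and the helper name `percentile` are assumptions for illustration, and other percentile definitions interpolate between values and give slightly different results.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p percent of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical baseline window of response times in ms.
window_ms = [120, 250, 280, 290, 300, 310, 330, 360, 450, 900]

p50 = percentile(window_ms, 50)
p90 = percentile(window_ms, 90)

print(f"50th percentile (median): {p50} ms")
print(f"90th percentile:          {p90} ms")
```

The single 900 ms outlier does not move either value, which is the property that makes percentiles usable as a baseline.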
Median shows the real problem
How to define non-normal behavior?
Fortunately this is not the problem we need to solve. We are only talking about missed expectations.
Let’s look at two scenarios
Errors
Is a certain error rate likely to happen or not?
Response Times
Is a certain increase in response time significant
enough to trigger an incident?
The error rate scenario
We have a typical error rate of 3 percent at 10,000 transactions/minute
During the night we now have 5 errors in 100 requests. Should we alert – or not?
What can we learn
Statistics is everywhere
Binomial Distribution
Tells us how likely it is to see n successes in a certain number of trials
How many errors are ok?
[Chart: Likeliness of at least n errors, with n from 1 to 19 on the x-axis and the probability in percent on the y-axis]
There is an 18 % probability of seeing 5 or more errors, which is within two standard deviations of the expected value. We do not alert.
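The 18 percent figure can be reproduced with the binomial distribution. A minimal sketch, assuming each request fails independently with the typical error rate of 3 percent (the function name `prob_at_least` is made up for this example):

```python
from math import comb, sqrt

def prob_at_least(k, n, p):
    """Probability of at least k failures in n independent trials with failure rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n, p, observed = 100, 0.03, 5            # 5 errors in 100 requests, 3 % typical rate

tail = prob_at_least(observed, n, p)     # chance of seeing 5 or more errors
expected = n * p                         # expected number of errors
std_dev = sqrt(n * p * (1 - p))          # standard deviation of the error count
within_two_sigma = observed <= expected + 2 * std_dev

print(f"P(at least {observed} errors) = {tail:.1%}")
print(f"expected {expected:.0f} errors, std. dev. {std_dev:.2f}, alert: {not within_two_sigma}")
```

Using "within two standard deviations" as the cut-off follows the slide; the exact cut-off is a tuning decision.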
Response Time Example
Our median response time is 300 ms
and we measure
200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
Percentile Drift Detection
Did the median drift significantly?
Check all values above 300 ms
200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
7 values are higher than the median. Is this normal?
We can again use the Binomial Distribution
Applying the Binomial Distribution
We have a 50 percent likelihood of seeing a value above the median.
How likely is it that 7 or more out of 10 samples are higher?
The probability is 17 percent, so we should not alert.
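The drift check can be written down in a few lines. A minimal sketch using the ten measurements from the slide, assuming the samples are independent; the variable names are chosen for this example.

```python
from math import comb

baseline_median_ms = 300
samples_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]

n = len(samples_ms)
above = sum(1 for s in samples_ms if s > baseline_median_ms)

# If the median has not moved, each sample is above it with probability 0.5,
# so the count of samples above the median follows Binomial(n, 0.5).
p_drift = sum(comb(n, i) for i in range(above, n + 1)) / 2**n

print(f"{above} of {n} samples are above {baseline_median_ms} ms")
print(f"P(at least {above} above the median) = {p_drift:.1%}")
```

A result this likely under the unchanged baseline gives no reason to open an incident. The same check works for any other percentile by replacing 0.5 with the matching probability.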
How to calculate this value?
… and we are done!
Which metric to pick?
How to get this baseline?
How to define that this happened?
This was just the beginning
There are many more useful things about statistics, probabilities, testing, ….
Alois Reitbauer
[email protected]
@AloisReitbauer
apmblog.compuware.com
Image Credits
http://commons.wikimedia.org/wiki/File:Network_switches.jpg
http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg