How to do monitoring that won't make your engineers quit

Relaxing picture of Yoga

hunt through logs for 2 hours

Monitoring that will make your engineers give up

Gil Zellner (CloudifyDev at Gigaspaces)

Twitter: @Heathenaspargus

The cost of hiring a new employee is 1.5-3x their monthly salary

Easy (days)

- no changes to infrastructure
- just policy

Intermediate (months)

- Small changes to apps
- logging
- light automation

Hard (years)

- Design for better operability
- long term

Frustration: I am unable to complete my task

Time spent inefficiently

https://www.ergoflex.co.uk/blog/category/sleep-research/sleeponomics-could-sleep-deprivation-be-the-real-reason-politicians-make-bad-decisions

Mandatory half day off after a night-time production issue

Allocate weekly time to resolve or automate issues that kept us up at night

Wider rotation (more people do on-call)

https://www.youtube.com/watch?v=IUoEiDT1nXY

Creating a DevOps Culture: Identifying a “Single Person of Failure”

Knowledge Matrix

Deploy System Mobile Link Backend

Gil V V

Karen V V

Ari V V
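A knowledge matrix like this can be checked mechanically for a "single person of failure". The sketch below is only an illustration: the cell positions of the original slide were lost, so the person-to-system assignments and the system names (`deploy`, `mobile`, `backend`) are invented, and `single_persons_of_failure` is a hypothetical helper.

```python
# Hypothetical matrix: which systems each person knows.
# These assignments are made up for illustration, not taken from the slide.
KNOWLEDGE = {
    "Gil": {"deploy", "backend"},
    "Karen": {"mobile", "backend"},
    "Ari": {"deploy", "backend"},
}


def single_persons_of_failure(knowledge):
    """Return {system: person} for every system that only one person knows."""
    owners = {}
    for person, systems in knowledge.items():
        for system in systems:
            owners.setdefault(system, []).append(person)
    return {system: people[0] for system, people in owners.items() if len(people) == 1}


print(single_persons_of_failure(KNOWLEDGE))  # {'mobile': 'Karen'}
```

Any system that shows up in the result is a candidate for pairing or documentation before that person goes on holiday.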

Intermediate (months): small changes to apps, logging, light automation

Solution: alert only on things that meet the following criteria:

1) Alert on symptoms, not suspected "causes"

2) Actionable

3) Business breaking
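The three criteria can be written down as an explicit gate that every paging rule has to pass. This is a minimal sketch, assuming each rule is tagged by hand during review; `AlertRule`, `should_page` and the two example rules are hypothetical and not taken from any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    is_symptom: bool            # a user-visible effect, not a guessed cause
    is_actionable: bool         # someone can do something about it right now
    is_business_breaking: bool  # the business is hurt while it keeps firing


def should_page(rule: AlertRule) -> bool:
    """A rule may wake someone up only if it meets all three criteria."""
    return rule.is_symptom and rule.is_actionable and rule.is_business_breaking


# Example rules, invented for illustration.
rules = [
    AlertRule("checkout error rate above 5%", True, True, True),
    AlertRule("CPU above 90% on one host", False, False, False),
]
paging = [rule.name for rule in rules if should_page(rule)]
print(paging)  # ['checkout error rate above 5%']
```

Rules that fail the gate do not have to be deleted; they can go to a dashboard or a ticket queue instead of a pager.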

Solution: direct alerts to relevant parties
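One simple way to direct alerts is a lookup from the affected service to the team that owns it, with a fallback for anything unowned. The sketch below assumes alerts arrive as dictionaries with a `service` key; the team names and the `route` helper are hypothetical.

```python
# Hypothetical ownership table: service name -> on-call rotation.
ROUTES = {
    "mobile": "mobile-oncall",
    "backend": "backend-oncall",
}
DEFAULT_ROUTE = "ops-oncall"  # assumed fallback for unowned or untagged alerts


def route(alert: dict) -> str:
    """Return the rotation that should receive this alert."""
    return ROUTES.get(alert.get("service"), DEFAULT_ROUTE)


print(route({"service": "mobile", "summary": "login failures"}))  # mobile-oncall
print(route({"summary": "no service tag"}))                       # ops-oncall
```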

Companies that are doing this as a service:

Picking the right things to measure

Netflix stream starts per second

What are your KPIs?

- Stream starts per second

- Taxi orders per minute

- API calls per second
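A rate KPI of this kind can be computed from raw event timestamps and compared with a known-good baseline. This is a rough sketch only: the function names, the fixed window and the 50% tolerance are assumptions, and a real system would derive the baseline from history.

```python
def rate_per_second(event_times, window_start, window_seconds):
    """Events per second inside [window_start, window_start + window_seconds)."""
    window_end = window_start + window_seconds
    count = sum(1 for t in event_times if window_start <= t < window_end)
    return count / window_seconds


def kpi_dropped(current, baseline, tolerance=0.5):
    """True when the current rate is below tolerance * baseline."""
    return current < baseline * tolerance


# Timestamps in seconds, invented for illustration.
stream_starts = [0.1, 0.5, 1.2, 1.9, 2.5, 9.0]
current = rate_per_second(stream_starts, window_start=0, window_seconds=2)
print(current)                             # 2.0
print(kpi_dropped(current, baseline=10.0))  # True
```

Alerting on a drop in the business KPI is an instance of alerting on a symptom: it fires whatever the underlying cause turns out to be.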

Companies that are doing this as a service:

Auto-remediation basics

1) Make remediation script

2) Make diagnosis script

3) Connect them
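The three steps can be sketched as two small functions and a table that connects them. Everything here is hypothetical: the disk-usage threshold, the `disk_full` problem name and the placeholder remediation, which only records that it ran instead of touching the system.

```python
actions_taken = []  # record of remediations, so the sketch has no side effects


def clean_temp_files():
    """Step 1, remediation script (placeholder: only logs the action)."""
    actions_taken.append("clean_temp_files")


def diagnose_disk(usage_percent):
    """Step 2, diagnosis script: return a problem name, or None if healthy."""
    return "disk_full" if usage_percent >= 90 else None


# Step 3, connect them: map each diagnosed problem to its remediation.
REMEDIATIONS = {"disk_full": clean_temp_files}


def diagnose_and_heal(usage_percent):
    """Run the diagnosis and, if a problem is found, its remediation."""
    problem = diagnose_disk(usage_percent)
    if problem is None:
        return False
    REMEDIATIONS[problem]()
    return True


print(diagnose_and_heal(95))  # True
print(diagnose_and_heal(40))  # False
print(actions_taken)          # ['clean_temp_files']
```

Keeping diagnosis and remediation separate means each one can be run and tested by hand before they are wired together.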

Heal Workflows - Cloudify

Hard (years): design for better operability, long term

Incentive for resilient architecture

Downtime allowed per year at each uptime level:

0.99 uptime: 87.6 hours of downtime per year

0.999 uptime: 8.76 hours of downtime per year

0.9999 uptime: 52.6 minutes of downtime per year

0.99999 uptime: 5.3 minutes of downtime per year
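The figures above follow from multiplying the unavailable fraction by the hours in a year. A minimal sketch, assuming a 365-day year (8,760 hours); `downtime_hours_per_year` is a name chosen here for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability):
    """Hours of downtime per year permitted by an availability such as 0.999."""
    return (1 - availability) * HOURS_PER_YEAR


for availability in (0.99, 0.999, 0.9999, 0.99999):
    hours = downtime_hours_per_year(availability)
    print(f"{availability}: {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the budget by a factor of ten, which is why the higher levels are hard to reach with manual response alone.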

Bad artists copy, great artists steal

Email: Gil.Zellner@gmail.com

Twitter: @Heathenaspargus