Relaxing picture of Yoga
hunt through logs for 2 hours
Monitoring that will make your engineers give up
Gil Zellner (CloudifyDev at Gigaspaces)
Twitter: @Heathenaspargus
Who am I?
Now:
Past:
The cost of hiring a new employee is 1.5-3x their monthly salary
Easy (days)
- no changes to infrastructure
- just policy
Intermediate (months)
- small changes to apps
- logging
- light automation
Hard (years)
- design for better operability
- long term
Frustration: I am unable to complete my task
Time spent inefficiently
Repetitive tasks
Working Alone
Yak Shaving
https://www.ergoflex.co.uk/blog/category/sleep-research/sleeponomics-could-sleep-deprivation-be-the-real-reason-politicians-make-bad-decisions
Mandatory half day off after a nighttime production issue
Allocate weekly time to resolve or automate issues that kept us up at night
Wider rotation (more people do on-call)
https://www.youtube.com/watch?v=IUoEiDT1nXY
Creating a DevOps Culture: Identifying a “Single Person of Failure”
Knowledge Matrix
Deploy System Mobile Link Backend
Gil V V
Karen V V
Ari V V
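One way to put a matrix like this to work is to scan it for areas that only one person knows. The sketch below is a minimal illustration: the names come from the slide, but the area labels and who covers what are assumptions, since the original check marks cannot be matched to columns here.

```python
from collections import defaultdict

# Hypothetical assignments: the names come from the slide, but which areas
# each person covers is an assumption made for illustration.
knowledge = {
    "Gil": {"deploy", "backend"},
    "Karen": {"mobile", "backend"},
    "Ari": {"deploy", "link"},
}


def single_points_of_failure(matrix):
    """Return {area: person} for every area that exactly one person knows."""
    owners = defaultdict(list)
    for person, areas in matrix.items():
        for area in areas:
            owners[area].append(person)
    return {area: people[0] for area, people in owners.items() if len(people) == 1}


if __name__ == "__main__":
    for area, person in sorted(single_points_of_failure(knowledge).items()):
        print(f"{area}: only {person} knows this")
```

Any area this reports is a candidate for pairing, documentation, or adding a second person to the rotation.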
Easy (days)
- no changes to infrastructure
- just policy
Intermediate (months)
- small changes to apps
- logging
- light automation
Hard (years)
- design for better operability
- long term
Solution: alert only on things that meet the following criteria:
1) Alert on symptoms, not suspected "causes"
2) Actionable
3) Business breaking
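The three criteria above can be read as a filter that every candidate alert has to pass before it pages anyone. This is a rough sketch, not a real alerting API: the `Alert` class, its field names, and the example alerts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative only."""
    name: str
    is_symptom: bool            # user-visible symptom, not a suspected cause
    is_actionable: bool         # someone can do something about it right now
    is_business_breaking: bool  # the business is actually hurt by it


def should_page(alert: Alert) -> bool:
    """Page only when all three criteria from the slide hold."""
    return alert.is_symptom and alert.is_actionable and alert.is_business_breaking


candidates = [
    Alert("checkout error rate above 5%", True, True, True),
    Alert("CPU at 90% on one host", False, False, False),
    Alert("nightly report delayed", True, True, False),
]
pages = [a.name for a in candidates if should_page(a)]
```

Alerts that fail the filter do not have to disappear; they can go to a dashboard or a ticket queue instead of waking someone up.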
General alert!
Solution: direct alerts to relevant parties
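In its simplest form, directing alerts is a lookup from the service that raised the alert to the team that owns it. The sketch below assumes a hand-maintained routing table; the service and team names are hypothetical.

```python
# Hypothetical routing table: service name -> on-call target that owns it.
ROUTES = {
    "payments": "payments-oncall",
    "mobile-api": "mobile-oncall",
}
# Fallback for services nobody has claimed yet.
DEFAULT_ROUTE = "ops-oncall"


def route_alert(service: str) -> str:
    """Return the on-call target for an alert raised by `service`."""
    return ROUTES.get(service, DEFAULT_ROUTE)
```

Keeping an explicit default makes unowned services visible: anything that lands on the fallback is a prompt to assign an owner.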
Companies that are doing this as a service:
Picking the right things to measure
Netflix stream starts per second
What are your KPIs?
- Stream starts per second
- Taxi orders per minute
- API calls per second
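A per-second KPI like the ones above can be tracked with a sliding window and compared against a baseline. This is a minimal sketch under assumptions: the class and function names are invented here, and the 50% drop threshold is an arbitrary example, not a recommendation from the talk.

```python
from collections import deque


class KpiRate:
    """Events per second over a sliding time window (illustrative helper)."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()

    def record(self, timestamp: float) -> None:
        """Register one business event, e.g. one stream start."""
        self.events.append(timestamp)

    def rate(self, now: float) -> float:
        # Drop events that fell out of the window, then average over it.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) / self.window


def kpi_dropped(current: float, baseline: float, tolerance: float = 0.5) -> bool:
    """True when the KPI falls below `tolerance` times its baseline."""
    return current < baseline * tolerance
```

A drop in the business KPI is a symptom by definition, so it fits the alerting criteria better than host-level metrics do.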
Companies that are doing this as a service:
Make heal script
Auto-remediation basics
1) Make remediation script
2) Make diagnosis script
3) Connect them
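The three steps can be sketched as a small loop: diagnose, look up the matching fix, run it, and diagnose again. Everything here is hypothetical, including the in-memory `state` dict that stands in for a real system and the two example problems.

```python
from typing import Callable, Dict, Optional

# Hypothetical state standing in for a real system.
state = {"disk_used_pct": 97, "service_up": True}


def diagnose() -> Optional[str]:
    """Diagnosis script: return the name of the detected problem, or None."""
    if state["disk_used_pct"] > 90:
        return "disk_full"
    if not state["service_up"]:
        return "service_down"
    return None


def clean_disk() -> None:
    """Remediation script: pretend to free disk space."""
    state["disk_used_pct"] = 40


def restart_service() -> None:
    """Remediation script: pretend to restart the service."""
    state["service_up"] = True


# Step 3: connect each diagnosis to its remediation.
REMEDIATIONS: Dict[str, Callable[[], None]] = {
    "disk_full": clean_disk,
    "service_down": restart_service,
}


def heal(max_attempts: int = 3) -> bool:
    """Diagnose and remediate until healthy; give up after max_attempts."""
    for _ in range(max_attempts):
        problem = diagnose()
        if problem is None:
            return True
        fix = REMEDIATIONS.get(problem)
        if fix is None:
            return False  # unknown problem: escalate to a human
        fix()
    return diagnose() is None
```

Capping the attempts matters: a remediation that keeps firing without fixing anything should end in a page to a human, not an endless loop.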
Facebook Auto Remediation
https://www.facebook.com/notes/facebook-engineering/making-facebook-self-healing/10150275248698920
Heal Workflows - Cloudify
Easy (days)
- no changes to infrastructure
- just policy
Intermediate (months)
- small changes to apps
- logging
- light automation
Hard (years)
- design for better operability
- long term
Incentive for resilient architecture
0.99 uptime: 87.6 hours of downtime per year
0.999 uptime: 8.76 hours of downtime per year
0.9999 uptime: 52.6 minutes of downtime per year
0.99999 uptime: 5.3 minutes of downtime per year
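The figures above follow from multiplying the allowed unavailability by the hours in a year. A short sketch of that arithmetic, assuming a 365-day year (the function name is made up here):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability such as 0.999."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for availability in (0.99, 0.999, 0.9999, 0.99999):
        hours = downtime_hours_per_year(availability)
        print(f"{availability}: {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the higher targets are hard to meet with manual recovery.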
Automated failovers
The AntiFragile organization
https://queue.acm.org/detail.cfm?id=2499552
Bad artists copy, great artists steal
Email: Gil.Zellner@gmail.com
Twitter: @Heathenaspargus