fastly
Incident Management At The Edge
Lisa Phillips, VP, SRE
@lisaphillips [email protected]
Incident Types
• Distributed Denial of Service attacks
• Critical security vulnerabilities
• Software bugs
• Capacity concerns
• Upstream network outages
• Datacenter issues
• “Operator Error”
• Third Party service provider events
Incident Response Framework
1. Develop definitions of impact
2. Define severity
3. Define response and communication requirements
4. Define post-incident activities
Severity levels
Severity | App Delivery Impact | Business Operations Impact | Scope of Impact
SEV0 | Critical | Critical | All Sites Affected
SEV1 | Critical | Critical | Multiple Sites Affected, or Single Site unavailable or suffering from severe degradation
SEV2 | Major | Major | Multiple Sites Affected, Single Site intermittently available or suffering from minor degradation
SEV3 | Minor | Minor | Single Site or limited customer impact
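The severity table above could be encoded as data so that alerting or paging tools can look levels up. The sketch below is only an illustration: the `Severity` enum, the `SEVERITY_TABLE` mapping, and the `classify` helper are hypothetical names, not part of the talk. The rule that separates SEV0 from SEV1 by whether all sites are affected is an assumption read off the "Scope of Impact" column.

```python
from enum import Enum


class Severity(Enum):
    SEV0 = "SEV0"
    SEV1 = "SEV1"
    SEV2 = "SEV2"
    SEV3 = "SEV3"


# Rows copied from the severity table:
# (app delivery impact, business operations impact, scope of impact)
SEVERITY_TABLE = {
    Severity.SEV0: ("Critical", "Critical", "All Sites Affected"),
    Severity.SEV1: (
        "Critical",
        "Critical",
        "Multiple Sites Affected, or Single Site unavailable "
        "or suffering from severe degradation",
    ),
    Severity.SEV2: (
        "Major",
        "Major",
        "Multiple Sites Affected, Single Site intermittently available "
        "or suffering from minor degradation",
    ),
    Severity.SEV3: ("Minor", "Minor", "Single Site or limited customer impact"),
}


def classify(impact: str, all_sites_affected: bool = False) -> Severity:
    """Hypothetical helper: map an impact level to a severity.

    Assumption: a critical impact is SEV0 only when every site is
    affected; otherwise it is SEV1. Major maps to SEV2, minor to SEV3.
    """
    level = impact.strip().lower()
    if level == "critical":
        return Severity.SEV0 if all_sites_affected else Severity.SEV1
    if level == "major":
        return Severity.SEV2
    if level == "minor":
        return Severity.SEV3
    raise ValueError(f"unknown impact level: {impact!r}")
```

For example, `classify("critical", all_sites_affected=True)` returns `Severity.SEV0`, and `SEVERITY_TABLE[Severity.SEV3]` gives back the row for limited customer impact. A real deployment would likely need more inputs than these two. The point is only that the definitions are written down once and shared.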
The people
• Engineers are human
• Engage the right people at the right time
• Randomization is expensive
The Situation: No NOC
• Immediate escalation to engineers is required
• Global Customer Service Focused Engineer
• Decentralized on-call and monitoring
• Engineering teams own their own destiny and have control over their alerts
• Empowered to improve
Incident Commander
• Coordinates actions across multiple responders
• Alerts and updates stakeholders
• Evaluates the high-level issue and understands its impact
• Consults with team experts on necessary actions
• Calls off or delays other activities
• Clearly indicates when an incident is resolved and leads follow-up
Communication: Wide Collaboration
• Involve all parts of the organization
• Quick communication and collaboration - establish next steps and regroup time
• Identify quickly the questions to answer, and communicate effectively to address them
• When in doubt, choose transparency
Lessons
• Start with the basics
• Empower your engineers to deal effectively with workload
• Exercise your engineers
• Continuously improve
• Partner closely with all stakeholders
• Continue to let incidents teach you
Thank you
Lisa Phillips, VP, SRE
@lisaphillips [email protected]