@JPMALEK Retrospective from a startup built in the cloud : top 3 big lessons from the AWS outage on 04.21.2011 plus 4,369 other smaller ones 06/18/2022 1

Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

DESCRIPTION

All about the April 2011 AWS outage, its causes, effects and ways to mitigate the same sorts of issues in the future.

Page 1: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

@JPMALEK

04/13/2023 1

Retrospective from a startup built in the cloud : top 3 big lessons from the AWS outage on 04.21.2011, plus 4,369 other smaller ones

Page 2: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

What a country : entrepreneurial resiliency

Page 3: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

“robust systems: highly fault-tolerant, on or off grid. eg: our culture wrt entrepreneurs, AWS, the BD API”

(true story)

Page 4: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Boom

Page 5: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

me: previous startup

teams in 3 countries

highly transactional system

MS tech : IIS/MS SQL Server

co-located, leased/owned hardware

0% in cloud

$75M/year revenue

Page 6: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

me : current startup

systems 100% on AWS

99% free/open-source software

standing on the shoulders of giants

Page 7: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

What Happened

Regions and Zones

US-WEST : zones A, B, C, D

US-EAST : zones A, B, C, D

Page 8: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

What Happened in us-east

It’s all about the EBS (Elastic Block Store). Apologies for the artistic license, AWS.

Region : US-EAST

Zones : A, B, C, D

Control plane services

EBS Cluster

Page 9: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

What Happened in us-east

It’s all about the EBS (Elastic Block Store). Apologies for the artistic license, AWS.

EBS Cluster : ‘re-mirroring storm’

Control plane services : thread-starved

Regional API brown-out

Region/Zones

Page 10: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

fault tolerance: 3 to 47 important failearnings

and 4,369 less important ones

Page 11: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

in the context of our startup, of course

YMMV depending on velocity

Page 12: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Ruger

Page 13: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

The Ruger Fault Equivalency

time = money

fault tolerance = time² - risk tolerance

Also known as:

‘Fast, good and cheap : pick two’

Page 14: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

system design philosophy:

leverage proven, open-source tech in the cloud

to build a scalable, reliable, secure operational foundation

quickly

Page 15: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

So how do you achieve the right level of fault tolerance in the cloud?

3 tenets

Page 16: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1 : Scripted Repeatability

Tenet #2 : SPOF Elimination

Tenet #3 : Clear-Cut Communication

Page 17: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1

prepare a fault-tolerant foundation with scripted repeatability

aka automation

Page 18: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1 : scripted repeatability

from the start : script the non-interactive install of your tools and OS

custom AMI

Debian : great package management

based on Eric Hammond’s work : http://alestic.com/
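A minimal sketch of what a scripted, non-interactive install could look like on a Debian-based image. The package list and function names are hypothetical and not taken from the talk. `DEBIAN_FRONTEND=noninteractive` and `apt-get install --yes` are standard Debian mechanisms for suppressing prompts. The default is a dry run that only returns the command.

```python
import os
import subprocess

# Hypothetical package list; substitute whatever your stack needs.
PACKAGES = ["nginx", "mysql-client"]


def build_install_command(packages):
    """Return the apt-get argv for an unattended install of `packages`."""
    if not packages:
        raise ValueError("no packages given")
    return ["apt-get", "install", "--yes", "--no-install-recommends", *packages]


def build_install_env(base_env=None):
    """Copy the environment and tell debconf not to prompt."""
    env = dict(os.environ if base_env is None else base_env)
    env["DEBIAN_FRONTEND"] = "noninteractive"
    return env


def install(packages, dry_run=True):
    """Run the install, or only return the command when dry_run is set."""
    cmd = build_install_command(packages)
    if not dry_run:
        # Needs root on a Debian-based host; raises if apt-get fails.
        subprocess.run(cmd, env=build_install_env(), check=True)
    return cmd
```

Keeping the command builder separate from the runner means the same script can be reviewed or logged before it touches a machine.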

Page 19: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1 : scripted repeatability

which will allow you to script the setup/tear-down of your stack
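One way to picture scripted setup/tear-down: describe the stack once, in start-up order, and derive both plans from that single list. The tier names and the `execute` callback below are hypothetical placeholders. A real script would call your provisioning tooling instead of printing.

```python
# Hypothetical tiers, listed in the order they must come up.
STACK = ["database", "app-server", "load-balancer"]


def setup_plan(stack):
    """Steps to bring the stack up, dependencies first."""
    return [f"start {tier}" for tier in stack]


def teardown_plan(stack):
    """Steps to take the stack down, in the reverse of start-up order."""
    return [f"stop {tier}" for tier in reversed(stack)]


def run(plan, execute=print):
    """Apply `execute` to each step in order and return the completed steps.

    An exception raised by `execute` propagates and stops the run.
    """
    done = []
    for step in plan:
        execute(step)
        done.append(step)
    return done
```

Because tear-down is derived from the same list, the two plans cannot drift apart as tiers are added.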

Page 20: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1 : scripted repeatability

which will allow you to script system tests

integrity (3-4K tests)

performance (30-40K tests)

load, capacity (2-4M requests)
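A rough sketch of the shape a scripted load test might take, assuming a thread pool that fires requests and tallies outcomes. `run_load` and `fake_request` are illustrative names. `fake_request` stands in for a real HTTP call so the example runs anywhere.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_load(send_request, total_requests, workers=8):
    """Call send_request(i) total_requests times and tally the outcomes."""

    def attempt(i):
        try:
            return "ok" if send_request(i) else "failed"
        except Exception:
            # A raised exception counts as its own outcome category.
            return "error"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(attempt, range(total_requests)))


def fake_request(i):
    """Stand-in for a real request: every 10th call reports failure."""
    return i % 10 != 0
```

Usage: `run_load(fake_request, 100)` returns a `Counter` of outcomes. Swapping in a function that hits your own endpoint gives a repeatable load run.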

Page 21: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1 : scripted repeatability

A/B system test results : MySQL Percona Upgrade
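A hedged sketch of how two scripted test runs could be compared. The function names are hypothetical, and any latency samples fed in are your own measurements. Nothing here reproduces the results shown on the slide.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compare(before_ms, after_ms):
    """Return latency deltas (after minus before); negative means faster."""
    return {
        "median_delta_ms": statistics.median(after_ms) - statistics.median(before_ms),
        "p95_delta_ms": percentile(after_ms, 95) - percentile(before_ms, 95),
    }
```

Running the same scripted suite before and after a change, and diffing the summaries, is what makes an upgrade decision repeatable.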

Page 22: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

That’s how 1 person set up and managed a network comprising 90+/- server instances for 1.5 years

while serving various other roles

without having to leave their chair

try that with real hardware

Page 23: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2

SPOF Elimination

We don’t need no stinkin’ single points of failure.

Page 24: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

SPOF Examples:

Cloud Provider

Region

Zone

Load Balancer

App Server

Database

Fred

Page 25: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

Cloud Provider fail-over?

e.g. AWS -> Rackspace

Page 26: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

Region fail-over?

e.g. us-east -> us-west within AWS

Nah.

Page 27: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

Zone fail-over?

Yes.

US-WEST : zones A, B, C, D

US-EAST : zones A, B, C, D

Page 28: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

Zone fail-over best practices:

are you using auto-scaling?

no : distribute server instances evenly between 2 or more zones

yes : trigger scaling on network I/O or custom metrics
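For the no-auto-scaling case, distributing instances evenly can be as simple as round-robin assignment. This is a sketch only. `assign_zones` is a hypothetical helper, and the instance and zone names passed to it are examples.

```python
from collections import Counter
from itertools import cycle


def assign_zones(instance_names, zones):
    """Assign each instance to a zone round-robin so counts differ by at most one."""
    if len(zones) < 2:
        raise ValueError("need at least 2 zones for zone fail-over")
    # zip stops when the instance list is exhausted; cycle repeats the zones.
    return dict(zip(instance_names, cycle(zones)))


def zone_counts(placement):
    """Count how many instances landed in each zone."""
    return Counter(placement.values())
```

With five instances and two zones the split is 3/2, so losing either zone still leaves capacity running in the other.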

Page 29: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

Load-balancer (ELB), app server, database fail-over?

Yes.
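At its simplest, fail-over for a redundant tier means picking the first candidate that passes a health check. The sketch below assumes a caller-supplied `is_healthy` function. The endpoint names in the usage note are hypothetical, and a managed load balancer such as ELB does this for you at the load-balancer tier.

```python
def pick_healthy(endpoints, is_healthy):
    """Return the first endpoint that passes the health check, in priority order."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")
```

Usage: `pick_healthy(["db-primary", "db-replica"], check)` returns `"db-replica"` whenever `check("db-primary")` is false and `check("db-replica")` is true.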

Page 30: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

So it’s actually all about reduction of the right SPOFs for your business context

Just adding the ability to fail over and have backups within a region is huge!

Probably enough for most.

What about Fred?

Page 31: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #3

Clear-Cut Communication

Page 32: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #3 : Clear-cut Communication

During an outage, communicating the right things at the right time:

hard.

But not that hard.

Page 33: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Three Tenets Revisited

Tenet #1 : Scripted Repeatability

Tenet #2 : SPOF Elimination

Tenet #3 : Clear-Cut Communication

Page 34: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Thank You

Our AWS account rep : "Dylan Peterson" <[email protected]>

(notes attached to this slide)